[00:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166494 (owner: TrainBranchBot)
[00:07:55] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166530
[00:07:55] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166530 (owner: TrainBranchBot)
[00:11:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:43] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1166530 (owner: TrainBranchBot)
[00:51:07] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[01:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[02:25:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:26:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[02:30:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:28:33] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[03:29:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:37:51] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29361 bytes in 4.056 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:42:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:43:51] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29769 bytes in 2.880 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:46:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:47:55] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29771 bytes in 7.267 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:50:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:51:49] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29770 bytes in 0.837 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:11:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:20:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:21:51] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29769 bytes in 3.216 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:24:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:29:55] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29772 bytes in 7.457 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:35:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:44:51] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29507 bytes in 3.630 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:51:07] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[05:04:37] (PS1) KartikMistry: WIP: machinetranslation: Use s3 for model download in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1166543 (https://phabricator.wikimedia.org/T335491)
[05:06:35] (CR) CI reject: [V:-1] WIP: machinetranslation: Use s3 for model download in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1166543 (https://phabricator.wikimedia.org/T335491) (owner: KartikMistry)
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:29] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 2 (gerrit1003, ...), Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:17:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:22:55] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29769 bytes in 6.472 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:25:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:26:55] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29872 bytes in 6.695 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:29:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:30:51] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29771 bytes in 2.899 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:53:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:58:53] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29771 bytes in 4.479 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:02:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:12:49] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29770 bytes in 0.496 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:16:27] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:22:21] (PS2) KartikMistry: WIP: machinetranslation: Use s3 for model download in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1166543 (https://phabricator.wikimedia.org/T335491)
[06:24:17] PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (604888s > 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:28:38] (PS2) Nikerabbit: Remove chararacterEditStatsTranslate [puppet] - https://gerrit.wikimedia.org/r/1164956 (https://phabricator.wikimedia.org/T398171)
[06:29:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[06:29:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[06:43:48] FIRING: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:54:58] (CR) Volans: "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/1164151 (owner: Ayounsi)
[06:56:17] (CR) Marostegui: [C:+2] mariadb: Exclude tmpfs and ramfs from paging disk monitor alerts [puppet] - https://gerrit.wikimedia.org/r/1166377 (https://phabricator.wikimedia.org/T398275) (owner: Jcrespo)
[07:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:04:00] (PS1) Brouberol: mediawiki-dumps-legacy: define the globalusage.dblist file in the dblists configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1166670 (https://phabricator.wikimedia.org/T398788)
[07:04:46] !log testing haproxy 2.8.15 in cp5017 and cp5025 - T398720
[07:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:48] T398720: Upgrade to haproxy 2.8.15 - https://phabricator.wikimedia.org/T398720
[07:06:00] (CR) Brouberol: [C:+2] mediawiki-dumps-legacy: define the globalusage.dblist file in the dblists configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1166670 (https://phabricator.wikimedia.org/T398788) (owner: Brouberol)
[07:10:06] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[07:11:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[07:13:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Primary switchover x1 T397612
[07:13:59] T397612: Switchover x1 master (db1237 -> db1220) - https://phabricator.wikimedia.org/T397612
[07:20:48] (PS13) Vgutierrez: cache,haproxy: Remove http response captures [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T397917)
[07:21:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1220 with weight 0 T397612', diff saved to https://phabricator.wikimedia.org/P78760 and previous config saved to /var/cache/conftool/dbconfig/20250707-072157-root.json
[07:22:01] T397612: Switchover x1 master (db1237 -> db1220) - https://phabricator.wikimedia.org/T397612
[07:22:28] (CR) Giuseppe Lavagetto: [C:+1] cache,haproxy: Remove http response captures [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T397917) (owner: Vgutierrez)
[07:23:06] (CR) Marostegui: [C:+2] mariadb: Promote db1220 to x1 master [puppet] - https://gerrit.wikimedia.org/r/1162851 (https://phabricator.wikimedia.org/T397612) (owner: Gerrit maintenance bot)
[07:24:41] (CR) Vgutierrez: [C:+2] cache,haproxy: Remove http response captures [puppet] - https://gerrit.wikimedia.org/r/1166167 (https://phabricator.wikimedia.org/T397917) (owner: Vgutierrez)
[07:24:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:25:20] !log Starting x1 eqiad failover from db1237 to db1220 - T397612
[07:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:42] !log depooling cp7006 to test Ia82b9354a5b9e7bd5443b4af0888325919ddb19e - T397917
[07:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:47] T397917: Append requestctl rule name to X-Analytics header in HAProxy - https://phabricator.wikimedia.org/T397917
[07:25:49] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29769 bytes in 1.375 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:28:33] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[07:29:02] (CR) Gmodena: [C:+1] Revert^2 "Clean up EventBus and jobs config" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1165169 (owner: Ladsgroup)
[07:37:17] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:37:47] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:40:23] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:41:21] ^ we are on that
[07:42:47] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:45:23] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:47:17] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[07:47:37] (CR) Volans: New structure for sshd_config starting with trixie (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff)
[07:50:00] (CR) Marostegui: [C:+2] wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/1162852 (https://phabricator.wikimedia.org/T397612) (owner: Gerrit maintenance bot)
[07:50:05] (PS2) Gerrit maintenance bot: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/1162852 (https://phabricator.wikimedia.org/T397612)
[07:50:15] (CR) Marostegui: [V:+2 C:+2] wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/1162852 (https://phabricator.wikimedia.org/T397612) (owner: Gerrit maintenance bot)
[07:50:21] !log marostegui@dns1006 START - running authdns-update
[07:50:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:51:26] !log marostegui@dns1006 END - running authdns-update
[07:52:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1220 to x1 primary and set section read-write T397612', diff saved to https://phabricator.wikimedia.org/P78762 and previous config saved to /var/cache/conftool/dbconfig/20250707-075254-root.json
[07:53:00] T397612: Switchover x1 master (db1237 -> db1220) - https://phabricator.wikimedia.org/T397612
[07:53:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1237 T397612', diff saved to https://phabricator.wikimedia.org/P78763 and previous config saved to /var/cache/conftool/dbconfig/20250707-075308-root.json
[07:53:15] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:53:23] !log repooling cp7006 with Ia82b9354a5b9e7bd5443b4af0888325919ddb19e applied - T397917
[07:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:25] T397917: Append requestctl rule name to X-Analytics header in HAProxy - https://phabricator.wikimedia.org/T397917
[07:58:15] RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:58:55] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29781 bytes in 7.826 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:59:10] (PS1) Giuseppe Lavagetto: Code changes: [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1166749
[07:59:28] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Code changes: [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1166749 (owner: Giuseppe Lavagetto)
[08:00:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1237.eqiad.wmnet with reason: Maintenance
[08:00:56] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Feature: logging of deny actions; add rename functionality - oblivian@cumin1003"
[08:00:58] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Feature: logging of deny actions; add rename functionality - oblivian@cumin1003
[08:01:30] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Feature: logging of deny actions; add rename functionality - oblivian@cumin1003
[08:01:31] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Feature: logging of deny actions; add rename functionality - oblivian@cumin1003"
[08:02:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:05:12] (PS1) Vgutierrez: cache::haproxy: Use a separate site for port 80 [puppet] - https://gerrit.wikimedia.org/r/1166751
[08:09:49] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29855 bytes in 0.721 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:11:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:12:21] (CR) Ayounsi: reimage: temporarily store the MAC in Netbox (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1164151 (owner: Ayounsi)
[08:14:01] (CR) Muehlenhoff: New structure for sshd_config starting with trixie (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff)
[08:14:10] (PS9) Muehlenhoff: New structure for sshd_config starting with trixie [puppet] - https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762)
[08:15:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1237.eqiad.wmnet with reason: Maintenance
[08:17:31] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1166751 (owner: Vgutierrez)
[08:17:48] (CR) Volans: reimage: temporarily store the MAC in Netbox (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1164151 (owner: Ayounsi)
[08:19:17] ops-eqiad, DBA, DC-Ops: db1237 is not booting up - https://phabricator.wikimedia.org/T398794 (Marostegui) NEW
[08:19:24] ops-eqiad, DBA, DC-Ops: db1237 is not booting up - https://phabricator.wikimedia.org/T398794#10977707 (Marostegui) p:Triage→Medium
[08:28:58] (CR) Muehlenhoff: [C:+2] Remove puppetserver2003 from active Puppet servers [puppet] - https://gerrit.wikimedia.org/r/1166160 (https://phabricator.wikimedia.org/T398607) (owner: Muehlenhoff)
[08:32:55] (PS1) Klausman: hiera/deployment-server: change name of MT AWS user [labs/private] - https://gerrit.wikimedia.org/r/1166754 (https://phabricator.wikimedia.org/T335491)
[08:33:07] (Abandoned) Michael Große: tests: skip test to allow updating CommunityConfigurationExample [extensions/CommunityConfiguration] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166226 (https://phabricator.wikimedia.org/T398624) (owner: Michael Große)
[08:34:23] (CR) Federico Ceratto: [C:+2] zarcillo: Update egress to idp.wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/1166227 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[08:34:27] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1166751 (owner: Vgutierrez)
[08:35:20] (CR) Cathal Mooney: [C:+2] Switch BGP: Automate & unify IBGP configs on switches [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[08:35:53] (Merged) jenkins-bot: Switch BGP: Automate & unify IBGP configs on switches [homer/public] - https://gerrit.wikimedia.org/r/1154319 (https://phabricator.wikimedia.org/T394530) (owner: Cathal Mooney)
[08:36:14] (Merged) jenkins-bot: zarcillo: Update egress to idp.wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/1166227 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[08:40:36] (CR) Elukey: [C:+1] hiera/thanos-swift: Fix MinT user [puppet] - https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: Klausman)
[08:42:44] (CR) Elukey: [C:+1] hiera/deployment-server: change name of MT AWS user [labs/private] - https://gerrit.wikimedia.org/r/1166754 (https://phabricator.wikimedia.org/T335491) (owner: Klausman)
[08:43:11] (CR) Klausman: [V:+2 C:+2] hiera/deployment-server: change name of MT AWS user [labs/private] - https://gerrit.wikimedia.org/r/1166754 (https://phabricator.wikimedia.org/T335491) (owner: Klausman)
[08:44:18] (CR) Muehlenhoff: [C:+2] Remove puppetserver role frm puppetserver2003 for decom [puppet] - https://gerrit.wikimedia.org/r/1166161 (https://phabricator.wikimedia.org/T398607) (owner: Muehlenhoff)
[08:46:08] (PS2) Majavah: hieradata: Enable hourly logrotate in all cloudgws [puppet] - https://gerrit.wikimedia.org/r/1166146 (https://phabricator.wikimedia.org/T273734)
[08:46:25] FIRING: [3x] SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:47:29] (PS1) Muehlenhoff: Fix site.pp entry for puppetserver2003 [puppet] - https://gerrit.wikimedia.org/r/1166758
[08:47:34] (CR) Majavah: [C:+2] hieradata: Enable hourly logrotate in all cloudgws [puppet] - https://gerrit.wikimedia.org/r/1166146 (https://phabricator.wikimedia.org/T273734) (owner: Majavah)
[08:48:27] (CR) Muehlenhoff: [C:+2] Fix site.pp entry for puppetserver2003 [puppet] - https://gerrit.wikimedia.org/r/1166758 (owner: Muehlenhoff)
[08:51:07] FIRING: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[08:53:47] (PS11) Fabfur: cache: install benthos on all cp hosts [puppet] - https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332)
[08:53:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:55:54] (CR) Klausman: [V:+2 C:+2] hiera/thanos-swift: Fix MinT user [puppet] - https://gerrit.wikimedia.org/r/1164235 (https://phabricator.wikimedia.org/T335491) (owner: Klausman)
[08:56:25] FIRING: [3x] SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:59:31] (PS1) Marostegui: dbproxy1023,dbproxy1025: Test db1228 [puppet] - https://gerrit.wikimedia.org/r/1166759 (https://phabricator.wikimedia.org/T397633)
[09:00:44] (CR) Marostegui: [C:+2] dbproxy1023,dbproxy1025: Test db1228 [puppet] - https://gerrit.wikimedia.org/r/1166759 (https://phabricator.wikimedia.org/T397633) (owner: Marostegui)
[09:02:36] (PS1) Marostegui: Revert "dbproxy1023,dbproxy1025: Test db1228" [puppet] - https://gerrit.wikimedia.org/r/1166760
[09:02:42] (CR) Marostegui: [C:-2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/1166760 (owner: Marostegui)
[09:03:39] (CR) Marostegui: Revert "dbproxy1023,dbproxy1025: Test db1228" [puppet] - https://gerrit.wikimedia.org/r/1166760 (owner: Marostegui)
[09:03:42] (CR) Marostegui: [C:+2] Revert "dbproxy1023,dbproxy1025: Test db1228" [puppet] - https://gerrit.wikimedia.org/r/1166760 (owner: Marostegui)
[09:04:23] (PS1) Majavah: toolforge: Simplify wmcs-wheel-of-misfortune [puppet] - https://gerrit.wikimedia.org/r/1166762
[09:04:35] (PS1) Alexandros Kosiaris: Kubernetes: Switch MTU for all clusters to 1460 [puppet] - https://gerrit.wikimedia.org/r/1166763 (https://phabricator.wikimedia.org/T352956)
[09:05:16] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10977972 (dcaro) >>! In T394333#10973902, @ayounsi wrote: > There is currently only one switch per rack, so I suggest we only use one uplink for now, and...
[09:05:41] (PS12) Fabfur: cache: install benthos on all cp hosts [puppet] - https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332)
[09:06:42] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2233].codfw.wmnet,db[1217,1228,1250].eqiad.wmnet with reason: maintenance
[09:06:46] (CR) CI reject: [V:-1] toolforge: Simplify wmcs-wheel-of-misfortune [puppet] - https://gerrit.wikimedia.org/r/1166762 (owner: Majavah)
[09:07:17] (CR) Alexandros Kosiaris: [C:+2] Kubernetes: Switch MTU for all clusters to 1460 [puppet] - https://gerrit.wikimedia.org/r/1166763 (https://phabricator.wikimedia.org/T352956) (owner: Alexandros Kosiaris)
[09:07:37] (CR) Alexandros Kosiaris: [C:+2] "After all the other clusters have functioned fine for quite a while, we can switch the default." [puppet] - https://gerrit.wikimedia.org/r/1166763 (https://phabricator.wikimedia.org/T352956) (owner: Alexandros Kosiaris)
[09:07:42] (PS2) Majavah: toolforge: Simplify wmcs-wheel-of-misfortune [puppet] - https://gerrit.wikimedia.org/r/1166762
[09:09:14] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[09:09:34] (PS1) Marostegui: mariadb: Promote db1228 to m2 master [puppet] - https://gerrit.wikimedia.org/r/1166764 (https://phabricator.wikimedia.org/T397633)
[09:09:53] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[09:10:50] (CR) Marostegui: [C:+2] mariadb: Promote db1228 to m2 master [puppet] - https://gerrit.wikimedia.org/r/1166764 (https://phabricator.wikimedia.org/T397633) (owner: Marostegui)
[09:12:35] (CR) Hnowlan: [C:+2] Remove chararacterEditStatsTranslate [puppet] - https://gerrit.wikimedia.org/r/1164956 (https://phabricator.wikimedia.org/T398171) (owner: Nikerabbit)
[09:13:52] !log Failover m2 from db1250 to db1228 - T397633
[09:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:55] T397633: Switchover m2 master db1250 -> db1228 - https://phabricator.wikimedia.org/T397633
[09:18:42] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm
[09:21:37] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1250.eqiad.wmnet with reason: Maintenance
[09:23:54] (PS1) Marostegui: db1250: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1166768 (https://phabricator.wikimedia.org/T397602)
[09:24:53] (CR) Marostegui: [C:+2] db1250: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1166768 (https://phabricator.wikimedia.org/T397602) (owner: Marostegui)
[09:25:04] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[09:26:55] (Abandoned) Fabfur: cache: allow logging of x-cache-status also for silent-dropped reqs [puppet] - https://gerrit.wikimedia.org/r/1136761 (https://phabricator.wikimedia.org/T391967) (owner: Fabfur)
[09:28:39] (CR) Fabfur: [C:+2] cache,haproxy: remove old ipblock map files [puppet] - https://gerrit.wikimedia.org/r/1159461 (https://phabricator.wikimedia.org/T396621) (owner: Fabfur)
[09:29:04] SRE, SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10978069 (Ladsgroup) Open→Resolved a:Ladsgroup I was about to do one but it turned out it automatically got optimized...
[09:34:15] (PS1) Vgutierrez: cache::haproxy: Fix requestctl= sanitization [puppet] - https://gerrit.wikimedia.org/r/1166775 (https://phabricator.wikimedia.org/T397917)
[09:38:12] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-staging-ctrl_6443: Servers ml-staging-ctrl2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:38:50] (CR) Jelto: [C:+1] "looks good to me, it seems all the logic has been moved to `"sre.gerrit.topology-check` cookbook." [cookbooks] - https://gerrit.wikimedia.org/r/1165880 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[09:39:12] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:40:12] (CR) Arnaudb: [C:+2] gerrit: sanity checks as a cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1165544 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[09:40:26] (CR) Arnaudb: [C:+2] gerrit: sanity checks cookbook implementation [cookbooks] - https://gerrit.wikimedia.org/r/1165880 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[09:40:36] (PS1) Marostegui: dbconfig.schema: Add x1 [puppet] - https://gerrit.wikimedia.org/r/1166776
[09:42:21] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm
[09:43:53] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[09:45:19] (CR) Ladsgroup: [C:+1] dbconfig.schema: Add x1 [puppet] - https://gerrit.wikimedia.org/r/1166776 (owner: Marostegui)
[09:45:49] (CR) Marostegui: "Why those specific tables and not all?" [puppet] - https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: Ladsgroup)
[09:46:34] (CR) Marostegui: [C:+1] Use table catalog for fullViews [puppet] - https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: Ladsgroup)
[09:46:56] (CR) Marostegui: [C:+2] dbconfig.schema: Add x1 [puppet] - https://gerrit.wikimedia.org/r/1166776 (owner: Marostegui)
[09:47:15] (Merged) jenkins-bot: gerrit: sanity checks cookbook implementation [cookbooks] - https://gerrit.wikimedia.org/r/1165880 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[09:49:42] (CR) Ladsgroup: "Very good question: The answer is that because none of the remaining ones are cataloged (not even cataloged but under different visibility" [puppet] - https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: Ladsgroup)
[09:52:10] (CR) Volans: [C:-1] "Suggested way to test it as requested on IRC. There is also a small bug in the implementation." [software/spicerack] - https://gerrit.wikimedia.org/r/1166403 (owner: Ayounsi)
[09:52:59] (CR) Fabfur: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1166775 (https://phabricator.wikimedia.org/T397917) (owner: Vgutierrez)
[09:53:09] (CR) Marostegui: [C:+1] "That's fine, I just wanted to make sure there was a reason not to leave all those behind."
[puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [09:58:09] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm [09:58:41] (03CR) 10Ayounsi: Netbox: expose the switches a server is connected to (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166403 (owner: 10Ayounsi) [09:59:18] (03PS4) 10Ayounsi: Netbox: expose the switches a server is connected to [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166403 [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T1000) [10:00:23] (03CR) 10Giuseppe Lavagetto: "Overall lgtm, with a minor comment." [puppet] - 10https://gerrit.wikimedia.org/r/1166751 (owner: 10Vgutierrez) [10:00:40] (03CR) 10Ayounsi: Netbox: expose the switches a server is connected to (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166403 (owner: 10Ayounsi) [10:01:07] (03CR) 10Giuseppe Lavagetto: cache::haproxy: Use a separate site for port 80 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166751 (owner: 10Vgutierrez) [10:02:50] (03PS1) 10Marostegui: db2234: Migrate to mariadb 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1166780 (https://phabricator.wikimedia.org/T398805) [10:06:32] (03CR) 10Vgutierrez: "that's a nice catch, we got several issues here actually:" [puppet] - 10https://gerrit.wikimedia.org/r/1166751 (owner: 10Vgutierrez) [10:07:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:08:00] (03CR) 10CI reject: [V:04-1] Netbox: expose the switches a server is connected to [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1166403 (owner: 10Ayounsi) [10:08:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:09:21] !log root@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe [10:09:27] (03PS2) 10Vgutierrez: cache::haproxy: Use a separate site for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1166751 [10:10:36] (03PS1) 10Hamish: mrwiki: Correct draft namespace spelling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166783 (https://phabricator.wikimedia.org/T398792) [10:11:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:13:16] !log root@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe [10:13:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:16:23] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10978348 (10Clement_Goubert) a:05Clement_Goubertβ†’03None [10:17:15] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm [10:18:40] (03PS3) 10Vgutierrez: 
cache::haproxy: Use a separate site for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1166751 [10:19:05] (03CR) 10Vgutierrez: cache::haproxy: Use a separate site for port 80 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166751 (owner: 10Vgutierrez) [10:19:39] (03PS3) 10Cathal Mooney: Use VC status to derive l3_switch variable and remove from YAML [homer/public] - 10https://gerrit.wikimedia.org/r/1166390 [10:21:49] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166751 (owner: 10Vgutierrez) [10:22:25] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:22:34] (03CR) 10Vgutierrez: cache::haproxy: Use a separate site for port 80 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166751 (owner: 10Vgutierrez) [10:22:43] (03CR) 10Vgutierrez: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1166751 (owner: 10Vgutierrez) [10:24:08] !log remove swift-account-stats_machinetranslation:prod time & service from thanos-fe1004 T335491 [10:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:10] T335491: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 [10:26:15] (03CR) 10Hnowlan: [C:03+1] changeprop: Add header with event timestamp for PCS requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160753 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos) [10:26:26] (03PS5) 10Ayounsi: Netbox: expose the switches a server is connected to [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166403 [10:26:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:27:09] (03CR) 10Fabfur: [C:03+1] cache::haproxy: Use a separate site for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1166751 (owner: 10Vgutierrez) [10:27:25] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:27:58] (03PS1) 10Majavah: hieradata: Enable NAT logging on cloudgw1004 [puppet] - 10https://gerrit.wikimedia.org/r/1166787 (https://phabricator.wikimedia.org/T273734) [10:28:48] (03PS13) 10Fabfur: cache: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) [10:29:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: VC link from asw2-c4-eqiad to asw2-c7-eqiad flapping - https://phabricator.wikimedia.org/T398612#10978400 (10cmooney) 05Openβ†’03Resolved Link remains stable, closing task. 
[10:29:48] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6146/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166787 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah)
[10:30:15] (03CR) 10Volans: [C:03+1] "LGTM, thx" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164473 (owner: 10Ayounsi)
[10:30:19] (03CR) 10Fabfur: [C:04-1] "not a problem per se but I think there's already I3f6cd56d6058262cae27aa1a7523836cb0a6965e for this (I integrated your change in the lates" [puppet] - 10https://gerrit.wikimedia.org/r/1165582 (https://phabricator.wikimedia.org/T329332) (owner: 10CDanis)
[10:30:57] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm
[10:33:00] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166371 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey)
[10:34:27] (03CR) 10Cathal Mooney: [C:03+2] Use VC status to derive l3_switch variable and remove from YAML [homer/public] - 10https://gerrit.wikimedia.org/r/1166390 (owner: 10Cathal Mooney)
[10:34:34] (03PS1) 10Majavah: P:wmcs::metricsinfra: Hide 'logger' receiver from Karma [puppet] - 10https://gerrit.wikimedia.org/r/1166788
[10:34:58] (03Merged) 10jenkins-bot: Use VC status to derive l3_switch variable and remove from YAML [homer/public] - 10https://gerrit.wikimedia.org/r/1166390 (owner: 10Cathal Mooney)
[10:35:58] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS bookworm
[10:37:27] (03PS1) 10Hnowlan: api-gateway: use ratelimit's inbuilt promethus-statsd agent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166790 (https://phabricator.wikimedia.org/T388804)
[10:39:39] (03PS3) 10Hnowlan: ratelimit: bump version number [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1166412 (https://phabricator.wikimedia.org/T388804)
[10:40:34] (03PS2) 10Cathal Mooney: Use IP address not hostname for syslog dest on L2 switches [homer/public] - 10https://gerrit.wikimedia.org/r/1166391 (https://phabricator.wikimedia.org/T398690)
[10:41:31] (03PS1) 10Tchanders: temp accounts: Separate digits in user names with hyphens [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166791 (https://phabricator.wikimedia.org/T381845)
[10:42:25] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:42:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:44:11] (03CR) 10Marostegui: [C:03+2] db2234: Migrate to mariadb 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1166780 (https://phabricator.wikimedia.org/T398805) (owner: 10Marostegui)
[10:45:00] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2234].codfw.wmnet with reason: Maintenance
[10:45:15] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS bookworm
[10:48:39] (03CR) 10Elukey: [C:03+2] redfish: add support for iDRAC 10 to force_http_boot_once [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166371 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey)
[10:49:03] (03CR) 10Elukey: [C:03+2] Add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[10:50:46] (03PS1) 10Majavah: P:toolforge::prometheus: Add source_labels to severity relabel rule [puppet] - 10https://gerrit.wikimedia.org/r/1166793 (https://phabricator.wikimedia.org/T396038)
[10:53:15] (03CR) 10Clément Goubert: [C:03+1] ratelimit: bump version number [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1166412 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan)
[10:53:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1217.eqiad.wmnet with reason: Maintenance
[10:56:27] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166403 (owner: 10Ayounsi)
[10:56:54] (03PS2) 10Majavah: P:wmcs::metricsinfra: Hide 'logger' receiver from Karma [puppet] - 10https://gerrit.wikimedia.org/r/1166788 (https://phabricator.wikimedia.org/T398812)
[10:57:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:58:29] (03Merged) 10jenkins-bot: redfish: add support for iDRAC 10 to force_http_boot_once [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166371 (https://phabricator.wikimedia.org/T393044) (owner: 10Elukey)
[10:58:31] (03Merged) 10jenkins-bot: Add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[11:01:40] PROBLEM - haproxy failover on dbproxy1028 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:01:44] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[11:02:34] !log installing modsecurity-apache security updates
[11:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:10] haproxy alerts are expected
[11:05:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1250.eqiad.wmnet with reason: Maintenance
[11:06:22] !log root@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe
[11:08:58] (03PS1) 10Btullis: Stop sqooping the ipblocks table, since it no longer exists. [puppet] - 10https://gerrit.wikimedia.org/r/1166796 (https://phabricator.wikimedia.org/T398602)
[11:10:00] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6147/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166796 (https://phabricator.wikimedia.org/T398602) (owner: 10Btullis)
[11:10:16] !log root@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe
[11:10:26] (03CR) 10Btullis: Stop sqooping the ipblocks table, since it no longer exists. [puppet] - 10https://gerrit.wikimedia.org/r/1166796 (https://phabricator.wikimedia.org/T398602) (owner: 10Btullis)
[11:10:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:10:47] (03PS1) 10Marostegui: mariadb: Add db1259 to puppet [puppet] - 10https://gerrit.wikimedia.org/r/1166797 (https://phabricator.wikimedia.org/T393296)
[11:13:09] (03CR) 10Marostegui: [C:03+2] mariadb: Add db1259 to puppet [puppet] - 10https://gerrit.wikimedia.org/r/1166797 (https://phabricator.wikimedia.org/T393296) (owner: 10Marostegui)
[11:13:33] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 13Patch-For-Review: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10978517 (10Marostegui) @VRiley-WMF I've created the puppet patches for db1259.
[11:15:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:17:28] (03PS1) 10Giuseppe Lavagetto: cache-text: start removing static rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/1166798 (https://phabricator.wikimedia.org/T398668)
[11:19:54] (03CR) 10Cathal Mooney: [C:03+2] Use IP address not hostname for syslog dest on L2 switches [homer/public] - 10https://gerrit.wikimedia.org/r/1166391 (https://phabricator.wikimedia.org/T398690) (owner: 10Cathal Mooney)
[11:21:16] (03Merged) 10jenkins-bot: Use IP address not hostname for syslog dest on L2 switches [homer/public] - 10https://gerrit.wikimedia.org/r/1166391 (https://phabricator.wikimedia.org/T398690) (owner: 10Cathal Mooney)
[11:23:21] jouncebot: nowandnext
[11:23:21] No deployments scheduled for the next 1 hour(s) and 36 minute(s)
[11:23:21] In 1 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T1300)
[11:23:26] wow
[11:23:35] (03CR) 10Ladsgroup: [C:03+2] Revert^2 "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169 (owner: 10Ladsgroup)
[11:24:29] (03Merged) 10jenkins-bot: Revert^2 "Clean up EventBus and jobs config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1165169 (owner: 10Ladsgroup)
[11:25:07] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1165169|Revert^2 "Clean up EventBus and jobs config"]]
[11:28:33] FIRING: [3x] GnmiTargetDown: lsw1-d3-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[11:29:58] (03CR) 10Ayounsi: [C:03+2] Netbox: expose the switches a server is connected to [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166403 (owner: 10Ayounsi)
[11:30:21] (03CR) 10Ayounsi: [C:03+2] Redfish: more tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164473 (owner: 10Ayounsi)
[11:39:45] (03PS1) 10Elukey: Release version 0.0.16 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1166802 (https://phabricator.wikimedia.org/T397696)
[11:40:23] (03Merged) 10jenkins-bot: Netbox: expose the switches a server is connected to [software/spicerack] - 10https://gerrit.wikimedia.org/r/1166403 (owner: 10Ayounsi)
[11:40:23] (03CR) 10CI reject: [V:04-1] Redfish: more tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164473 (owner: 10Ayounsi)
[11:41:02] (03CR) 10Elukey: Release version 0.0.16 (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1166802 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[11:42:20] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS trixie
[11:42:40] RECOVERY - haproxy failover on dbproxy1028 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:42:44] RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[11:46:21] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1165169|Revert^2 "Clean up EventBus and jobs config"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[11:47:23] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[11:51:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh6001.wikimedia.org to drbd
[11:53:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:54:18] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:54:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db2146 T398433', diff saved to https://phabricator.wikimedia.org/P78771 and previous config saved to /var/cache/conftool/dbconfig/20250707-115457-ladsgroup.json
[11:55:00] T398433: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433
[11:56:22] !log elukey@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2006.codfw.wmnet with OS trixie
[11:56:25] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1166802 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[11:57:42] (03PS1) 10Ladsgroup: Use dblist for wikilove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166804
[11:58:27] (03CR) 10CI reject: [V:04-1] Use dblist for wikilove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166804 (owner: 10Ladsgroup)
[11:58:42] FIRING: JobUnavailable: Reduced availability for job wikidough in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:59:02] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2146.codfw.wmnet with reason: Just in case (T398433)
[11:59:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh6001.wikimedia.org to drbd
[11:59:36] PROBLEM - Host doh6001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:59:48] RECOVERY - Host doh6001 is UP: PING OK - Packet loss = 0%, RTA = 87.51 ms
[12:00:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum6001.drmrs.wmnet to drbd
[12:00:13] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1165169|Revert^2 "Clean up EventBus and jobs config"]] (duration: 35m 06s)
[12:01:26] (03PS2) 10Ladsgroup: Use dblist for wikilove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166804
[12:02:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.201s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:02:18] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:02:18] !log akosiaris@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2042.codfw.wmnet
[12:02:54] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2042.codfw.wmnet
[12:03:05] !log akosiaris@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2046.codfw.wmnet
[12:03:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:03:41] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2046.codfw.wmnet
[12:03:42] RESOLVED: JobUnavailable: Reduced availability for job wikidough in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:04:07] (03CR) 10Ladsgroup: [C:03+2] Use dblist for wikilove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166804 (owner: 10Ladsgroup)
[12:04:47] !log reboot lsw1-a8-codfw - T398433
[12:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:50] T398433: lsw1-a8-codfw: fpc0 PFE Statistics received unknown trigger (type Semaphore, id 0) - https://phabricator.wikimedia.org/T398433
[12:05:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166804 (owner: 10Ladsgroup)
[12:05:13] System going down in 1 minute
[12:05:18] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:05:55] (03Merged) 10jenkins-bot: Use dblist for wikilove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166804 (owner: 10Ladsgroup)
[12:06:10] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1166804|Use dblist for wikilove]]
[12:06:58] PROBLEM - Host lsw1-a8-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[12:07:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.01s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:08:04] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1166804|Use dblist for wikilove]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:08:08] 06SRE, 06Infrastructure-Foundations, 10netops: DNS resolution not working on Juniper virtual-chassis switches eqiad - https://phabricator.wikimedia.org/T398690#10978636 (10cmooney) 05Open→03Declined Gonna close this one for now, we only have a small number of these switches left and we are planning t...
[12:08:10] PROBLEM - Host lsw1-a8-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:08:10] PROBLEM - Host lsw1-a8-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[12:08:16] (03CR) 10Dreamy Jazz: "Local wiki consensus wants admins to no longer have the ability to grant access." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer)
[12:09:02] PROBLEM - BFD status on ssw1-a1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:09:02] PROBLEM - BFD status on ssw1-a8-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:09:18] RECOVERY - Host lsw1-a8-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.71 ms
[12:09:39] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and lsw1-a8-codfw (10.192.252.10) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:09:51] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a1-codfw:et-0/0/7 (Core: lsw1-a8-codfw:et-0/0/55 {#230403800025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[12:10:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum6001.drmrs.wmnet to drbd
[12:10:30] PROBLEM - Host durum6001 is DOWN: PING CRITICAL - Packet loss = 100%
[12:10:44] RECOVERY - Host lsw1-a8-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.99 ms
[12:10:48] RECOVERY - Host durum6001 is UP: PING OK - Packet loss = 0%, RTA = 87.53 ms
[12:10:54] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[12:11:02] RECOVERY - BFD status on ssw1-a8-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:11:02] RECOVERY - BFD status on ssw1-a1-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:11:42] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1189 - https://phabricator.wikimedia.org/T398773#10978657 (10Jclark-ctr) a:03Jclark-ctr Dell Ticket opened SR212445188
[12:12:00] PROBLEM - Bird Internet Routing Daemon on durum6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[12:12:16] (03PS1) 10Ladsgroup: Revert "Increase max db connection count before circuit breaking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166808
[12:12:58] (03PS2) 10Ladsgroup: Revert "Increase max db connection count before circuit breaking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166808 (https://phabricator.wikimedia.org/T398692)
[12:13:12] RECOVERY - Host lsw1-a8-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.72 ms
[12:13:45] FIRING: Emergency syslog message: Alert for device lsw1-a8-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[12:14:00] RECOVERY - Bird Internet Routing Daemon on durum6001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[12:14:18] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:14:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and lsw1-a8-codfw (10.192.252.10) - group EVPN_IBGP - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:14:51] RESOLVED: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-a1-codfw:et-0/0/7 (Core: lsw1-a8-codfw:et-0/0/55 {#230403800025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[12:15:14] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2006.codfw.wmnet with OS trixie
[12:17:41] (03CR) 10Brouberol: [C:03+1] Stop sqooping the ipblocks table, since it no longer exists. [puppet] - 10https://gerrit.wikimedia.org/r/1166796 (https://phabricator.wikimedia.org/T398602) (owner: 10Btullis)
[12:18:08] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://phabricator.wikimedia.org/T398315#10978694 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Replaced Cable and SFP-t looks to have link
[12:18:38] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166804|Use dblist for wikilove]] (duration: 12m 28s)
[12:18:45] RESOLVED: Emergency syslog message: Device lsw1-a8-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[12:18:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166808 (https://phabricator.wikimedia.org/T398692) (owner: 10Ladsgroup)
[12:19:42] (03Merged) 10jenkins-bot: Revert "Increase max db connection count before circuit breaking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166808 (https://phabricator.wikimedia.org/T398692) (owner: 10Ladsgroup)
[12:19:55] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1166808|Revert "Increase max db connection count before circuit breaking" (T398692)]]
[12:19:58] T398692: Investigate circuit breaker for s4 incident - https://phabricator.wikimedia.org/T398692
[12:20:18] (03CR) 10Dreamy Jazz: [C:04-1] ukwiki: allow bureaucrats to assign and remove temporary-account-viewer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer)
[12:21:47] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1166808|Revert "Increase max db connection count before circuit breaking" (T398692)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:22:39] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[12:22:40] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166812
[12:25:36] PROBLEM - Host wikikube-worker1069 is DOWN: PING CRITICAL - Packet loss = 100%
[12:25:52] RESOLVED: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors
[12:28:09] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166808|Revert "Increase max db connection count before circuit breaking" (T398692)]] (duration: 08m 13s)
[12:28:12] T398692: Investigate circuit breaker for s4 incident - https://phabricator.wikimedia.org/T398692
[12:28:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164156 (https://phabricator.wikimedia.org/T357475) (owner: 10Jforrester)
[12:30:19] (03Merged) 10jenkins-bot: Drop ability to use VueTest on a wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164156 (https://phabricator.wikimedia.org/T357475) (owner: 10Jforrester)
[12:30:33] FIRING: KubernetesCalicoDown: wikikube-worker1069.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1069.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:30:33] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1164156|Drop ability to use VueTest on a wiki (T357475)]]
[12:30:36] T357475: Decommission the VueTest Extension - 
[12:22:39] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [12:22:40] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166812 [12:25:36] PROBLEM - Host wikikube-worker1069 is DOWN: PING CRITICAL - Packet loss = 100% [12:25:52] RESOLVED: InboundInterfaceErrors: Inbound errors on interface fasw2-c1a-eqiad:ge-0/0/11 (frmon1002) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=fasw2-c1a-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DInboundInterfaceErrors [12:28:09] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166808|Revert "Increase max db connection count before circuit breaking" (T398692)]] (duration: 08m 13s) [12:28:12] T398692: Investigate circuit breaker for s4 incident - https://phabricator.wikimedia.org/T398692 [12:28:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164156 (https://phabricator.wikimedia.org/T357475) (owner: 10Jforrester) [12:30:19] (03Merged) 10jenkins-bot: Drop ability to use VueTest on a wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164156 (https://phabricator.wikimedia.org/T357475) (owner: 10Jforrester) [12:30:33] FIRING: KubernetesCalicoDown: wikikube-worker1069.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1069.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:30:33] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1164156|Drop ability to use VueTest on a wiki (T357475)]] [12:30:36] T357475: Decommission the VueTest Extension - 
https://phabricator.wikimedia.org/T357475 [12:31:01] Amir1, akosiaris, reboot done, all is clear for a repool. Thanks for your help [12:31:14] awesome [12:32:07] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db2146* gradually with 4 steps - Work done [12:33:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.768s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:33:23] (03PS1) 10Zabe: Set categorylinks to read new in medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166816 (https://phabricator.wikimedia.org/T397912) [12:34:47] !log akosiaris@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2046.codfw.wmnet [12:34:49] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2046.codfw.wmnet [12:35:25] !log akosiaris@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2042.codfw.wmnet [12:35:27] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2042.codfw.wmnet [12:37:22] (03CR) 10Arnaudb: [C:03+2] ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [12:38:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.768s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:39:05] (03PS1) 10Marostegui: db1250: Move to m3 [puppet] - 10https://gerrit.wikimedia.org/r/1166817 (https://phabricator.wikimedia.org/T398805) [12:40:15] (03CR) 10Marostegui: [C:03+2] db1250: Move to m3 [puppet] - 10https://gerrit.wikimedia.org/r/1166817 (https://phabricator.wikimedia.org/T398805) (owner: 10Marostegui) [12:42:01] (03CR) 10Arnaudb: "typo preventing to merge with the current hardcoded values" [puppet] - 10https://gerrit.wikimedia.org/r/1165832 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [12:45:28] (03PS7) 10Ayounsi: Redfish: more tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164473 [12:46:29] (03PS1) 10Ladsgroup: tables-catalog: Catalog wikilove_log [puppet] - 10https://gerrit.wikimedia.org/r/1166818 (https://phabricator.wikimedia.org/T363581) [12:46:59] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephmon2005-dev.codfw.wmnet with OS bookworm [12:47:49] (03PS1) 10Elukey: profile::docker::reporter: move to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) [12:48:33] FIRING: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:48:54] (03CR) 10Btullis: [C:03+2] Stop sqooping the ipblocks table, since it no longer exists. 
[puppet] - 10https://gerrit.wikimedia.org/r/1166796 (https://phabricator.wikimedia.org/T398602) (owner: 10Btullis) [12:49:36] RESOLVED: [2x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:50:25] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6148/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166819 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [12:50:44] (03PS2) 10Ladsgroup: tables-catalog: Catalog wikilove_log [puppet] - 10https://gerrit.wikimedia.org/r/1166818 (https://phabricator.wikimedia.org/T363581) [12:50:50] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Catalog wikilove_log [puppet] - 10https://gerrit.wikimedia.org/r/1166818 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [12:52:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166783 (https://phabricator.wikimedia.org/T398792) (owner: 10Hamish) [12:54:13] !log ladsgroup@deploy1003 ladsgroup, jforrester: Backport for [[gerrit:1164156|Drop ability to use VueTest on a wiki (T357475)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:54:16] T357475: Decommission the VueTest Extension - https://phabricator.wikimedia.org/T357475 [12:55:01] !log ladsgroup@deploy1003 ladsgroup, jforrester: Continuing with sync [12:55:37] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164151 (owner: 10Ayounsi) [12:56:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:57:54] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.topology-check Validate Gerrit topology (source=gerrit1003, replica=gerrit2002) [12:57:58] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gerrit.topology-check (exit_code=0) Validate Gerrit topology (source=gerrit1003, replica=gerrit2002) [12:57:59] (03PS1) 10Jgreen: nsca_frack.cfg.erb add hostgroup fundraising-pay-lb and two new metrics [puppet] - 10https://gerrit.wikimedia.org/r/1166820 (https://phabricator.wikimedia.org/T398321) [12:58:37] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.topology-check Validate Gerrit topology (source=gerrit1003, replica=gerrit2003) [12:58:39] !log arnaudb@cumin1003 END (FAIL) - Cookbook sre.gerrit.topology-check (exit_code=99) Validate Gerrit topology (source=gerrit1003, replica=gerrit2003) [12:59:18] (03CR) 10Volans: [C:03+2] Redfish: more tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164473 (owner: 10Ayounsi) [12:59:48] !log arnaudb@cumin1003 START - Cookbook sre.gerrit.topology-check Validate Gerrit topology (source=gerrit1003, replica=gerrit2003) [12:59:50] !log arnaudb@cumin1003 END (FAIL) - Cookbook sre.gerrit.topology-check (exit_code=99) Validate Gerrit topology (source=gerrit1003, replica=gerrit2003) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T1300). [13:00:05] hamishcz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:24] :) [13:01:17] (03PS2) 10Muehlenhoff: crm: Enable profile::auto_restarts::service for Apache [puppet] - 10https://gerrit.wikimedia.org/r/1166205 (https://phabricator.wikimedia.org/T135991) [13:01:42] I'll try to deploy it [13:01:51] but my current one needs to finish first [13:01:57] (03CR) 10Jgreen: [C:03+1] crm: Enable profile::auto_restarts::service for Apache [puppet] - 10https://gerrit.wikimedia.org/r/1166205 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:03:21] I’m around but kinda busy so if someone else can deploy that’s great :) [13:04:14] i can wait :) [13:04:23] (03CR) 10Ayounsi: [C:03+2] reimage: temporarily store the MAC in Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/1164151 (owner: 10Ayounsi) [13:04:41] If there are already linked pages in the draft namespace, it might make sense to add the old name as an alias for the namespace in order to not break them [13:04:56] elukey@cumin2002 reimage (PID 2238497) is awaiting input [13:05:01] * TheresNoTime is stuck on something atm, so cannot deploy, sorry! [13:05:21] (03PS1) 10MVernon: thanos: remove now-drained thanos-be200[1-4] from rings [puppet] - 10https://gerrit.wikimedia.org/r/1166822 (https://phabricator.wikimedia.org/T391352) [13:05:21] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage [13:05:22] (03PS1) 10MVernon: hiera: remove thanos-be200[1-4] from thanos::swift::backends [puppet] - 10https://gerrit.wikimedia.org/r/1166823 (https://phabricator.wikimedia.org/T391352) [13:06:20] zabe: can we just replace the already linked pages? 
(didn't look into the amount, but I think there will not be so many) [13:06:29] sure [13:07:53] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage [13:07:54] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1164156|Drop ability to use VueTest on a wiki (T357475)]] (duration: 37m 21s) [13:07:57] T357475: Decommission the VueTest Extension - https://phabricator.wikimedia.org/T357475 [13:08:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166783 (https://phabricator.wikimedia.org/T398792) (owner: 10Hamish) [13:08:42] (03PS4) 10Brouberol: Increase thresholds for run_podsandbox and stop_podsandbox in dse-k8s [alerts] - 10https://gerrit.wikimedia.org/r/1156324 (https://phabricator.wikimedia.org/T396738) (owner: 10Btullis) [13:09:25] (03Merged) 10jenkins-bot: mrwiki: Correct draft namespace spelling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166783 (https://phabricator.wikimedia.org/T398792) (owner: 10Hamish) [13:09:42] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1166783|mrwiki: Correct draft namespace spelling (T398792)]] [13:09:45] T398792: Change spelling of namespace on mrwiki - https://phabricator.wikimedia.org/T398792 [13:10:34] (03Merged) 10jenkins-bot: Redfish: more tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1164473 (owner: 10Ayounsi) [13:11:57] !log ladsgroup@deploy1003 ladsgroup, hamishz: Backport for [[gerrit:1166783|mrwiki: Correct draft namespace spelling (T398792)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:12:05] (03Merged) 10jenkins-bot: reimage: temporarily store the MAC in Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/1164151 (owner: 10Ayounsi) [13:12:29] (03CR) 10Btullis: [C:03+2] Increase thresholds for run_podsandbox and stop_podsandbox in dse-k8s [alerts] - 10https://gerrit.wikimedia.org/r/1156324 (https://phabricator.wikimedia.org/T396738) (owner: 10Btullis) [13:12:43] I can confirm the name has been corrected, but the question zabe raised is still concerning: how should we deal with it? [13:13:03] as I cannot read the mr language, it was difficult for me to do an insource search [13:13:05] (03PS2) 10Elukey: Release version 0.0.16 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1166802 (https://phabricator.wikimedia.org/T397696) [13:13:07] hamishcz: ask the community to update the links [13:13:14] (03CR) 10Eevans: [C:03+1] thanos: remove now-drained thanos-be200[1-4] from rings [puppet] - 10https://gerrit.wikimedia.org/r/1166822 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon) [13:13:21] can, then the patch can go on [13:13:25] !log ladsgroup@deploy1003 ladsgroup, hamishz: Continuing with sync [13:13:40] (03Merged) 10jenkins-bot: Increase thresholds for run_podsandbox and stop_podsandbox in dse-k8s [alerts] - 10https://gerrit.wikimedia.org/r/1156324 (https://phabricator.wikimedia.org/T396738) (owner: 10Btullis) [13:13:47] (03CR) 10Eevans: [C:03+1] hiera: remove thanos-be200[1-4] from thanos::swift::backends [puppet] - 10https://gerrit.wikimedia.org/r/1166823 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon) [13:14:01] (03CR) 10Elukey: "@mmuhlenhoff@wikimedia.org updated the change with two nits, to run tests I needed to add python3-kubernetes among build-depends, and I ac" [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1166802 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [13:16:54] (03CR) 10David Caro: [C:03+1] P:wmcs::metricsinfra: Hide 'logger' receiver from Karma [puppet] - 
10https://gerrit.wikimedia.org/r/1166788 (https://phabricator.wikimedia.org/T398812) (owner: 10Majavah) [13:17:29] (03CR) 10Filippo Giunchedi: [C:03+2] nsca_frack.cfg.erb add hostgroup fundraising-pay-lb and two new metrics [puppet] - 10https://gerrit.wikimedia.org/r/1166820 (https://phabricator.wikimedia.org/T398321) (owner: 10Jgreen) [13:17:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2146* gradually with 4 steps - Work done [13:18:23] (03CR) 10Majavah: [C:03+2] P:wmcs::metricsinfra: Hide 'logger' receiver from Karma [puppet] - 10https://gerrit.wikimedia.org/r/1166788 (https://phabricator.wikimedia.org/T398812) (owner: 10Majavah) [13:19:08] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166783|mrwiki: Correct draft namespace spelling (T398792)]] (duration: 09m 26s) [13:19:13] T398792: Change spelling of namespace on mrwiki - https://phabricator.wikimedia.org/T398792 [13:20:35] (03CR) 10MVernon: [C:03+2] thanos: remove now-drained thanos-be200[1-4] from rings [puppet] - 10https://gerrit.wikimedia.org/r/1166822 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon) [13:22:37] (03CR) 10David Caro: [C:03+1] "yay \o/" [puppet] - 10https://gerrit.wikimedia.org/r/1165841 (https://phabricator.wikimedia.org/T397634) (owner: 10Majavah) [13:22:42] (03CR) 10Brouberol: [C:03+1] Increase thresholds for run_podsandbox and stop_podsandbox in dse-k8s [alerts] - 10https://gerrit.wikimedia.org/r/1156324 (https://phabricator.wikimedia.org/T396738) (owner: 10Btullis) [13:22:43] (03CR) 10David Caro: [C:03+1] P:toolforge::static: Handle simple redirects in HAProxy config [puppet] - 10https://gerrit.wikimedia.org/r/1165843 (https://phabricator.wikimedia.org/T397634) (owner: 10Majavah) [13:24:06] (03CR) 10Majavah: [C:03+2] P:toolforge::static: Put HAProxy in front of the Nginx instance [puppet] - 10https://gerrit.wikimedia.org/r/1165841 (https://phabricator.wikimedia.org/T397634) (owner: 10Majavah) [13:24:13] (03CR) 
10Majavah: [C:03+2] P:toolforge::static: Handle simple redirects in HAProxy config [puppet] - 10https://gerrit.wikimedia.org/r/1165843 (https://phabricator.wikimedia.org/T397634) (owner: 10Majavah) [13:24:18] thanks, the patch is live now [13:24:22] (03PS1) 10Marostegui: db1237: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1166826 (https://phabricator.wikimedia.org/T398794) [13:24:24] (03PS2) 10Majavah: P:toolforge::static: Handle simple redirects in HAProxy config [puppet] - 10https://gerrit.wikimedia.org/r/1165843 (https://phabricator.wikimedia.org/T397634) [13:25:14] (03CR) 10Marostegui: [C:03+2] db1237: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1166826 (https://phabricator.wikimedia.org/T398794) (owner: 10Marostegui) [13:26:29] (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: remove deprecated dumps 2 config. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155644 (https://phabricator.wikimedia.org/T396593) (owner: 10Gmodena) [13:26:32] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: remove deprecated dumps 2 config. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1155644 (https://phabricator.wikimedia.org/T396593) (owner: 10Gmodena) [13:26:47] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2005-dev.codfw.wmnet with OS bookworm [13:26:47] (03PS4) 10Ladsgroup: Use table catalog for fullViews [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) [13:27:22] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephmon2006-dev.codfw.wmnet with OS bookworm [13:27:45] (03CR) 10Zabe: [C:03+2] Set categorylinks to read new in medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166816 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [13:28:07] (03CR) 10Filippo Giunchedi: [C:03+1] P:dns::auth::monitoring: add prometheus::dnsbox_service_state_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1166210 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:28:10] !log sudo cumin 'C:bird' "disable-puppet 'merging CR 1166222'" [13:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:21] RECOVERY - Host wikikube-worker1069 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [13:28:35] (03Merged) 10jenkins-bot: Set categorylinks to read new in medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166816 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [13:28:52] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1166816|Set categorylinks to read new in medium wikis (T397912)]] [13:28:54] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [13:29:04] (03CR) 10Ssingh: [C:03+2] bird/anycast-hc: allow setting SupplementaryGroups for anycast-hc unit [puppet] - 10https://gerrit.wikimedia.org/r/1166222 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:29:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: 
Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829#10978939 (10Jclark-ctr) Replaced Motherboard /backplane and / backplane cable Server is back up and can be repooled @Clement_Goubert [13:29:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829#10978941 (10Jclark-ctr) 05Openβ†’03Resolved [13:30:15] (03CR) 10Majavah: [C:03+2] P:toolforge::static: Handle simple redirects in HAProxy config [puppet] - 10https://gerrit.wikimedia.org/r/1165843 (https://phabricator.wikimedia.org/T397634) (owner: 10Majavah) [13:30:33] RESOLVED: KubernetesCalicoDown: wikikube-worker1069.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1069.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:30:36] (03PS4) 10Fabfur: varnish: replace X-Public-Cloud with new X-Provenance header check [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) [13:30:56] !log zabe@deploy1003 zabe: Backport for [[gerrit:1166816|Set categorylinks to read new in medium wikis (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:31:05] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 190933080 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:31:37] (03CR) 10Muehlenhoff: [C:03+2] crm: Enable profile::auto_restarts::service for Apache [puppet] - 10https://gerrit.wikimedia.org/r/1166205 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:31:38] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur) [13:31:47] !log zabe@deploy1003 zabe: Continuing with sync [13:32:05] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 261680 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:33:15] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. The depends on python3-kubernetes might even get auto-computed via shlibs, but we can also keep it in." 
[docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1166802 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [13:34:26] !log sudo cumin -b11 'C:bird' "run-puppet-agent --enable 'merging CR 1166222'": NOOP change [13:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-ctrl2003.codfw.wmnet [13:36:09] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [13:37:10] (03PS1) 10Zabe: Revert "Set categorylinks to read new in medium wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166828 [13:37:14] (03CR) 10Zabe: [V:03+2 C:03+2] Revert "Set categorylinks to read new in medium wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166828 (owner: 10Zabe) [13:37:36] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1166828|Revert "Set categorylinks to read new in medium wikis"]] [13:39:10] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:39:28] !log zabe@deploy1003 zabe: Backport for [[gerrit:1166828|Revert "Set categorylinks to read new in medium wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:39:33] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [13:39:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-ctrl2003.codfw.wmnet [13:40:10] !log zabe@deploy1003 zabe: Continuing with sync [13:40:14] (03PS1) 10Jelto: gitlab-runner: remove hiera host file for WMCS runner-1029 [puppet] - 10https://gerrit.wikimedia.org/r/1166829 (https://phabricator.wikimedia.org/T398628) [13:41:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:41:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:41:16] (03CR) 10Arnaudb: [C:03+1] "easy to review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1166829 (https://phabricator.wikimedia.org/T398628) (owner: 10Jelto) [13:42:08] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:44:27] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Use a separate site for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1166751 (owner: 10Vgutierrez) [13:45:04] !log homer "cr*eqiad*" commit 'wikikube-worker1069 back to active' [13:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:36] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166828|Revert "Set categorylinks to read new in medium wikis"]] (duration: 07m 59s) [13:45:37] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage [13:46:47] !log cgoubert@cumin1003 START - Cookbook sre.hosts.remove-downtime for wikikube-worker1069.eqiad.wmnet [13:46:47] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker1069.eqiad.wmnet [13:47:15] (03CR) 10Jelto: [C:03+2] gitlab-runner: remove hiera host file for WMCS runner-1029 [puppet] - 10https://gerrit.wikimedia.org/r/1166829 (https://phabricator.wikimedia.org/T398628) (owner: 10Jelto) [13:47:41] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1069.eqiad.wmnet [13:47:42] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1069.eqiad.wmnet [13:47:53] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10979044 (10brouberol) [13:47:56] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10979045 (10brouberol) 
05Resolvedβ†’03In progress [13:48:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-ctrl2002.codfw.wmnet [13:48:12] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10979049 (10brouberol) 05In progressβ†’03Resolved [13:48:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829#10979051 (10Clement_Goubert) Host repooled, thanks! [13:49:06] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage [13:51:11] !log sudo cumin 'A:dnsbox' "disable-puppet 'merging CR 1166210'" [13:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:20] (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns::auth::monitoring: add prometheus::dnsbox_service_state_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1166210 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [13:52:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-ctrl2002.codfw.wmnet [13:52:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-ctrl1003.eqiad.wmnet [13:53:20] (03CR) 10Eevans: [C:03+2] sessionstore1006: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165017 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [13:54:00] (03PS3) 10Eevans: sessionstore1006: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165017 (https://phabricator.wikimedia.org/T391544) [13:54:00] (03PS3) 10Eevans: sessionstore1006: assign JBOD data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1165018 (https://phabricator.wikimedia.org/T391544) [13:54:00] (03PS3) 10Eevans: sessionstore: preseed eqiad 
servers for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1165019 (https://phabricator.wikimedia.org/T391544) [13:54:35] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 18.0 [13:56:50] (03CR) 10Eevans: [C:03+2] sessionstore1006: reimage for JBOD configuration [puppet] - 10https://gerrit.wikimedia.org/r/1165017 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [13:57:10] (03CR) 10Hnowlan: [V:03+2 C:03+2] ratelimit: bump version number [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1166412 (https://phabricator.wikimedia.org/T388804) (owner: 10Hnowlan) [13:57:41] FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:57:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-ctrl1003.eqiad.wmnet [13:57:59] (03PS1) 10Vgutierrez: Revert "cache::haproxy: Use a separate site for port 80" [puppet] - 10https://gerrit.wikimedia.org/r/1166831 [13:58:19] (03PS1) 10Brouberol: Provision kafka-jumbo1016 [puppet] - 10https://gerrit.wikimedia.org/r/1166832 (https://phabricator.wikimedia.org/T398826) [13:58:21] (03PS1) 10Brouberol: Provision kafka-jumbo1017 [puppet] - 10https://gerrit.wikimedia.org/r/1166833 (https://phabricator.wikimedia.org/T398826) [13:58:22] (03PS1) 10Brouberol: Provision kafka-jumbo1018 [puppet] - 10https://gerrit.wikimedia.org/r/1166834 (https://phabricator.wikimedia.org/T398826) [13:59:12] (03PS5) 10Fabfur: varnish: replace X-Public-Cloud with new X-Provenance header check [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) [13:59:33] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159995 
(https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur) [14:00:05] 10SRE-SLO, 13Patch-For-Review: Reduce the pyrra's multi-dc configurations where it makes sense - https://phabricator.wikimedia.org/T398534#10979138 (10herron) I think we could do it, but before committing to the change could we expand a bit on rationale and side-effects/use cases? A quick list off the top of... [14:00:10] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10979139 (10Arnoldokoth) [14:01:02] (03CR) 10Vgutierrez: [C:03+2] Revert "cache::haproxy: Use a separate site for port 80" [puppet] - 10https://gerrit.wikimedia.org/r/1166831 (owner: 10Vgutierrez) [14:01:13] (03PS1) 10Ssingh: C:prometheus: use updated file name for dnsbox_service_state [puppet] - 10https://gerrit.wikimedia.org/r/1166836 (https://phabricator.wikimedia.org/T374619) [14:02:33] (03CR) 10Ssingh: [C:03+2] C:prometheus: use updated file name for dnsbox_service_state [puppet] - 10https://gerrit.wikimedia.org/r/1166836 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:02:50] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@f79034f]: remove dumps 1.0 sensor from SLIS [14:03:20] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@f79034f]: remove dumps 1.0 sensor from SLIS (duration: 00m 46s) [14:05:11] (03CR) 10Btullis: [C:03+1] Provision kafka-jumbo1016 [puppet] - 10https://gerrit.wikimedia.org/r/1166832 (https://phabricator.wikimedia.org/T398826) (owner: 10Brouberol) [14:05:34] (03CR) 10Btullis: [C:03+1] Provision kafka-jumbo1017 [puppet] - 10https://gerrit.wikimedia.org/r/1166833 (https://phabricator.wikimedia.org/T398826) (owner: 10Brouberol) [14:05:37] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 18.0 [14:05:48] (03CR) 10Btullis: [C:03+1] 
Provision kafka-jumbo1018 [puppet] - 10https://gerrit.wikimedia.org/r/1166834 (https://phabricator.wikimedia.org/T398826) (owner: 10Brouberol) [14:06:17] (03CR) 10JHathaway: [C:03+1] "makes sense to me!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1164147 (owner: 10Ayounsi) [14:07:52] (03CR) 10Brouberol: [C:03+2] Provision kafka-jumbo1016 [puppet] - 10https://gerrit.wikimedia.org/r/1166832 (https://phabricator.wikimedia.org/T398826) (owner: 10Brouberol) [14:07:54] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2006-dev.codfw.wmnet with OS bookworm [14:09:29] sudo cumin -b1 -s10 'A:dnsbox' "run-puppet-agent --enable 'merging CR 1166210'" [14:09:33] !log sudo cumin -b1 -s10 'A:dnsbox' "run-puppet-agent --enable 'merging CR 1166210'" [14:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:34] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10979172 (10Clement_Goubert) @cmooney SSH key verified out of band [14:10:47] !log decommissioning Cassandra/sessionstore-a β€” T391544 [14:10:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:51] T391544: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544 [14:13:54] (03PS3) 10Herron: pyrra-filesystem: clear output files on service start [puppet] - 10https://gerrit.wikimedia.org/r/1165571 (https://phabricator.wikimedia.org/T302995) [14:14:30] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore1006.eqiad.wmnet with OS bullseye [14:14:45] 06SRE-OnFire, 10Cassandra, 
10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10979184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1006.e... [14:15:03] (03PS1) 10Ssingh: hiera: enable anycast-hc prom metrics for wikidough [puppet] - 10https://gerrit.wikimedia.org/r/1166838 (https://phabricator.wikimedia.org/T374619) [14:16:20] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6150/console" [puppet] - 10https://gerrit.wikimedia.org/r/1166838 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:17:08] (03PS2) 10Ssingh: hiera: enable anycast-hc prom metrics for wikidough [puppet] - 10https://gerrit.wikimedia.org/r/1166838 (https://phabricator.wikimedia.org/T374619) [14:18:19] !log brouberol@cumin2002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad [14:18:33] FIRING: [2x] ProbeDown: Service sessionstore1006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:12] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1166838 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:21:33] (03CR) 10Herron: "Thanks, sounds good to me. I think it'd work either way since thanos-rule would be reloaded when pyrra-filesystem (re)starts. In any eve" [puppet] - 10https://gerrit.wikimedia.org/r/1165571 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [14:22:22] headsup, I'm adding kafka-jumbo1016 host to the kafka-jumbo cluster. 
I've run puppet on all kafka and zk hosts, so kafka could start on 1016, meaning I'm not *expecting* alerts to fire. These kafka alerts are directed to team:sre and are going to klaxon [14:22:32] If I was somehow wrong, I'm sorry in advance [14:22:48] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: sync [14:22:55] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [14:24:04] (03CR) 10Milimetric: [C:03+1] "This looks good to me from a privacy point of view, but I'm not sure if that join will perform ok on the wikis with bigger categorylinks t" [puppet] - 10https://gerrit.wikimedia.org/r/1165844 (https://phabricator.wikimedia.org/T299951) (owner: 10Zabe) [14:24:36] FIRING: [2x] ProbeDown: Service sessionstore1006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:25:28] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore1006.eqiad.wmnet with OS bullseye [14:25:41] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10979261 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1006.eqiad... 
[14:25:44] (03PS6) 10Fabfur: varnish: replace X-Public-Cloud with new X-Provenance header check [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) [14:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:25:51] (03CR) 10Jforrester: [C:03+1] Use FallbackContentHandler for another undeployed content handler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166286 (https://phabricator.wikimedia.org/T124748) (owner: 10Bartosz DziewoΕ„ski) [14:26:05] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore1006.eqiad.wmnet with OS bullseye [14:26:26] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10979273 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1006.e... 
[14:27:48] (03CR) 10Ladsgroup: "check exprimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [14:28:28] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [14:28:31] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [14:29:00] (03PS8) 10Ladsgroup: mariadb: Load list of private tables from the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) [14:29:12] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [14:29:33] (03CR) 10Vgutierrez: [C:03+1] hiera: enable anycast-hc prom metrics for wikidough [puppet] - 10https://gerrit.wikimedia.org/r/1166838 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:29:35] FIRING: [2x] ProbeDown: Service sessionstore1006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T1430) [14:30:48] (03CR) 10Eevans: [C:03+2] sessionstore1006: assign JBOD data_file_directories [puppet] - 10https://gerrit.wikimedia.org/r/1165018 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [14:31:28] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1166846 [14:31:29] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#10979314 (10andrea.denisse) p:05Triageβ†’03High [14:31:38] (03PS1) 10Vgutierrez: hiera: Switch lvs6003 to 
katran [puppet] - 10https://gerrit.wikimedia.org/r/1166847 (https://phabricator.wikimedia.org/T396561) [14:31:41] (03PS2) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1166846 [14:31:58] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166847 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:32:36] !log sudo cumin 'A:wikidough' "disable-puppet 'merging CR 1166838'" [14:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:00] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: enable anycast-hc prom metrics for wikidough [puppet] - 10https://gerrit.wikimedia.org/r/1166838 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:33:28] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur) [14:33:33] FIRING: [2x] ProbeDown: Service sessionstore1006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:34:36] FIRING: [2x] ProbeDown: Service sessionstore1006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:35] (03CR) 10Ssingh: [C:03+1] "Verified iface." 
[puppet] - 10https://gerrit.wikimedia.org/r/1166847 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:35:51] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs6003 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1166847 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:37:04] (03CR) 10Filippo Giunchedi: [C:03+1] pyrra-filesystem: clear output files on service start [puppet] - 10https://gerrit.wikimedia.org/r/1165571 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [14:38:19] !log sudo cumin -s1 -b10 'A:wikidough' "run-puppet-agent --enable 'merging CR 1166838'" [14:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:26] ha [14:38:33] FIRING: [2x] ProbeDown: Service sessionstore1006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:40] sukhe: task missing? 
O:) [14:38:53] na, -s1 -b10 and not -b1 -s10 :) [14:39:07] !log sudo cumin -b1 -s10 'A:wikidough' "run-puppet-agent --enable 'merging CR 1166838'" [14:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:12] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [14:39:41] (03CR) 10MVernon: [C:03+2] hiera: remove thanos-be200[1-4] from thanos::swift::backends [puppet] - 10https://gerrit.wikimedia.org/r/1166823 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon) [14:40:17] (03PS2) 10Hnowlan: api-gateway: use ratelimit's inbuilt promethus-statsd agent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166790 (https://phabricator.wikimedia.org/T388804) [14:40:21] 06SRE, 06DC-Ops, 10procurement: netbox export of Juniper device csv fails - https://phabricator.wikimedia.org/T398836 (10RobH) 03NEW p:05Triageβ†’03High [14:41:06] !log switching lvs6003 to katran - T396561 [14:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:10] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [14:42:01] (03PS3) 10Ssingh: hiera: dnsbox: set supplementary_groups and enable Prom metrics (anycast-hc) [puppet] - 10https://gerrit.wikimedia.org/r/1166223 (https://phabricator.wikimedia.org/T374619) [14:42:08] (03PS3) 10Hnowlan: api-gateway: use ratelimit's inbuilt promethus-statsd agent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166790 (https://phabricator.wikimedia.org/T388804) [14:42:22] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1006.eqiad.wmnet with reason: host reimage [14:42:51] 06SRE, 06DC-Ops, 10procurement: netbox export of Juniper device csv fails - https://phabricator.wikimedia.org/T398836#10979414 (10RobH) Arzhel poked something, now the error has shifted and is: There was an error 
rendering the selected export template (CSV All fields): 'dcim.models.racks.Rack object' has... [14:43:09] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6156/co" [puppet] - 10https://gerrit.wikimedia.org/r/1166223 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:43:18] (03PS1) 10Vgutierrez: hiera: Consolidate liberica fp settings in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1166849 (https://phabricator.wikimedia.org/T396561) [14:43:38] (03CR) 10Ssingh: [V:03+1] "SupplementaryGroups=prometheus-node-exporter was tested on Wikidough hosts and it's working fine, so rolling it out on these." [puppet] - 10https://gerrit.wikimedia.org/r/1166223 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:45:38] (03CR) 10Filippo Giunchedi: [C:03+1] hiera: dnsbox: set supplementary_groups and enable Prom metrics (anycast-hc) [puppet] - 10https://gerrit.wikimedia.org/r/1166223 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:46:22] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1006.eqiad.wmnet with reason: host reimage [14:46:41] (03PS3) 10Herron: alerting_host: set puppet agent to 5m interval [puppet] - 10https://gerrit.wikimedia.org/r/1166846 (https://phabricator.wikimedia.org/T398444) [14:46:46] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166849 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:47:01] !log sudo cumin 'A:dnsbox' "disable-puppet 'merging CR 1166223'": rolling out prom metrics for anycast-hc: T374619 [14:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:05] T374619: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619 [14:50:28] (03CR) 10Ssingh: [C:03+1] hiera: Consolidate liberica fp settings in drmrs 
[puppet] - 10https://gerrit.wikimedia.org/r/1166849 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:50:35] (03CR) 10Vgutierrez: [C:03+2] hiera: Consolidate liberica fp settings in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1166849 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:51:00] 06SRE, 06DC-Ops, 10procurement: netbox export of Juniper device csv fails - https://phabricator.wikimedia.org/T398836#10979507 (10RobH) 05Openβ†’03Resolved @ayounsi fixed it, thank you! [14:51:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:51:14] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: dnsbox: set supplementary_groups and enable Prom metrics (anycast-hc) [puppet] - 10https://gerrit.wikimedia.org/r/1166223 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [14:51:29] (03CR) 10Ahmon Dancy: "Can someone poke this change please? When it was +2'd it went into "ready to submit" state but never made progress after that." 
[puppet] - 10https://gerrit.wikimedia.org/r/1165912 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [14:51:33] (03PS1) 10Vgutierrez: hiera: Switch lvs3010 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1166850 (https://phabricator.wikimedia.org/T396561) [14:51:46] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166850 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:51:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163415 (https://phabricator.wikimedia.org/T397732) (owner: 10Ebernhardson) [14:52:43] (03CR) 10David Caro: [C:03+1] hieradata: Enable NAT logging on cloudgw1004 [puppet] - 10https://gerrit.wikimedia.org/r/1166787 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [14:53:08] (03CR) 10Ssingh: [C:03+1] hiera: Switch lvs3010 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1166850 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:54:12] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs3010 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1166850 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [14:54:42] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns7001.wikimedia.org [reason: testing CR 1166223: T374619] [14:54:45] T374619: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619 [14:54:57] (03CR) 10Ladsgroup: "amir@amir:~/workspace/tables catalog$ diff <(grep -i '+ -' public_tables_diff | sort | sed -r 's/^\+ \-//') <(grep -i '\- \-' public_ta" [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [14:55:06] (03CR) 10David Caro: [C:03+1] ":crossingfingers:" [puppet] - 
10https://gerrit.wikimedia.org/r/1166793 (https://phabricator.wikimedia.org/T396038) (owner: 10Majavah) [14:55:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host aux-k8s-ctrl1002.eqiad.wmnet [14:55:39] (03CR) 10Ladsgroup: "Useless: https://phabricator.wikimedia.org/P78777" [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [14:55:45] (03PS5) 10Ladsgroup: Use table catalog for fullViews [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) [14:55:54] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Use table catalog for fullViews [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [14:56:13] (03CR) 10Majavah: [C:03+2] P:toolforge::prometheus: Add source_labels to severity relabel rule [puppet] - 10https://gerrit.wikimedia.org/r/1166793 (https://phabricator.wikimedia.org/T396038) (owner: 10Majavah) [14:56:22] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: Enable NAT logging on cloudgw1004 [puppet] - 10https://gerrit.wikimedia.org/r/1166787 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [14:56:33] (03PS9) 10Ladsgroup: mariadb: Load list of private tables from the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) [14:56:38] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Load list of private tables from the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [14:57:58] (03PS1) 10Andrew Bogott: partman_early_command: don't wipe out lvm for cloudcephosd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1166852 (https://phabricator.wikimedia.org/T306820) [14:58:07] (03CR) 10Elukey: [C:03+2] Release version 0.0.16 [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1166802 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) 
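An aside on the `-s1 -b10` vs `-b1 -s10` mixup at 14:38 above: cumin's `-b`/`--batch-size` controls how many hosts are acted on at once, while `-s`/`--batch-sleep` is the pause in seconds between batches, so the two flags are not interchangeable. A minimal sketch of the intended serial rollout (host names are hypothetical and the command is only echoed, not executed):

```shell
# Simulating `cumin -b1 -s10 'A:wikidough' "run-puppet-agent --enable ..."`:
# one host per batch, with a sleep between batches.
hosts="doh1001 doh2001 doh3001"   # hypothetical A:wikidough members
batch_size=1
batch_sleep=0                     # the real run used 10s; 0 keeps the sketch fast
done_count=0
for h in $hosts; do
  echo "would run on $h: run-puppet-agent --enable 'merging CR 1166838'"
  done_count=$((done_count + 1))
  # pause only after a full batch has completed
  if [ $((done_count % batch_size)) -eq 0 ]; then
    sleep "$batch_sleep"
  fi
done
```

With the flags swapped (`-s1 -b10`) the same command would have hit ten hosts at a time with only a one-second pause, which is why it was re-issued as `-b1 -s10` at 14:39.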
[14:58:21] !log switching lvs3010 to katran - T396561 [14:58:21] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns7001.wikimedia.org [reason: [done] testing CR 1166223: T374619] [14:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:24] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [15:00:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aux-k8s-ctrl1002.eqiad.wmnet [15:00:37] !log sudo cumin -b1 -s120 'A:dnsbox and not P{dns7001*}' "run-puppet-agent --enable 'merging CR 1166223'": T374619 [15:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:40] T374619: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619 [15:01:55] /win 44 [15:01:57] heh [15:02:29] !log bking@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [15:02:32] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [15:02:42] !log brouberol@cumin2002 END (FAIL) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=99) rolling restart_daemons on A:kafka-jumbo-eqiad [15:03:04] 10SRE-SLO, 13Patch-For-Review: Reduce the pyrra's multi-dc configurations where it makes sense - https://phabricator.wikimedia.org/T398534#10979643 (10elukey) @herron sure! So my idea stems from the fact that discovery endpoints like all k8s services should have a single SLO, since how we pool/depool and manag... 
[15:04:34] (03PS1) 10Ladsgroup: tables-catalog: Mark revision to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1166854 (https://phabricator.wikimedia.org/T363581) [15:04:47] (03CR) 10David Caro: [C:03+1] partman_early_command: don't wipe out lvm for cloudcephosd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1166852 (https://phabricator.wikimedia.org/T306820) (owner: 10Andrew Bogott) [15:05:01] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T398843 (10phaultfinder) 03NEW [15:05:01] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T398842 (10phaultfinder) 03NEW [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:47] (03PS2) 10Ladsgroup: tables-catalog: Mark vision to 1 [puppet] - 10https://gerrit.wikimedia.org/r/1166854 (https://phabricator.wikimedia.org/T363581) [15:07:46] (03PS1) 10Krinkle: interwiki-labs.php: Regenerate interwiki map (switch to beta.wmcloud.org) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166855 (https://phabricator.wikimedia.org/T289318) [15:08:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventgate-analytics-external.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=eventgate-analytics-external.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrors [15:09:14] brouberol: ^^ is that you? [15:09:34] vgutierrez: the ATS alert? 
[15:09:47] yes [15:10:15] not that I know of, but let's not cross that off the list [15:10:33] !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts thanos-be[2001-2004].codfw.wmnet [15:11:25] FIRING: SystemdUnitFailed: sync-puppet-ca.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:11:38] brouberol: is the new host already in the list of allowed ones for the network policies? [15:12:31] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp2006.codfw.wmnet [15:12:38] yes, I had run puppet on all kafka/zk hosts beforehand [15:13:01] no I mean from the eventgate pods to the new kafka host [15:13:44] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:13:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventgate-analytics-external.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:13:53] good call, applying the external-services diff now [15:13:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:14:08] this paged everyone AFAICT [15:14:17] * Emperor here [15:14:27] acked [15:14:33] I see a ton of timeouts in https://logstash.wikimedia.org/goto/dd378320f56189d7c0916fffb1d049aa [15:15:08] does on-call need assistance, or was that just an overenthusiastic p.age-all? [15:15:57] Here as well. 
I think it was overenthusiastic [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:11] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:17:19] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:17:39] elukey: Are you taking care of the issue with these actions? [15:17:47] brett: yeah I am [15:17:50] thanks! [15:18:03] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [15:18:29] if I am right eventgate pods should be happier in a bit [15:18:42] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:18:52] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2006.codfw.wmnet [15:19:10] elukey: shall I do eqiad as well? [15:19:23] sorry, I was afk for a bit, right at the wrong time [15:19:25] brouberol: already done, but not sure if it is it [15:20:09] I think we need to roll-restart eventgate pods [15:20:11] lemme do it [15:20:17] what I don't understand is that kafka-jumbo1016 has no data whatsoever [15:20:18] brouberol@kafka-jumbo1007:~$ kafka topics --describe | grep 1016 [15:20:18] brouberol@kafka-jumbo1007:~$ [15:20:40] thanks [15:20:52] yeah I am wondering if the new host is used by the pods to discover topics etc. [15:21:00] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1166846 (https://phabricator.wikimedia.org/T398444) (owner: 10Herron) [15:21:07] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [15:21:20] brouberol: in the meantime, can you check if you see anything weird in grafana for jumbo? [15:21:27] sigh, adding any host to this cluster is so brittle. 
I wish there was a way to route these alerts to us instead of *everyone*, even if temporarily while we run these operations [15:21:46] sure [15:22:01] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling reboot on A:schema-codfw [15:22:12] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: sync [15:22:56] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync [15:23:18] brouberol: wow kafka-jumbo1010 is streaming a ton of data [15:23:39] Thanks elukey! [15:24:25] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: sync [15:24:47] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: sync [15:24:51] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: thanos-be[2001-2004].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [15:25:23] I'm seeing multiple under-replicated partitions, from multiple brokers. 
1010 started to stream a ton of data [15:25:41] from the k8s app logs, I see a reduction in errors after the roll restart [15:25:49] what I ran was: add 1016 to the cluster, run puppet on the whole cluster + zk, start a rolling-restart of the kafka cluster [15:25:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: thanos-be[2001-2004].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [15:25:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:25:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts thanos-be[2001-2004].codfw.wmnet [15:26:16] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10979766 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `thanos-be[2001-2004].codfw.wmnet` - thanos-be2001.codfw.wmnet (**PASS**) - Downti... [15:26:22] I'm trying to restore my tmux session from cumin2002, and somehow can't ssh [15:26:53] I see also 503s starting to go down [15:27:00] I cannot explain the bw usage though [15:27:13] it seems as if something started to pull data on a new broker [15:27:45] I am wondering if this is not a weird eventgate side effect [15:28:04] maybe it tried aggressively to push data to retry topics? [15:28:15] I'm not sure. 
It almost feels like kafka acts as if data was supposed to move between brokers [15:28:33] I'm seeing many of these logs on 1013 [15:28:33] [2025-07-07 15:28:25,948] INFO [ReplicaFetcher replicaId=1013, leaderId=1009, fetcherId=2] Remote broker is not the leader for partition rc1.eqiad.mediawiki.page_change-0, which could indicate that the partition is being moved (kafka.server.ReplicaFetcherThread) [15:28:51] RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventgate-analytics-external.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:28:51] (03PS1) 10Hashar: Change update to exactly match the given image name [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1166856 [15:28:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling reboot on A:schema-codfw [15:29:58] (03CR) 10Hashar: "That would fix the issue we have encountered in `integration/config` with `docker-pkg update php83 .` unexpectedly updating `quibble-php83" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1166856 (owner: 10Hashar) [15:29:58] it kind of feels like we're being hit by https://issues.apache.org/jira/browse/KAFKA-7447 [15:30:05] jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T1530). [15:30:29] (03CR) 10Muehlenhoff: "Now submitted, will be deployed fleet-wide within 30 minutes." [puppet] - 10https://gerrit.wikimedia.org/r/1165912 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [15:31:18] (03CR) 10Ahmon Dancy: "Thanks Moritz!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1165912 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [15:31:21] what is the bootstrap url that eventgate uses to connect to jumbo? [15:32:20] brouberol: not sure, but IIRC after the first call kafka may use a more dynamic broker/topic list from the broker it communicated with [15:32:22] one hypothesis might be that if it uses the external-services DNS and *if* the kafka-jumbo1016 IP appears first, then eventgate would attempt to connect to kafka via 1016 but wouldn't have the egress rule to do so [15:33:00] !log installing postgresql security updates [15:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:41] (03CR) 10JHathaway: puppetserver: check for rebase in puppetserver-deploy-code (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1163883 (https://phabricator.wikimedia.org/T397877) (owner: 10BryanDavis) [15:34:00] brouberol: so the errors I think are starting to climb back up [15:34:19] (03PS7) 10Fabfur: varnish: replace X-Public-Cloud with new X-Provenance header check [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) [15:34:34] elukey do you want to hop on a meet to discuss? [15:35:47] brouberol: could be an option yes [15:36:02] I'll send you a link. 
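[editor's annotation] The DNS-ordering hypothesis above can be illustrated with a short sketch: a resolver gives no ordering guarantee across records behind one name, so a client that takes the first record may bootstrap via any broker, including a newly added one that egress rules don't cover yet. The hostname below is a stand-in, not the actual eventgate bootstrap URL:

```python
import socket

def bootstrap_candidates(host: str, port: int) -> list[str]:
    """IPs a naive Kafka client could pick as its first bootstrap broker.

    getaddrinfo returns records in resolver order, which can vary between
    lookups; if a new broker's IP (e.g. kafka-jumbo1016's) sorts first and
    egress rules don't cover it yet, the first connection attempt fails.
    """
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    seen, out = set(), []
    for *_, sockaddr in infos:  # deduplicate, preserving resolver order
        ip = sockaddr[0]
        if ip not in seen:
            seen.add(ip)
            out.append(ip)
    return out

# "localhost" stands in for the real bootstrap name.
print(bootstrap_candidates("localhost", 9092))
```

This also matches elukey's point that bootstrap only covers the first call: after that, the client switches to the broker list returned in cluster metadata, so every broker IP needs to be reachable regardless of which one answered first.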
I'm dealing with a bunch of compound issues [15:36:02] so kafka1010 seems the one sending a ton of data, and in its logs I see [15:36:05] [2025-07-07 15:35:21,279] WARN Failed to send SSL Close message (org.apache.kafka.common.network.SslTransportLayer) [15:36:39] 10ops-codfw, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-be200[1-4].codfw.wmnet - https://phabricator.wikimedia.org/T398849 (10MatthewVernon) 03NEW [15:37:01] !log installing zsh updates from Bookworm point release [15:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:59] (03PS1) 10Vgutierrez: hiera: Unify liberica fp settings [puppet] - 10https://gerrit.wikimedia.org/r/1166860 (https://phabricator.wikimedia.org/T396561) [15:38:04] 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10979854 (10MatthewVernon) [15:38:18] 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10979859 (10MatthewVernon) 05Open→03Resolved All done! [15:38:47] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166860 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [15:38:48] 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10979862 (10MatthewVernon) [15:39:20] (03PS1) 10Majavah: hieradata: Enable natlog on all cloudgw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1166861 (https://phabricator.wikimedia.org/T273734) [15:39:44] (03CR) 10David Caro: [C:03+1] "nice! 🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1166762 (owner: 10Majavah) [15:40:07] (03CR) 10David Caro: [C:03+1] Openstack common/servicetoken.erb: remove a misleading comment [puppet] - 10https://gerrit.wikimedia.org/r/1143611 (owner: 10Andrew Bogott) [15:40:32] there's something fundamentally wrong with what I'm seeing. 
I rolling-restarted the cluster, which shouldn't move data around. And I'm now seeing brokers spew these logs [15:40:32] [2025-07-07 15:38:43,947] INFO [ReplicaFetcher replicaId=1007, leaderId=1013, fetcherId=2] Retrying leaderEpoch request for partition eqiad.mediawiki.web_ab_test_enrollment-0 as the leader reported an error: NOT_LEADER_FOR_PARTITION (kafka.server.ReplicaFetcherThread) [15:40:37] for topics that are not under replicated [15:40:48] (03CR) 10Majavah: [C:03+2] toolforge: Simplify wmcs-wheel-of-misfortune (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166762 (owner: 10Majavah) [15:41:02] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1139781 (owner: 10Majavah) [15:42:29] (03CR) 10Majavah: [C:03+2] P:wmcs: toolsdb_replica_cnf: Remove HTTPS redirect [puppet] - 10https://gerrit.wikimedia.org/r/1139781 (owner: 10Majavah) [15:42:51] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10979873 (10MoritzMuehlenhoff) [15:44:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventgate-analytics-external.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=eventgate-analytics-external.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrors [15:45:55] Accidentally resolved instead of acked ^^; [15:46:25] RESOLVED: SystemdUnitFailed: sync-puppet-ca.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:46:40] !log installing busybox updates from Bookworm point release [15:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
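[editor's annotation] Context for the NOT_LEADER_FOR_PARTITION retries above and the preferred-replica election run shortly after: Kafka treats the first broker in a partition's replica assignment as the preferred leader, and during a rolling restart leadership fails over to whichever replica stays up and does not move back on its own. A sketch of that rule, with illustrative broker IDs:

```python
def preferred_leader(replicas: list[int]) -> int:
    """The preferred leader is simply the first broker in the assignment."""
    return replicas[0]

def needs_election(current_leader: int, replicas: list[int]) -> bool:
    """True when leadership drifted off the preferred replica, e.g. after
    a rolling restart, and a preferred-replica election would move it back."""
    return current_leader != preferred_leader(replicas)

# Leadership failed over to 1013 while 1007 was restarting:
print(needs_election(1013, [1007, 1013, 1009]))  # -> True
# After a preferred-replica election, 1007 leads again:
print(needs_election(1007, [1007, 1013, 1009]))  # -> False
```

Until such an election runs, fetchers and clients holding stale leader metadata keep hitting the old leader and retrying, which is consistent with the log spam seen here.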
[15:47:40] 06SRE, 06collaboration-services, 10Release-Engineering-Team (Radar): Redirect revisions from svn.wikimedia.org to https://static-codereview.wikimedia.org - https://phabricator.wikimedia.org/T119846#10979894 (10LSobanski) A realistic solution we can offer effort-wise is redirecting all of the svn. links to th... [15:48:09] (03PS8) 10Fabfur: varnish: replace X-Public-Cloud with new X-Provenance header check [puppet] - 10https://gerrit.wikimedia.org/r/1159995 (https://phabricator.wikimedia.org/T396621) [15:48:32] (03CR) 10JHathaway: [C:03+1] New structure for sshd_config starting with trixie (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [15:49:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventgate-analytics-external.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:55:34] !log kafka-preferred-replica on kafka-jumbo [15:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:20] !log restart kafka on kafka1012 (first node without restart in the previous cookbook run) [15:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:39] (03CR) 10Ssingh: [C:03+1] hiera: Unify liberica fp settings [puppet] - 10https://gerrit.wikimedia.org/r/1166860 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [15:57:59] (03CR) 10Vgutierrez: [C:03+2] hiera: Unify liberica fp settings [puppet] - 10https://gerrit.wikimedia.org/r/1166860 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [15:59:36] !log restart kafka on kafka1013 (second node without restart in the previous cookbook run) [15:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:46] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation 
Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [16:00:49] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [16:01:44] (03PS1) 10Vgutierrez: hiera: Restore liberica bgp_config settings for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1166863 (https://phabricator.wikimedia.org/T396561) [16:02:40] !log restart kafka on kafka1014 (third node without restart in the previous cookbook run) [16:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:08] !log restart kafka on kafka1015 (fourth and last node without restart in the previous cookbook run) [16:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:05] (03CR) 10Ssingh: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166863 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [16:08:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: InboundInterfaceErrors reports for fasw2-c1a-eqiad:9804 frmon1002 ge-0/0/11 - https://phabricator.wikimedia.org/T398442#10979971 (10Jgreen) 05Duplicate→03Resolved [16:08:24] !log kafka preferred-replica-election on kafka1011 to rebalance partition leaders on kafka-jumbo [16:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:51] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventgate-analytics-external.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:10:22] all right I think we are out of the woods [16:10:29] {◕ ◡ ◕} [16:11:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [16:13:20] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10979993 (10MoritzMuehlenhoff) [16:15:26] (03Abandoned) 10Vgutierrez: hiera: Restore liberica bgp_config settings for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1166863 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [16:15:54] !log bking@cumin1002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [16:15:57] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [16:16:00] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [16:16:52] 10SRE-Access-Requests, 10Infrastructure Security, 06MediaWiki-Gerrit-Group-Requests: Request membership in mediawiki group for Cicalese - https://phabricator.wikimedia.org/T398122#10980009 (10Dzahn) [16:17:19] (03PS1) 10Vgutierrez: hiera: Remove unnecessary bgp_config peer overrides [puppet] - 10https://gerrit.wikimedia.org/r/1166865 [16:18:30] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166865 (owner: 10Vgutierrez) [16:20:08] !log eevans@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore1006.eqiad.wmnet with OS bullseye [16:20:18] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10980021 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1006.eqiad... [16:23:29] FIRING: [2x] SystemdUnitFailed: opensearch-disable-readahead-production-search-eqiad.service on cirrussearch1090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:26:03] (03PS2) 10Vgutierrez: hiera: Remove unnecessary bgp_config peer overrides [puppet] - 10https://gerrit.wikimedia.org/r/1166865 [16:26:40] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166865 (owner: 10Vgutierrez) [16:28:49] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#10980051 (10VRiley-WMF) Pinging @Eevans on this to see if there is a timeframe to shut down the server and check the serial numbers of the drives and start trying to replace them. [16:30:30] (03CR) 10Ssingh: [C:03+1] hiera: Remove unnecessary bgp_config peer overrides [puppet] - 10https://gerrit.wikimedia.org/r/1166865 (owner: 10Vgutierrez) [16:30:51] (03CR) 10Vgutierrez: [C:03+2] hiera: Remove unnecessary bgp_config peer overrides [puppet] - 10https://gerrit.wikimedia.org/r/1166865 (owner: 10Vgutierrez) [16:38:29] (03CR) 10Jforrester: [C:03+1] Change update to exactly match the given image name [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1166856 (owner: 10Hashar) [16:38:36] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore1006.eqiad.wmnet with OS bullseye [16:38:47] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10980090 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1006.e... 
[16:40:04] (03CR) 10Majavah: [C:03+2] hieradata: Enable natlog on all cloudgw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1166861 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [16:42:19] (03PS1) 10Vgutierrez: hiera: Remove lvs3009 bgp peer override [puppet] - 10https://gerrit.wikimedia.org/r/1166870 [16:42:54] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166870 (owner: 10Vgutierrez) [16:43:02] !log taavi@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudgw2003-dev.codfw.wmnet [16:45:02] (03PS1) 10Btullis: Add fake cephx key data for the new cephosd cluster in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/1166871 (https://phabricator.wikimedia.org/T374923) [16:49:03] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore1006.eqiad.wmnet with OS bullseye [16:49:16] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10980113 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1006.eqiad... [16:49:24] !log taavi@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2003-dev.codfw.wmnet [16:49:26] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore1006.eqiad.wmnet with OS bullseye [16:49:30] (03PS2) 10Vgutierrez: hiera: Remove esams and magru bgp peer overrides [puppet] - 10https://gerrit.wikimedia.org/r/1166870 [16:49:38] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10980114 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore1006.e... 
[16:50:16] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166870 (owner: 10Vgutierrez) [16:53:29] RESOLVED: [2x] SystemdUnitFailed: opensearch-disable-readahead-production-search-eqiad.service on cirrussearch1090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:53:33] RESOLVED: [2x] ProbeDown: Service sessionstore1006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:54:00] Thanks again elukey. That truly was some strange behavior and failure mode from kafka [16:56:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:57:04] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1237 is not booting up - https://phabricator.wikimedia.org/T398794#10980135 (10VRiley-WMF) Pulled the power and plugged it back in. It seems that it has come back with no issues. @Marostegui is there anything else would you like done on this? [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T1700) [17:00:05] ryankemper: OwO what's this, a deployment window?? Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T1700). 
nyaa~ [17:02:59] (03CR) 10Btullis: [V:03+2 C:03+2] Add fake cephx key data for the new cephosd cluster in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/1166871 (https://phabricator.wikimedia.org/T374923) (owner: 10Btullis) [17:03:20] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1092 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:24] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1084 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:24] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1098 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:24] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1121 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:24] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:24] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1074 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:24] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1096 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:25] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1088 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:25] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1081 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:26] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1090 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:28] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:32] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1071 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:32] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1075 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:32] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1113 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:32] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1114 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:32] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1116 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:32] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1110 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:33] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1111 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:33] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1124 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:48] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1094 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:48] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1070 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:48] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1089 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:48] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1068 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:48] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1083 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:48] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1076 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:49] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1095 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:49] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1099 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:50] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1102 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:50] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1086 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:51] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1107 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:52] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1120 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:52] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1079 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:52] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1069 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1097 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1078 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:54] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1101 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:56] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1085 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:56] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1091 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:56] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1077 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:56] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1119 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:56] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1112 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:57] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1073 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:57] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1108 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:58] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:58] ^^ looking into this now [17:03:58] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1109 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:59] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1118 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:04:00] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1100 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:04:14] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1125 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:04:14] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:04:14] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1093 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:05:11] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore1006.eqiad.wmnet with reason: host reimage [17:06:57] !log bking@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=search*,name=eqiad [17:07:26] !log bking@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=search-omega*,name=eqiad [17:07:33] !log bking@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=search-psi*,name=eqiad [17:09:04] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore1006.eqiad.wmnet with reason: host reimage [17:09:09] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1115 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:09:50] (03PS3) 10Dzahn: Revert^2 "gerrit: add a second replica, start replicating to gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1153265 (https://phabricator.wikimedia.org/T395887) [17:09:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1087 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:09:57] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1080 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:09:57] (03CR) 10Dzahn: [C:03+2] Revert^2 "gerrit: add a second replica, start replicating to gerrit2003" [puppet] - 10https://gerrit.wikimedia.org/r/1153265 (https://phabricator.wikimedia.org/T395887) (owner: 10Dzahn) [17:12:28] !log bking@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [17:12:35] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [17:13:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1117 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:14:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [17:17:23] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1082 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:19:19] (03CR) 10Ssingh: [C:03+1] "Verified, and also that there are no other pending changes for other LVS hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1166870 (owner: 10Vgutierrez) [17:19:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10980183 (10VRiley-WMF) @BCornwall Hey, I just wanted to check in with this to see if anything else is needed with this at the moment? If so, are we able to close this, or would you like to continue... 
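[editor's note] The recovery messages that follow report `active_shards_percent_as_number: NaN`. That field from `/_cluster/health` is the active shard count as a percentage of all shards in every state; during a full-cluster restart, before the master has recovered the cluster state, every shard counter is zero and the division is 0/0. A minimal sketch of the computation (an illustration reconstructed from the logged numbers, not the OpenSearch source):

```python
def active_shards_percent(active, relocating, initializing, unassigned):
    """Mirror of active_shards_percent_as_number from /_cluster/health:
    active shards as a percentage of shards in all states.
    Returns NaN when the cluster reports no shards at all (0/0)."""
    total = active + relocating + initializing + unassigned
    if total == 0:
        # Matches the NaN seen at 17:20 while the cluster state recovers.
        return float("nan")
    return 100.0 * active / total

# Values from the 17:32 recovery messages: 4113 active, 57 initializing,
# 191 unassigned, which reproduces the ~94.31% in the check output.
print(active_shards_percent(4113, 0, 57, 191))
print(active_shards_percent(0, 0, 0, 0))
```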
[17:20:17] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1081 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, [17:20:17] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 5284, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [17:20:17] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1094 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, [17:20:17] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 5279, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [17:20:21] RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_t [17:20:21] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 8317, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [17:20:23] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1084 is OK: OK - elasticsearch 
status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, [17:20:23] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 9483, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [17:20:23] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1096 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, [17:20:23] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 9635, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [17:20:23] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1074 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, [17:20:23] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 9724, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [17:20:24] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1090 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, 
number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, [17:20:24] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 9984, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [17:20:25] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1082 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, [17:23:15] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1082 is CRITICAL: CRITICAL - elasticsearch inactive shards 2957 threshold =0.15 breach: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1425, active_shards: 1425, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2957, delayed_unassigned_shar [17:23:15] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.51939753537197 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:15] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1098 is CRITICAL: CRITICAL - elasticsearch inactive shards 2957 threshold =0.15 breach: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1425, active_shards: 1425, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2957, delayed_unassigned_shar [17:23:15] 
umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.51939753537197 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:15] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1084 is CRITICAL: CRITICAL - elasticsearch inactive shards 2957 threshold =0.15 breach: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1425, active_shards: 1425, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2957, delayed_unassigned_shar [17:23:15] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.51939753537197 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:16] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1081 is CRITICAL: CRITICAL - elasticsearch inactive shards 2957 threshold =0.15 breach: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1425, active_shards: 1425, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2957, delayed_unassigned_shar [17:23:16] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.51939753537197 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:17] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1088 is CRITICAL: CRITICAL - elasticsearch inactive shards 2957 threshold =0.15 breach: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1425, active_shards: 1425, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2957, delayed_unassigned_shar 
[17:23:17] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.51939753537197 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:18] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1096 is CRITICAL: CRITICAL - elasticsearch inactive shards 2957 threshold =0.15 breach: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1425, active_shards: 1425, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2957, delayed_unassigned_shar [17:23:18] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.51939753537197 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:19] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1121 is CRITICAL: CRITICAL - elasticsearch inactive shards 2957 threshold =0.15 breach: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1425, active_shards: 1425, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2957, delayed_unassigned_shar [17:23:19] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.51939753537197 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:20] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1094 is CRITICAL: CRITICAL - elasticsearch inactive shards 2957 threshold =0.15 breach: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1425, active_shards: 1425, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2957, 
delayed_unassigned_shar [17:23:21] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.51939753537197 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:21] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1074 is CRITICAL: CRITICAL - elasticsearch inactive shards 2957 threshold =0.15 breach: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1425, active_shards: 1425, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2957, delayed_unassigned_shar [17:23:21] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.51939753537197 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:22] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1102 is CRITICAL: CRITICAL - elasticsearch inactive shards 2957 threshold =0.15 breach: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1425, active_shards: 1425, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2957, delayed_unassigned_shar [17:23:22] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.51939753537197 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:23] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1103 is CRITICAL: CRITICAL - elasticsearch inactive shards 2957 threshold =0.15 breach: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1425, active_shards: 1425, relocating_shards: 0, initializing_shards: 0, 
unassigned_shards: 2957, delayed_unassigned_shar [17:23:23] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.51939753537197 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:24] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1090 is CRITICAL: CRITICAL - elasticsearch inactive shards 2956 threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 1405, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2956, delayed_unassigned_s [17:23:24] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21738133455629 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:25] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1117 is CRITICAL: CRITICAL - elasticsearch inactive shards 2956 threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 1405, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2956, delayed_unassigned_s [17:23:25] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21738133455629 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:26] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 2956 threshold =0.2 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 1405, relocating_shards: 
0, initializing_shards: 0, unassigned_shards: 2956, delayed_unas [17:23:28] hards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21738133455629 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:28] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1111 is CRITICAL: CRITICAL - elasticsearch inactive shards 2956 threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 1405, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2956, delayed_unassigned_s [17:23:28] (03CR) 10CDanis: [C:03+1] cache: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [17:23:29] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21738133455629 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:29] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1113 is CRITICAL: CRITICAL - elasticsearch inactive shards 2956 threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 1405, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2956, delayed_unassigned_s [17:23:30] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21738133455629 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:30] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1075 is CRITICAL: CRITICAL - elasticsearch inactive shards 2956 threshold =0.15 breach: 
cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 1405, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2956, delayed_unassigned_s [17:23:30] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21738133455629 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:31] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1071 is CRITICAL: CRITICAL - elasticsearch inactive shards 2956 threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 1405, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2956, delayed_unassigned_s [17:23:32] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21738133455629 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:32] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1114 is CRITICAL: CRITICAL - elasticsearch inactive shards 2956 threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 1405, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2956, delayed_unassigned_s [17:23:32] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21738133455629 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:33] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1110 is CRITICAL: CRITICAL - elasticsearch inactive shards 2956 
threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 1405, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2956, delayed_unassigned_s [17:23:33] FIRING: [2x] ProbeDown: Service sessionstore1006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:23:34] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21738133455629 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:34] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1124 is CRITICAL: CRITICAL - elasticsearch inactive shards 2956 threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 1405, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2956, delayed_unassigned_s [17:23:34] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21738133455629 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:23:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1116 is CRITICAL: CRITICAL - elasticsearch inactive shards 2956 threshold =0.15 breach: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 1405, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2956, delayed_unassigned_s [17:23:35] , number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 32.21738133455629 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:29:03] ^ brett : I'm not sure if this one will page. [17:29:36] oof [17:31:15] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [17:32:33] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1070 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 3776, relocating_shards: 0, initializing_shards: 69, unassigned_shards: 516, delayed_unassigned_shards: 0, number_of_pend [17:32:33] s: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 564, active_shards_percent_as_number: 86.58564549415271 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:33] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1107 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 3776, relocating_shards: 0, initializing_shards: 69, unassigned_shards: 516, delayed_unassigned_shards: 0, number_of_pend [17:32:33] s: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 575, active_shards_percent_as_number: 86.58564549415271 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:35] (03PS1) 10Dzahn: gerrit: avoid hardcoded hostnames, replace with hiera lookups [puppet] - 
10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) [17:32:45] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1101 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4049, relocating_shards: 0, initializing_shards: 59, unassigned_shards: 253, delayed_unassigned_shards: 0, number_of_pend [17:32:45] s: 2, number_of_in_flight_fetch: 108, task_max_waiting_in_queue_millis: 82, active_shards_percent_as_number: 92.84567759688144 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:45] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1069 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4049, relocating_shards: 0, initializing_shards: 59, unassigned_shards: 253, delayed_unassigned_shards: 0, number_of_pend [17:32:45] s: 2, number_of_in_flight_fetch: 108, task_max_waiting_in_queue_millis: 84, active_shards_percent_as_number: 92.84567759688144 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:45] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1087 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4049, relocating_shards: 0, initializing_shards: 59, unassigned_shards: 253, delayed_unassigned_shards: 0, number_of_pend [17:32:45] s: 2, number_of_in_flight_fetch: 108, task_max_waiting_in_queue_millis: 82, active_shards_percent_as_number: 92.84567759688144 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:46] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1078 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4049, relocating_shards: 0, initializing_shards: 59, unassigned_shards: 253, delayed_unassigned_shards: 0, number_of_pend [17:32:46] s: 2, number_of_in_flight_fetch: 270, task_max_waiting_in_queue_millis: 92, active_shards_percent_as_number: 92.84567759688144 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:47] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1097 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4049, relocating_shards: 0, initializing_shards: 59, unassigned_shards: 253, delayed_unassigned_shards: 0, number_of_pend [17:32:47] s: 2, number_of_in_flight_fetch: 162, task_max_waiting_in_queue_millis: 96, active_shards_percent_as_number: 92.84567759688144 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:49] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1083 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4092, relocating_shards: 0, initializing_shards: 62, unassigned_shards: 207, delayed_unassigned_shards: 0, number_of_pend [17:32:49] s: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 880, active_shards_percent_as_number: 93.83168997936254 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:49] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1095 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4092, relocating_shards: 0, initializing_shards: 62, unassigned_shards: 207, delayed_unassigned_shards: 0, number_of_pend [17:32:49] s: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 888, active_shards_percent_as_number: 93.83168997936254 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:51] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1085 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4113, relocating_shards: 0, initializing_shards: 57, unassigned_shards: 191, delayed_unassigned_shards: 0, number_of_pend [17:32:51] s: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 162, active_shards_percent_as_number: 94.31323091034166 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:51] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1080 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4113, relocating_shards: 0, initializing_shards: 57, unassigned_shards: 191, delayed_unassigned_shards: 0, number_of_pend [17:32:51] s: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 172, active_shards_percent_as_number: 94.31323091034166 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:52] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1077 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4113, relocating_shards: 0, initializing_shards: 57, unassigned_shards: 191, delayed_unassigned_shards: 0, number_of_pend [17:32:52] s: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 180, active_shards_percent_as_number: 94.31323091034166 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:53] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1073 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4113, relocating_shards: 0, initializing_shards: 57, unassigned_shards: 191, delayed_unassigned_shards: 0, number_of_pend [17:32:53] s: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 194, active_shards_percent_as_number: 94.31323091034166 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:54] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1091 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4113, relocating_shards: 0, initializing_shards: 57, unassigned_shards: 191, delayed_unassigned_shards: 0, number_of_pend [17:32:54] s: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 196, active_shards_percent_as_number: 94.31323091034166 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:55] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1118 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4113, relocating_shards: 0, initializing_shards: 57, unassigned_shards: 191, delayed_unassigned_shards: 0, number_of_pend [17:32:56] s: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 221, active_shards_percent_as_number: 94.31323091034166 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:56] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1119 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4113, relocating_shards: 0, initializing_shards: 57, unassigned_shards: 191, delayed_unassigned_shards: 0, number_of_pend [17:32:57] s: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 230, active_shards_percent_as_number: 94.31323091034166 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:57] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1109 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4113, relocating_shards: 0, initializing_shards: 57, unassigned_shards: 191, delayed_unassigned_shards: 0, number_of_pend [17:32:58] s: 5, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 233, active_shards_percent_as_number: 94.31323091034166 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:58] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1108 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4113, relocating_shards: 0, initializing_shards: 57, unassigned_shards: 191, delayed_unassigned_shards: 0, number_of_pend [17:32:59] s: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 243, active_shards_percent_as_number: 94.31323091034166 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:32:59] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1123 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4113, relocating_shards: 0, initializing_shards: 57, unassigned_shards: 191, delayed_unassigned_shards: 0, number_of_pend [17:32:59] s: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 245, active_shards_percent_as_number: 94.31323091034166 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:00] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1112 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4113, relocating_shards: 0, initializing_shards: 68, unassigned_shards: 180, delayed_unassigned_shards: 0, number_of_pend [17:33:00] s: 8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 270, active_shards_percent_as_number: 94.31323091034166 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:01] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1068 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4180, relocating_shards: 0, initializing_shards: 59, unassigned_shards: 122, delayed_unassigned_shards: 0, number_of_pend [17:33:01] s: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 405, active_shards_percent_as_number: 95.84957578537033 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:02] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1089 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4180, relocating_shards: 0, initializing_shards: 59, unassigned_shards: 122, delayed_unassigned_shards: 0, number_of_pend [17:33:02] s: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 407, active_shards_percent_as_number: 95.84957578537033 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:03] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1079 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4180, relocating_shards: 0, initializing_shards: 59, unassigned_shards: 122, delayed_unassigned_shards: 0, number_of_pend [17:33:03] s: 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 406, active_shards_percent_as_number: 95.84957578537033 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:04] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1115 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4194, relocating_shards: 0, initializing_shards: 56, unassigned_shards: 111, delayed_unassigned_shards: 0, number_of_pend [17:33:04] s: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 145, active_shards_percent_as_number: 96.17060307268976 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:05] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1072 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4207, relocating_shards: 0, initializing_shards: 57, unassigned_shards: 97, delayed_unassigned_shards: 0, number_of_pendi [17:33:07] : 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 222, active_shards_percent_as_number: 96.46869983948636 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:07] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1093 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4207, relocating_shards: 0, initializing_shards: 57, unassigned_shards: 97, delayed_unassigned_shards: 0, number_of_pendi [17:33:07] : 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 223, active_shards_percent_as_number: 96.46869983948636 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:07] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1125 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4207, relocating_shards: 0, initializing_shards: 57, unassigned_shards: 97, delayed_unassigned_shards: 0, number_of_pendi [17:33:07] : 3, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 238, active_shards_percent_as_number: 96.46869983948636 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:13] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1092 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4235, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:13] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.1107544141252 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:15] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1082 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4235, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:15] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.1107544141252 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:15] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1098 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4235, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:15] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.1107544141252 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:15] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1084 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4235, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:15] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.1107544141252 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:16] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1096 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4235, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:16] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.1107544141252 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:17] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1081 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4235, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:17] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.1107544141252 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:18] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1094 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4235, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:18] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.1107544141252 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1074 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4235, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:19] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.1107544141252 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:20] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1102 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4235, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:20] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.1107544141252 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:21] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1103 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4235, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:21] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.1107544141252 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:22] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1088 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4235, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:22] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.1107544141252 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:23] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1120 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4235, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:23] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.1107544141252 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:24] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1121 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4235, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:24] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.1107544141252 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:25] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1090 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4235, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:25] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.1107544141252 
https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:26] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1117 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4237, relocating_shards: 0, initializing_shards: 50, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:26] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.15661545517084 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:28] RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4237, relocating_shards: 0, initializing_shards: 50, unassigned_shards: 74, delayed_unassigned_shards: 0, number [17:33:29] ing_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.15661545517084 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:29] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1124 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4238, relocating_shards: 0, initializing_shards: 49, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:29] FIRING: SystemdUnitFailed: opensearch-disable-readahead-production-search-eqiad.service on cirrussearch1100:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:33:30] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.17954597569364 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:30] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1116 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4238, relocating_shards: 0, initializing_shards: 49, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:30] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.17954597569364 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:31] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1071 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4238, relocating_shards: 0, initializing_shards: 49, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:31] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.17954597569364 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:32] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1075 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, 
active_shards: 4238, relocating_shards: 0, initializing_shards: 49, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:32] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.17954597569364 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:33] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1110 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4238, relocating_shards: 0, initializing_shards: 49, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:33] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.17954597569364 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:34] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1114 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4238, relocating_shards: 0, initializing_shards: 49, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:35] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.17954597569364 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:35] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1113 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4238, relocating_shards: 0, 
initializing_shards: 49, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:35] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.17954597569364 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:33:36] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1111 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4238, relocating_shards: 0, initializing_shards: 49, unassigned_shards: 74, delayed_unassigned_shards: 0, number_of_pendi [17:33:36] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.17954597569364 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:34:23] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1086 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1405, active_shards: 4264, relocating_shards: 0, initializing_shards: 43, unassigned_shards: 54, delayed_unassigned_shards: 0, number_of_pendi [17:34:23] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 97.77573950928686 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:34:43] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cirrussearch1100 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:35:50] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore1006.eqiad.wmnet with OS bullseye [17:36:01] 06SRE-OnFire, 
10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10980234 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore1006.eqiad... [17:36:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1100-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:38:29] FIRING: [2x] SystemdUnitFailed: opensearch-disable-readahead-production-search-eqiad.service on cirrussearch1100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:38:41] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1076 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4322, relocating_shards: 0, initializing_shards: 25, unassigned_shards: 14, delayed_unassigned_shards: 0, number_of_pendi [17:38:41] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.10570969961017 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:38:41] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1099 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, 
active_primary_shards: 1405, active_shards: 4322, relocating_shards: 0, initializing_shards: 25, unassigned_shards: 14, delayed_unassigned_shards: 0, number_of_pendi [17:38:41] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.10570969961017 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:38:51] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1100 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1405, active_shards: 4324, relocating_shards: 0, initializing_shards: 23, unassigned_shards: 14, delayed_unassigned_shards: 0, number_of_pendi [17:38:51] : 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.15157074065581 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:40:10] !log bking@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=search*,name=eqiad [17:40:18] !log bking@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=search-psi*,name=eqiad [17:40:25] !log bking@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=search-omega*,name=eqiad [17:45:29] !log [start] rolling upgrade of haproxy on A:dnsbox to 2.6.12-1+deb12u2 [17:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1100-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:52:20] (03PS2) 10Dzahn: gerrit: avoid hardcoded 
hostnames, replace with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) [17:52:45] (03CR) 10CI reject: [V:04-1] gerrit: avoid hardcoded hostnames, replace with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:53:13] (03PS1) 10Zabe: Apply conditions to correct column [extensions/CategoryTree] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166886 (https://phabricator.wikimedia.org/T398823) [17:53:29] FIRING: [2x] SystemdUnitFailed: opensearch-disable-readahead-production-search-eqiad.service on cirrussearch1100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:43] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cirrussearch1100 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:55:14] (03CR) 10Dzahn: gerrit: avoid hardcoded hostnames, replace with hiera lookups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:55:54] (03PS3) 10Dzahn: gerrit: avoid hardcoded hostnames, replace with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) [17:57:33] jouncebot: nowandnext [17:57:33] For the next 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T1700) [17:57:33] In 2 hour(s) and 2 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T2000) [17:57:56] FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - 
https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:57:59] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [17:57:59] !log ladsgroup@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitarium_restart (exit_code=99) [17:58:09] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.sanitarium_restart [17:58:24] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [17:58:27] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [17:58:34] !log bking@cumin1002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [17:58:37] (03CR) 10Zabe: [C:03+2] Apply conditions to correct column [extensions/CategoryTree] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166886 (https://phabricator.wikimedia.org/T398823) (owner: 10Zabe) [17:59:12] (03PS4) 10Dzahn: gerrit: avoid hardcoded hostnames, replace with hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) [18:00:00] (03Merged) 10jenkins-bot: Apply conditions to correct column [extensions/CategoryTree] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166886 (https://phabricator.wikimedia.org/T398823) (owner: 10Zabe) [18:00:54] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1166886|Apply conditions to correct column (T398823)]] [18:00:57] T398823: Expectation (readQueryTime <= 5) by MediaWiki\Api\ApiMain::setRequestExpectations not met (actual: {actualSeconds}) in trx #{trxId}:{query} - https://phabricator.wikimedia.org/T398823 [18:03:06] !log zabe@deploy1003 zabe: Backport for 
[[gerrit:1166886|Apply conditions to correct column (T398823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:03:29] RESOLVED: [2x] SystemdUnitFailed: opensearch-disable-readahead-production-search-eqiad.service on cirrussearch1100:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:04:50] !log [end] rolling upgrade of haproxy on A:dnsbox to 2.6.12-1+deb12u2 [18:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:56] !log [end] rolling upgrade of haproxy on A:dnsbox to 2.6.12-1+deb12u2 [18:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:25] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Unresponsive management for gerrit2003.mgmt:22 - https://phabricator.wikimedia.org/T398544#10980283 (10Dzahn) [18:06:40] !log zabe@deploy1003 zabe: Continuing with sync [18:08:39] !log sukhe@dns1004 START - running authdns-update [18:09:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.sanitarium_restart (exit_code=0) [18:09:23] !log sukhe@dns1004 END - running authdns-update [18:10:35] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Unresponsive management for gerrit2003.mgmt:22 - https://phabricator.wikimedia.org/T398544#10980287 (10Dzahn) https://phabricator.wikimedia.org/search/query/13biMOGEdWOT/ looks like this is a general issue with many or all management interfaces in co...
[18:10:37] !log bootstrapping Cassandra/sessionstore1006-a — T391544 [18:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:40] T391544: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544 [18:11:42] (03CR) 10Eevans: [C:03+2] sessionstore: preseed eqiad servers for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1165019 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [18:12:08] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166886|Apply conditions to correct column (T398823)]] (duration: 11m 14s) [18:12:12] T398823: Expectation (readQueryTime <= 5) by MediaWiki\Api\ApiMain::setRequestExpectations not met (actual: {actualSeconds}) in trx #{trxId}:{query} - https://phabricator.wikimedia.org/T398823 [18:13:33] FIRING: [2x] ProbeDown: Service sessionstore1006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:16:22] (03CR) 10Dzahn: "manual rebase done and simplified. This should come first now before other related changes." [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [18:19:29] (03CR) 10Dzahn: "meanwhile we have "replica-a" and "replica-b" in replication config and that is merged again now."
[puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [18:20:08] (03CR) 10Dzahn: [V:03+1 C:03+1] "no changes in compiler https://puppet-compiler.wmflabs.org/output/1129920/6181/" [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [18:22:44] (03PS2) 10Eevans: adjust sessionstore disk utilization for JBOD [alerts] - 10https://gerrit.wikimedia.org/r/1163007 (https://phabricator.wikimedia.org/T391544) [18:23:15] (03CR) 10Eevans: "Expression tested using: https://grafana.wikimedia.org/goto/CFjIlYsNg?orgId=1 (if that helps)" [alerts] - 10https://gerrit.wikimedia.org/r/1163007 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [18:23:48] (03PS1) 10Zabe: Revert^2 "Set categorylinks to read new in medium wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166890 (https://phabricator.wikimedia.org/T397912) [18:29:48] (03CR) 10Zabe: [C:03+2] Revert^2 "Set categorylinks to read new in medium wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166890 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [18:30:41] (03Merged) 10jenkins-bot: Revert^2 "Set categorylinks to read new in medium wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166890 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [18:31:03] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1166890|Revert^2 "Set categorylinks to read new in medium wikis" (T397912)]] [18:31:05] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [18:32:56] !log zabe@deploy1003 zabe: Backport for [[gerrit:1166890|Revert^2 "Set categorylinks to read new in medium wikis" (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:35:32] !log zabe@deploy1003 zabe: Continuing with sync [18:37:49] (03PS1) 10Bvibber: Fix for validation error display in transformed chart data [extensions/Chart] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166895 (https://phabricator.wikimedia.org/T398597) [18:39:36] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1179.eqiad.wmnet with OS bullseye [18:39:49] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10980387 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1179.eqiad.wmnet with OS bullseye [18:40:57] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166890|Revert^2 "Set categorylinks to read new in medium wikis" (T397912)]] (duration: 09m 54s) [18:41:00] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [18:42:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166895 (https://phabricator.wikimedia.org/T398597) (owner: 10Bvibber) [18:42:10] 10ops-codfw, 06DC-Ops: msw-d3-codfw & msw-d8-codfw offline - https://phabricator.wikimedia.org/T398858 (10RobH) 03NEW p:05Triage→03High [18:42:32] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T398843#10980407 (10Dzahn) [18:42:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980408 (10Dzahn) [18:42:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T398842#10980412 (10Dzahn) [18:42:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops:
Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980413 (10Dzahn) [18:42:46] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2220.mgmt:22 - https://phabricator.wikimedia.org/T398587#10980414 (10Dzahn) [18:42:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980415 (10Dzahn) [18:42:55] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2195.mgmt:22 - https://phabricator.wikimedia.org/T398586#10980416 (10Dzahn) [18:42:59] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980417 (10Dzahn) [18:43:07] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2174.mgmt:22 - https://phabricator.wikimedia.org/T398585#10980418 (10Dzahn) [18:43:11] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980419 (10Dzahn) [18:43:19] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2198.mgmt:22 - https://phabricator.wikimedia.org/T398584#10980420 (10Dzahn) [18:43:23] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980421 (10Dzahn) [18:43:31] 10ops-codfw, 06DC-Ops: msw-d3-codfw & msw-d8-codfw offline - https://phabricator.wikimedia.org/T398858#10980423 (10RobH) [18:43:33] RESOLVED: ProbeDown: Service sessionstore1006-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore1006-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:43:35] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Unresponsive management for gerrit2003.mgmt:22 - https://phabricator.wikimedia.org/T398544#10980422 (10RobH) [18:43:39] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for es2048.mgmt:22 - https://phabricator.wikimedia.org/T398583#10980424 (10Dzahn) [18:43:43] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980425 (10Dzahn) [18:44:41] 10ops-codfw, 06DC-Ops: msw-d3-codfw & msw-d8-codfw offline - https://phabricator.wikimedia.org/T398858#10980427 (10RobH) 05Open→03Invalid dupe of https://phabricator.wikimedia.org/T398598 [18:50:58] (03Merged) 10jenkins-bot: Fix for validation error display in transformed chart data [extensions/Chart] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166895 (https://phabricator.wikimedia.org/T398597) (owner: 10Bvibber) [18:51:12] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1166895|Fix for validation error display in transformed chart data (T398597)]] [18:51:15] T398597: Transformed .chart pages crash when the underlying .tab page contains null values - https://phabricator.wikimedia.org/T398597 [18:53:18] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1166895|Fix for validation error display in transformed chart data (T398597)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:53:55] confirmed good [18:53:57] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1179.eqiad.wmnet with OS bullseye [18:54:03] !log bvibber@deploy1003 bvibber: Continuing with sync [18:54:05] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10980516 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1179.eqiad.wmnet with OS bullseye executed with... [18:55:41] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [18:55:44] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [18:55:49] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1179.eqiad.wmnet with OS bullseye [18:55:57] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10980536 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1179.eqiad.wmnet with OS bullseye [18:58:31] !log sukhe@cp7006:/var/run/confd-template$ sudo rm _etc_haproxy_conf.d_tls.cfg.err [18:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:53] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166895|Fix for validation error display in transformed chart data (T398597)]] (duration: 08m 40s) [18:59:55] T398597: Transformed .chart pages crash when the underlying .tab page contains null values - https://phabricator.wikimedia.org/T398597 [19:02:41] RESOLVED: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - 
https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:10:18] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1237 is not booting up - https://phabricator.wikimedia.org/T398794#10980567 (10VRiley-WMF) 05Open→03Resolved I'm going to go ahead and close this for the time being. If there is anything else you'd like us to check, let us know! Thanks! [19:10:41] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1179.eqiad.wmnet with reason: host reimage [19:12:08] (03CR) 10Andriy.v: "So the solution is to remove temporary-account-viewer from default and grant it to all sysop groups except ukwiki sysops, or is there some" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer) [19:13:07] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T398842#10980599 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Rebalanced power [19:14:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1179.eqiad.wmnet with reason: host reimage [19:15:34] (03CR) 10Dreamy Jazz: [C:04-1] "Maybe. Perhaps something in `CommonSettings.php` would be better that decides based on the current wiki?"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166499 (https://phabricator.wikimedia.org/T398738) (owner: 10Dreamrimmer) [19:16:11] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T398843#10980607 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Rebalanced power [19:28:15] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980634 (10Jhancock.wm) looks like part of the problem was a tripped breaker in D3. still investigating the rest and checking ser... [19:28:23] RECOVERY - Host ps1-d3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.04 ms [19:28:50] (03CR) 10Cwhite: [C:03+2] logstash: test filter_on_template_v2 on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1165586 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [19:29:03] RECOVERY - Host lsw1-d3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.79 ms [19:30:15] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: PSU issue on es2044 - https://phabricator.wikimedia.org/T398601#10980643 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm tripped breaker in D3. fixed.
T398598 [19:30:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:31:04] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:31:11] (03PS1) 10Zabe: Straight join collation table to make sure it is last [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166915 (https://phabricator.wikimedia.org/T398860) [19:31:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:31:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1179.eqiad.wmnet with OS bullseye [19:31:31] jouncebot: nowandnext [19:31:32] No deployments scheduled for the next 0 hour(s) and 28 minute(s) [19:31:32] In 0 hour(s) and 28 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T2000) [19:31:32] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Investigate dead an-worker host an-worker1176 - https://phabricator.wikimedia.org/T398613#10980650 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1179.eqiad.wmnet with OS bullseye completed: - a... 
[19:31:43] (03CR) 10Zabe: [C:03+2] "Testing on mwdebug if this works" [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166915 (https://phabricator.wikimedia.org/T398860) (owner: 10Zabe) [19:34:23] RECOVERY - Host ps1-d8-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.78 ms [19:34:31] RECOVERY - Host ssw1-d8-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.69 ms [19:34:36] zabe: for next time, you can just do vim on mwdebug1002 xD you need to do something like this: "sudo -u mwdeploy vim /srv/mediawiki/php-1.45.0-wmf.5/includes/libs/filebackend/FileBackendMultiWrite.php" [19:34:45] RECOVERY - Host lsw1-d8-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.92 ms [19:35:08] Amir1: no, the physical mwdebug hosts got decommissioned .... [19:35:23] for later, it should be with mw-experimental: https://wikitech.wikimedia.org/wiki/Mw-experimental [19:35:39] Ah [19:35:43] TIL [19:35:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:37:05] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2220.mgmt:22 - https://phabricator.wikimedia.org/T398587#10980668 (10Jhancock.wm) rebooted mgmt switch in D8. pings. [19:37:08] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2220.mgmt:22 - https://phabricator.wikimedia.org/T398587#10980669 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [19:37:30] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2195.mgmt:22 - https://phabricator.wikimedia.org/T398586#10980673 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch in D8. pings.
[19:38:00] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2174.mgmt:22 - https://phabricator.wikimedia.org/T398585#10980678 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch in D8. pings. [19:38:32] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2198.mgmt:22 - https://phabricator.wikimedia.org/T398584#10980689 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch in D8. pings. [19:38:33] !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [19:39:14] !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [19:39:28] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for es2048.mgmt:22 - https://phabricator.wikimedia.org/T398583#10980695 (10Jhancock.wm) reset the tripped breaker in D3. pings. [19:39:33] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for es2048.mgmt:22 - https://phabricator.wikimedia.org/T398583#10980698 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [19:40:09] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2186.mgmt:22 - https://phabricator.wikimedia.org/T398582#10980703 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reset the tripped breaker in D3. pings [19:40:31] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for wikikube-worker2221.mgmt:22 - https://phabricator.wikimedia.org/T398581#10980708 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reset tripped breaker in D3. pings. [19:40:45] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for db2219.mgmt:22 - https://phabricator.wikimedia.org/T398580#10980725 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch in D8. pings. [19:41:15] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for thanos-be2005.mgmt:22 - https://phabricator.wikimedia.org/T398579#10980730 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reset tripped breaker in D3. pings.
[19:41:33] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for puppetserver2004.mgmt:22 - https://phabricator.wikimedia.org/T398578#10980735 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm rebooted mgmt switch in D8. pings. [19:41:39] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1189 - https://phabricator.wikimedia.org/T398773#10980738 (10Jclark-ctr) ticket declined by Dell new one is SR212469803 [19:42:58] (03CR) 10Andrew Bogott: [C:03+2] Openstack common/servicetoken.erb: remove a misleading comment [puppet] - 10https://gerrit.wikimedia.org/r/1143611 (owner: 10Andrew Bogott) [19:43:46] (03CR) 10Andrew Bogott: [C:03+2] partman_early_command: don't wipe out lvm for cloudcephosd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1166852 (https://phabricator.wikimedia.org/T306820) (owner: 10Andrew Bogott) [19:44:17] (03CR) 10CI reject: [V:04-1] Straight join collation table to make sure it is last [core] (wmf/1.45.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1166915 (https://phabricator.wikimedia.org/T398860) (owner: 10Zabe) [19:44:38] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10980766 (10Jhancock.wm) reset the tripped breaker in D3. On the secondary switch. No indication of a similar issue in D8. possi...
[19:47:32] prior: https://performance.wikimedia.org/excimer/profile/a36456a485e04a37 [19:47:37] later: https://performance.wikimedia.org/excimer/profile/34fa59f1aa285929 [19:49:51] actual later: https://performance.wikimedia.org/excimer/profile/f9c463ba2e220e74 [19:49:59] The first one apparently used some cached result [19:50:59] Amir1: I do not really know how to properly test this besides taking a look at some examples which are currently slow [19:51:43] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm [19:53:27] zabe: that's good enough IMHO [19:53:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166855 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [19:54:27] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:55:01] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:55:49] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:59:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of bast6003.wikimedia.org to drbd [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T2000). [20:00:05] ebernhardson and Krinkle: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] \o [20:00:20] i suppose can deploy my own [20:00:39] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:00:51] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:01:17] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54224 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:01:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163415 (https://phabricator.wikimedia.org/T397732) (owner: 10Ebernhardson) [20:02:32] (03Merged) 10jenkins-bot: cirrus: Start AB test of completion suggester fuzziness [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163415 (https://phabricator.wikimedia.org/T397732) (owner: 10Ebernhardson) [20:02:47] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1163415|cirrus: Start AB test of completion suggester fuzziness (T397732)]] [20:02:50] T397732: Run a test evaluating fuzziness of completion suggester - https://phabricator.wikimedia.org/T397732 [20:03:57] o/ [20:05:15] !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:1163415|cirrus: Start AB test of completion suggester fuzziness (T397732)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:07:38] (03PS1) 10Gmodena: dse: mw-content-history: version bump image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1166921 (https://phabricator.wikimedia.org/T347282) [20:07:40] !log ebernhardson@deploy1003 ebernhardson: Continuing with sync [20:08:00] (03PS1) 10Krinkle: beta: Move all non-wikipedia canonical to beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166922 (https://phabricator.wikimedia.org/T289318) [20:08:39] (03CR) 10Dreamy Jazz: [C:03+1] temp accounts: Separate digits in user names with hyphens [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166791 (https://phabricator.wikimedia.org/T381845) (owner: 10Tchanders) [20:08:47] (03CR) 10CI reject: [V:04-1] beta: Move all non-wikipedia canonical to beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1166922 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [20:08:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10980844 (10BCornwall) Sorry for the delay; I had to take a few unexpected days off but will get back to this shortly! [20:09:53] (03PS1) 10Gmodena: services: mw-content-history: version bump image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1166923 (https://phabricator.wikimedia.org/T347282) [20:10:05] !log bking@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [20:10:08] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227 [20:11:16] (03PS2) 10Gmodena: services: mw-page-content-change-enrich: version bump image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1166923 (https://phabricator.wikimedia.org/T347282) [20:11:42] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227 [20:12:37] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage [20:13:15] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163415|cirrus: Start AB test of completion suggester fuzziness (T397732)]] (duration: 10m 28s) [20:13:17] T397732: Run a test evaluating fuzziness of completion suggester - https://phabricator.wikimedia.org/T397732 [20:16:23] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage [20:16:27] Krinkle: mine should be done, spiderpig is available [20:16:38] ack, using scap. 
[20:16:57] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1166855 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[20:17:46] (Merged) jenkins-bot: interwiki-labs.php: Regenerate interwiki map (switch to beta.wmcloud.org) [mediawiki-config] - https://gerrit.wikimedia.org/r/1166855 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[20:19:24] PROBLEM - Host bast6003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:19:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of bast6003.wikimedia.org to drbd
[20:20:16] RECOVERY - Host bast6003 is UP: PING OK - Packet loss = 0%, RTA = 87.79 ms
[20:33:28] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm
[20:36:43] (CR) TChin: [C:+1] services: mw-page-content-change-enrich: version bump image. [deployment-charts] - https://gerrit.wikimedia.org/r/1166923 (https://phabricator.wikimedia.org/T347282) (owner: Gmodena)
[20:36:51] (CR) TChin: [C:+1] dse: mw-content-history: version bump image. [deployment-charts] - https://gerrit.wikimedia.org/r/1166921 (https://phabricator.wikimedia.org/T347282) (owner: Gmodena)
[20:38:43] (PS2) Krinkle: beta: Move all non-wikipedia canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1166922 (https://phabricator.wikimedia.org/T289318)
[20:40:25] (PS3) Krinkle: beta: Move all non-wikipedia canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1166922 (https://phabricator.wikimedia.org/T289318)
[20:40:57] (PS4) Krinkle: beta: Move all non-wikipedia canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1166922 (https://phabricator.wikimedia.org/T289318)
[20:41:51] (PS5) Krinkle: beta: Move all non-wikipedia canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1166922 (https://phabricator.wikimedia.org/T289318)
[20:42:12] (PS2) Zabe: Straight join collation table to make sure it is last [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166915 (https://phabricator.wikimedia.org/T398860)
[20:46:24] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1166922 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[20:48:58] (Merged) jenkins-bot: beta: Move all non-wikipedia canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1166922 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[20:55:28] (CR) Zabe: [C:+2] Straight join collation table to make sure it is last [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166915 (https://phabricator.wikimedia.org/T398860) (owner: Zabe)
[20:56:40] FIRING: [2x] SystemdUnitFailed: docker-registry.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:59:45] (Merged) jenkins-bot: Straight join collation table to make sure it is last [core] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166915 (https://phabricator.wikimedia.org/T398860) (owner: Zabe)
[21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T2100).
[21:00:47] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1166915|Straight join collation table to make sure it is last (T398860)]]
[21:00:51] T398860: Expectation (readQueryTime <= 5) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: {actualSeconds}) in trx #{trxId}:{query} - https://phabricator.wikimedia.org/T398860
[21:00:58] (PS4) Krinkle: deployment-prep: Add Apache vhost aliases for *.beta.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318)
[21:01:35] !log ladsgroup@cumin1002 START - Cookbook sre.wikireplicas.update-views
[21:02:48] !log zabe@deploy1003 zabe: Backport for [[gerrit:1166915|Straight join collation table to make sure it is last (T398860)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:05:12] (PS5) Krinkle: deployment-prep: Add Apache vhost aliases for *.beta.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318)
[21:05:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0)
[21:05:54] !log zabe@deploy1003 zabe: Continuing with sync
[21:06:43] (PS6) Krinkle: deployment-prep: Add Apache vhost aliases for *.beta.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318)
[21:11:21] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1166915|Straight join collation table to make sure it is last (T398860)]] (duration: 10m 33s)
[21:11:24] T398860: Expectation (readQueryTime <= 5) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: {actualSeconds}) in trx #{trxId}:{query} - https://phabricator.wikimedia.org/T398860
[21:12:48] (PS3) Krinkle: beta: Document beta-specific "w.beta.wmcloud.org" handling [puppet] - https://gerrit.wikimedia.org/r/1160441 (https://phabricator.wikimedia.org/T396012)
[21:23:37] preparing to run scap for a security deploy
[21:24:04] (CR) Krinkle: [C:-1] "I've reverted beta puppetserver back to PS3. For some reason, the latest PS breaks the www portals." [puppet] - https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[21:28:59] (CR) Krinkle: [C:-1] "In particular, the last PS causes https://www.wikipedia.beta.wmflabs.org to redirect to Incubator/Wp/Www instead of serving www portal." [puppet] - https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[21:36:47] !log Deployed security fix for T398636
[21:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:57] (CR) Cwhite: [C:+1] alerting_host: set puppet agent to 5m interval [puppet] - https://gerrit.wikimedia.org/r/1166846 (https://phabricator.wikimedia.org/T398444) (owner: Herron)
[21:49:16] preparing to run scap again for a security deploy
[21:52:59] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm
[21:54:13] (CR) Cwhite: [C:+2] logstash: deploy phatality 2.7.0.3 to production [puppet] - https://gerrit.wikimedia.org/r/1165579 (https://phabricator.wikimedia.org/T398305) (owner: Cwhite)
[21:58:27] !log Deployed security fix for T397577
[21:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:43] ops-eqiad, SRE, DBA, DC-Ops: db1237 is not booting up - https://phabricator.wikimedia.org/T398794#10981216 (Jclark-ctr) @VRiley-WMF Did you open up a ticket with Dell? Same error that we have been seeing on R450's ` The System Configuration Check operation resulted in the following issue: Co...
[22:05:17] ops-eqiad, SRE, DBA, DC-Ops: db1237 is not booting up - https://phabricator.wikimedia.org/T398794#10981234 (Jclark-ctr) Also iDRAC firmware is outdated; most up to date is 7.20.30.50, server currently has iDRAC Firmware Version 7.00.00.00. BIOS is also outdated; most up to date is 1.17.2, 1....
[22:11:15] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:13:15] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage
[22:16:26] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:16:58] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage
[22:21:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:21:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:35:24] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm
[22:51:54] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:52:30] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:53:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:54:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:58:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:58:45] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster search_eqiad: activate new plugins packages - bking@cumin1002 - T397227
[22:58:49] T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[22:58:54] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:59:28] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250707T2300)
[23:03:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:04:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:32:08] ops-codfw, SRE, collaboration-services, DC-Ops: Unresponsive management for gerrit2003.mgmt:22 - https://phabricator.wikimedia.org/T398544#10981527 (Dzahn)
[23:32:09] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Codfw: management down to racks D3 and D8 (switch port down) - https://phabricator.wikimedia.org/T398598#10981528 (Dzahn)
[23:33:11] ops-codfw, SRE, collaboration-services, DC-Ops: Unresponsive management for gerrit2003.mgmt:22 - https://phabricator.wikimedia.org/T398544#10981532 (Dzahn) This is not specific to gerrit2003. Affects mgmt in 2 specific codfw racks. -> T398598
[23:38:09] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1166941
[23:38:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1166941 (owner: TrainBranchBot)
[23:38:30] (CR) Dzahn: [C:+1] "So.. these checks were actually different until recently. Different parameters because some sites were still running on legacy VMs and oth" [puppet] - https://gerrit.wikimedia.org/r/1161509 (owner: Filippo Giunchedi)
[23:41:08] (CR) Dzahn: [C:+1] "compiler says needs some rebasing" [puppet] - https://gerrit.wikimedia.org/r/1161509 (owner: Filippo Giunchedi)
[23:46:08] (PS3) Dzahn: microsites: refactor blackbox checks to use resource defaults [puppet] - https://gerrit.wikimedia.org/r/1161509 (owner: Filippo Giunchedi)
[23:46:33] (CR) Dzahn: "rebased on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1166073 due to https://phabricator.wikimedia.org/T398528" [puppet] - https://gerrit.wikimedia.org/r/1161509 (owner: Filippo Giunchedi)
[23:46:46] (CR) CI reject: [V:-1] microsites: refactor blackbox checks to use resource defaults [puppet] - https://gerrit.wikimedia.org/r/1161509 (owner: Filippo Giunchedi)
[23:48:03] (PS1) Bvibber: Support null values in data columns in transform output [extensions/JsonConfig] (wmf/1.45.0-wmf.8) - https://gerrit.wikimedia.org/r/1166942 (https://phabricator.wikimedia.org/T398597)
[23:48:56] (CR) Dzahn: "rebased on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1161509 top of this. this patch by godog will simplify the monitoring code" [puppet] - https://gerrit.wikimedia.org/r/1166073 (https://phabricator.wikimedia.org/T398528) (owner: Jelto)
[23:50:42] (CR) Dzahn: [V:-1] "CI issue: Unable to find fact file for: miscweb2003.codfw.wmnet ;[ is because we decom'ed these and moved the class to new hosts.. fixin" [puppet] - https://gerrit.wikimedia.org/r/1161509 (owner: Filippo Giunchedi)
[23:51:25] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1166941 (owner: TrainBranchBot)
[23:57:44] (PS4) Dzahn: microsites: refactor blackbox checks to use resource defaults [puppet] - https://gerrit.wikimedia.org/r/1161509 (owner: Filippo Giunchedi)