[00:02:55] jhancock@cumin1003 provision (PID 3053871) is awaiting input [00:03:30] FIRING: [8x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:04:12] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.4e [00:04:15] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.4f [00:04:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaMessages] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163403 (owner: 10Stang) [00:05:26] RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:07:13] jhancock@cumin1003 provision (PID 3056449) is awaiting input [00:08:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1163492 [00:08:28] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1163492 (owner: 10TrainBranchBot) [00:10:41] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2005-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:10:55] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2006-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:11:34] 
!log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2007-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:13:29] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [00:15:23] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2006-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:15:34] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.4f [00:15:36] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.50 [00:18:36] jhancock@cumin1003 provision (PID 3056821) is awaiting input [00:25:32] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2005-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:25:43] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2007-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:29:15] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1163492 (owner: 10TrainBranchBot) [00:29:24] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.50 [00:29:27] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.51 [00:29:40] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2005-dev.codfw.wmnet with OS bullseye [00:29:49] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install 
cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945465 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host cloudcephosd2005-dev.codfw.wmnet with OS... [00:42:41] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2006-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:42:46] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd2006-dev'] [00:42:55] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd2006-dev'] [00:43:19] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2006-dev.codfw.wmnet with OS bullseye [00:43:29] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945492 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host cloudcephosd2006-dev.codfw.wmnet with OS... [00:43:56] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2007-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [00:44:23] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [00:44:31] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945493 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host cloudcephosd2007-dev.codfw.wmnet with OS... 
[00:44:41] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.51 [00:44:44] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.52 [00:45:19] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2005-dev.codfw.wmnet with reason: host reimage [00:46:40] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/0a99d5a1b686396d5c351ea7dc4d928f57630c612633dd8fdbc18679486af8a0/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [00:48:45] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2005-dev.codfw.wmnet with reason: host reimage [00:59:24] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2006-dev.codfw.wmnet with reason: host reimage [01:00:04] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage [01:02:06] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.52 [01:02:09] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.53 [01:02:30] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2006-dev.codfw.wmnet with reason: host reimage [01:05:05] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage [01:06:41] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:12:14] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [01:15:18] jhancock@cumin1003 reimage (PID 3059175) is awaiting input [01:15:57] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.53 [01:16:00] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.54 [01:22:34] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [01:25:38] jhancock@cumin1003 reimage (PID 3059765) is awaiting input [01:27:50] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [01:29:47] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.54 [01:29:50] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.55 [01:30:54] jhancock@cumin1003 reimage (PID 3059700) is awaiting input [01:31:33] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [01:31:34] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2005-dev.codfw.wmnet with OS bullseye [01:31:36] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - 
jhancock@cumin1003" [01:31:36] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2006-dev.codfw.wmnet with OS bullseye [01:31:38] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [01:31:39] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [01:31:42] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945597 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host cloudcephosd2005-dev.codfw.wmnet with OS bul... [01:31:45] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945598 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host cloudcephosd2006-dev.codfw.wmnet with OS bul... [01:31:46] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945599 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host cloudcephosd2007-dev.codfw.wmnet with OS bul... [01:35:26] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945602 (10Jhancock.wm) 05Open→03Resolved [01:35:47] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945605 (10Jhancock.wm) @Andrew done! 
[01:37:06] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10945606 (10Jhancock.wm) @volans give it a shot on cp2044. if you have any issues with it, lmk [01:43:17] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.55 [01:43:20] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.56 [01:43:49] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [01:44:11] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [01:58:04] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.56 [01:58:07] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.57 [02:12:09] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.57 [02:12:12] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.58 [02:26:21] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3633 MB (3% inode=98%): /tmp 3633 MB (3% inode=98%): /var/tmp 3633 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [02:26:40] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.58 [02:26:43] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.59 [02:33:53] PROBLEM - Postgres Replication Lag on puppetdb2003 is 
CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 120211312 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:34:53] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 4256016 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:41:08] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.59 [02:41:11] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.5a [02:44:28] FIRING: [2x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:54:54] (03PS1) 10Krinkle: beta: Switch excimer-ui-url service from wmflabs.org to wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163502 (https://phabricator.wikimedia.org/T289318) [02:55:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.5a [02:55:39] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.5b [02:58:53] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 158335184 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:59:53] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 6059584 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:02:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and 
Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [03:07:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [03:09:00] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.5b [03:09:03] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.5c [03:23:12] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.5c [03:23:15] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.5d [03:36:13] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.5d [03:36:16] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.5e [03:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:46:21] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3625 MB (3% inode=98%): /tmp 3625 MB (3% inode=98%): /var/tmp 3625 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [03:51:57] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.5e [03:52:00] !log mvernon@cumin1002 START - Cookbook 
sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.5f [03:54:44] (03PS7) 10Scott French: P:etcd::tlsproxy: add support for PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) [03:54:44] (03PS4) 10Scott French: hieradata: pilot cfssl/pki for nginx on conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/1090583 (https://phabricator.wikimedia.org/T352245) [03:54:44] (03PS4) 10Scott French: hieradata: use cfssl/pki for nginx on all codfw configcluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1090585 (https://phabricator.wikimedia.org/T352245) [03:54:45] (03PS5) 10Scott French: hieradata: use cfssl/pki for nginx on all configcluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1090586 (https://phabricator.wikimedia.org/T352245) [03:57:15] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [03:57:26] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090583 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [04:03:30] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:06:14] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.5f [04:06:17] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.60 [04:08:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:13:29] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:13:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:20:44] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.60 [04:20:47] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.61 [04:34:22] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.61 [04:34:24] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.62 [04:49:23] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.62 [04:49:25] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.63 [05:05:17] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.63 [05:05:20] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.64 [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:18:30] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.64 [05:18:32] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.65 [05:19:44] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [05:33:12] (03Abandoned) 10Stang: Fix missing Chinese translation related to temporary accounts [extensions/WikimediaMessages] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163403 (owner: 10Stang) [05:33:57] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.65 [05:34:00] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.66 [05:47:32] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.66 [05:47:35] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.67 [05:54:28] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T0600) [06:02:32] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.67 [06:02:35] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.68 [06:06:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:12:45] 06SRE, 06Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937#10945822 (10Morale99) For anyone dealing with fiber connections or testing networks, using a good quality [[ https://www.firefold.com/collections/fibe... 
[06:16:28] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.68
[06:16:31] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.69
[06:21:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:30:37] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.69
[06:30:40] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6a
[06:42:43] (03PS1) 10Arnaudb: mailman: alert on out queue being too full [alerts] - 10https://gerrit.wikimedia.org/r/1163628 (https://phabricator.wikimedia.org/T397715)
[06:44:37] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.6a
[06:44:40] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6b
[06:51:42] (03CR) 10Jelto: "looks mostly good, two comments in-line" [cookbooks] - 10https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: 10Arnaudb)
[06:53:04] (03PS1) 10Muehlenhoff: Record extended contract date for rkhan [puppet] - 10https://gerrit.wikimedia.org/r/1163629
[06:54:48] (03CR) 10Muehlenhoff: [C:03+2] Record extended contract date for rkhan [puppet] - 10https://gerrit.wikimedia.org/r/1163629 (owner: 10Muehlenhoff)
[06:57:18] (03PS1) 10Samwilson: Revert "InitialiseSettings: Enable TemplateDiscovery on almost all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163630
[06:57:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163630 (owner: 10Samwilson)
[06:59:20] (03CR) 10Jelto: "Thanks for adding the alert! Two suggestions in line" [alerts] - 10https://gerrit.wikimedia.org/r/1163628 (https://phabricator.wikimedia.org/T397715) (owner: 10Arnaudb)
[06:59:42] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=1) Checking container DBs of wikipedia-commons-local-thumb.6b
[06:59:44] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6c
[07:00:04] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T0700). Please do the needful.
[07:00:04] suzannewoodWMDE2, isaranto, Kizule, and samwilson: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:13] o/
[07:00:18] I am here
[07:00:22] Same
[07:01:16] I also am here
[07:05:23] is anyone deploying? is it ok if I start my patch?
[07:08:56] isaranto: I'm not sure who's deploying today. Amir1, Urbanecm, or awight are any of you around?
[07:10:35] (03PS1) 10Kosta Harlan: Pass SecurityLogContext to logger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204)
[07:10:52] (03PS5) 10Arnaudb: gerrit: read-only plugin orchestration in failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440)
[07:12:45] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.6c
[07:12:48] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6d
[07:12:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204) (owner: 10Kosta Harlan)
[07:13:10] isaranto: are you deploying?
[07:13:35] no but I can start!
[07:14:01] sounds good to me
[07:14:12] starting!
[07:14:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by isaranto@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163405 (https://phabricator.wikimedia.org/T395824) (owner: 10Ilias Sarantopoulos)
[07:15:33] (03Merged) 10jenkins-bot: ores-extension: enable revertrisk filter in UI for third batch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163405 (https://phabricator.wikimedia.org/T395824) (owner: 10Ilias Sarantopoulos)
[07:16:12] !log isaranto@deploy1003 Started scap sync-world: Backport for [[gerrit:1163405|ores-extension: enable revertrisk filter in UI for third batch (T395824)]]
[07:16:17] T395824: [batch #3] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395824
[07:16:20] (03CR) 10Gergő Tisza: [C:03+1] Pass SecurityLogContext to logger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204) (owner: 10Kosta Harlan)
[07:17:35] (03PS1) 10Muehlenhoff: debmonitor_dev: Update bind address for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1163634 (https://phabricator.wikimedia.org/T397696)
[07:18:34] !log isaranto@deploy1003 isaranto: Backport for [[gerrit:1163405|ores-extension: enable revertrisk filter in UI for third batch (T395824)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:18:41] testing!
[07:19:26] (03PS1) 10Stevemunene: hdfs: set an-worker1176 to analytics-fex recipe [puppet] - 10https://gerrit.wikimedia.org/r/1163635 (https://phabricator.wikimedia.org/T390176)
[07:19:39] isaranto: are you able to do the other config patches in the window as well?
[07:19:40] (03CR) 10CI reject: [V:04-1] debmonitor_dev: Update bind address for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1163634 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [07:20:37] (03PS2) 10Muehlenhoff: debmonitor_dev: Update bind address for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1163634 (https://phabricator.wikimedia.org/T397696) [07:21:33] (03CR) 10Kosta Harlan: Activate feature to resolve wikibase link labels in pilot wiki changelists (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [07:22:23] suzannewoodWMDE2: I have a question for you on the config patch above ^ [07:22:32] ok! [07:23:06] !log isaranto@deploy1003 isaranto: Continuing with sync [07:23:35] (03CR) 10Majavah: [C:03+1] Clean up EventBus and jobs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163323 (https://phabricator.wikimedia.org/T397367) (owner: 10Ladsgroup) [07:23:42] sry I was QAing [07:23:51] (03Abandoned) 10Cathal Mooney: Netbox hosts: add netbox-dns reposync repo so it is available [puppet] - 10https://gerrit.wikimedia.org/r/1163382 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [07:23:58] (03PS1) 10Jcrespo: dbbackups: Enable temporarily read only backups for refresh [puppet] - 10https://gerrit.wikimedia.org/r/1163645 (https://phabricator.wikimedia.org/T387892) [07:24:33] (03CR) 10Suzanne Wood: Activate feature to resolve wikibase link labels in pilot wiki changelists (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [07:24:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163634 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [07:24:59] kostajh: I can deploy other patches as well, but I'll have to go in 30' [07:25:11] I'm taking a look at the other 
patches atm [07:25:28] (03PS2) 10Joely Rooke WMDE: Activate feature to resolve wikibase link labels in pilot wiki changelists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) [07:25:56] (03PS4) 10Arnaudb: mailman: alert on out queue being too full [alerts] - 10https://gerrit.wikimedia.org/r/1163628 (https://phabricator.wikimedia.org/T397715) [07:26:09] (03CR) 10Elukey: [C:03+1] Remove external cloud sync from Puppet 5 frontends [puppet] - 10https://gerrit.wikimedia.org/r/1163399 (owner: 10Muehlenhoff) [07:26:11] (03PS2) 10Jcrespo: dbbackups: Enable temporarily read only backups for refresh [puppet] - 10https://gerrit.wikimedia.org/r/1163645 (https://phabricator.wikimedia.org/T387892) [07:26:12] (03PS2) 10Kosta Harlan: Pass SecurityLogContext to logger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204) [07:26:19] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.6d [07:26:21] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6e [07:26:23] (03CR) 10Suzanne Wood: [C:03+1] Activate feature to resolve wikibase link labels in pilot wiki changelists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [07:26:52] isaranto: cool. 
I need a few more minutes on mine [07:26:58] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163368 (https://phabricator.wikimedia.org/T368744) (owner: 10Volans) [07:27:08] (03CR) 10Kosta Harlan: Activate feature to resolve wikibase link labels in pilot wiki changelists (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [07:27:09] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163645 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [07:27:10] tbh I'd prefer not to. the rest of the patches haven't been reviewed [07:27:47] ok, I don't mind doing them [07:28:05] thank you kostajh <3 [07:28:15] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1149.eqiad.wmnet [07:28:22] suzannewoodWMDE2: are you able to verify your change when it's deployed? [07:28:29] same question for samwilson [07:28:52] 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work, 13Patch-For-Review: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10945907 (10elukey) Done! The patch is ready to go in my opinion, thanks! [07:29:15] Thanks! : ) We've addressed your comment so 1163372 is ready. 
Yes we can verify when it's deployed [07:29:18] kostajh: yep, I can verify [07:29:19] (03PS2) 10Kosta Harlan: Pass SecurityLogContext to logger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204) [07:29:29] (03PS3) 10Jcrespo: dbbackups: Enable temporarily read only backups for refresh [puppet] - 10https://gerrit.wikimedia.org/r/1163645 (https://phabricator.wikimedia.org/T387892) [07:30:12] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163645 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [07:30:20] !log isaranto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163405|ores-extension: enable revertrisk filter in UI for third batch (T395824)]] (duration: 14m 08s) [07:30:26] T395824: [batch #3] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395824 [07:30:29] done! [07:30:45] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1149.eqiad.wmnet [07:31:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [07:31:21] (03CR) 10Muehlenhoff: "Looks great, one question/doubt inline" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [07:31:56] kostajh: you can go ahead [07:32:04] (03Merged) 10jenkins-bot: Activate feature to resolve wikibase link labels in pilot wiki changelists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [07:32:25] (03CR) 10Gergő Tisza: [C:03+1] Pass SecurityLogContext to logger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204) (owner: 10Kosta 
Harlan) [07:32:27] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1163372|Activate feature to resolve wikibase link labels in pilot wiki changelists (T388685)]] [07:32:35] T388685: Show labels for properties and items on Wikipedia watchlist summaries - https://phabricator.wikimedia.org/T388685 [07:33:30] (03CR) 10Jcrespo: [C:03+2] dbbackups: Enable temporarily read only backups for refresh [puppet] - 10https://gerrit.wikimedia.org/r/1163645 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [07:34:43] !log kharlan@deploy1003 joelyrookewmde, kharlan: Backport for [[gerrit:1163372|Activate feature to resolve wikibase link labels in pilot wiki changelists (T388685)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:36:06] suzannewoodWMDE2: please verify on mwdebug [07:36:27] It works! [07:37:21] (03PS1) 10Slyngshede: P:dns::auth::netbox Netbox DNS zones file sync [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) [07:38:03] !log kharlan@deploy1003 joelyrookewmde, kharlan: Continuing with sync [07:38:07] cool :) [07:38:23] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [07:39:57] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=1) Checking container DBs of wikipedia-commons-local-thumb.6e [07:40:00] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6f [07:40:56] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1175.eqiad.wmnet [07:42:28] (03CR) 10Jelto: [C:03+1] "lgtm, thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1163628 (https://phabricator.wikimedia.org/T397715) (owner: 10Arnaudb) [07:42:41] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1175.eqiad.wmnet [07:42:45] (03CR) 10Arnaudb: [C:03+2] mailman: alert on out queue being too full [alerts] - 10https://gerrit.wikimedia.org/r/1163628 (https://phabricator.wikimedia.org/T397715) (owner: 10Arnaudb) [07:44:00] (03Merged) 10jenkins-bot: mailman: alert on out queue being too full [alerts] - 10https://gerrit.wikimedia.org/r/1163628 (https://phabricator.wikimedia.org/T397715) (owner: 10Arnaudb) [07:44:12] (03CR) 10Jelto: "lgtm now, thank you" [cookbooks] - 10https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: 10Arnaudb) [07:45:09] (03PS1) 10Stevemunene: hdfs: readd group 9 and 10 hosts back to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1163691 (https://phabricator.wikimedia.org/T390176) [07:45:30] (03CR) 10Volans: kubernetes: add a new kubernetes section (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [07:45:30] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163372|Activate feature to resolve wikibase link labels in pilot wiki changelists (T388685)]] (duration: 13m 03s) [07:45:36] T388685: Show labels for properties and items on Wikipedia watchlist summaries - https://phabricator.wikimedia.org/T388685 [07:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:46:11] alright, on to samwilson's patch [07:46:42] Thanks! 
[07:46:44] Kizule: I'll sync yours at the same time as well [07:46:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163630 (owner: 10Samwilson) [07:46:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163365 (https://phabricator.wikimedia.org/T392363) (owner: 10Zoranzoki21) [07:47:41] Oh, I'm still here. [07:47:43] (03Merged) 10jenkins-bot: Revert "InitialiseSettings: Enable TemplateDiscovery on almost all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163630 (owner: 10Samwilson) [07:47:46] (03Merged) 10jenkins-bot: Enable block feature for AbuseFilter on all small Serbian wikiprojects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163365 (https://phabricator.wikimedia.org/T392363) (owner: 10Zoranzoki21) [07:48:04] (03PS1) 10Stevemunene: hdfs: set an-worker1176 to reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/1163692 (https://phabricator.wikimedia.org/T390176) [07:48:08] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1163630|Revert "InitialiseSettings: Enable TemplateDiscovery on almost all wikis"]], [[gerrit:1163365|Enable block feature for AbuseFilter on all small Serbian wikiprojects (T392363)]] [07:48:13] T392363: Enable block feature for AbuseFilter on all small Serbian wikiprojects - https://phabricator.wikimedia.org/T392363 [07:48:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:48:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard 
drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10945954 (10Stevemunene) `an-worker1175` had the drives in an UGood state `... [07:48:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10945955 (10Stevemunene) [07:48:59] suzannewoodWMDE2: I think we need to roll back your patch [07:49:56] suzannewoodWMDE2: https://logstash.wikimedia.org/goto/b55135916319237318f0f77abeed4093 [07:49:57] Ok, what's the problem? [07:50:03] I should have checked the logs during deploy, my fault. [07:50:25] !log kharlan@deploy1003 zoranzoki21, kharlan, samwilson: Backport for [[gerrit:1163630|Revert "InitialiseSettings: Enable TemplateDiscovery on almost all wikis"]], [[gerrit:1163365|Enable block feature for AbuseFilter on all small Serbian wikiprojects (T392363)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:50:42] (03PS1) 10Kosta Harlan: Revert "Activate feature to resolve wikibase link labels in pilot wiki changelists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163693 [07:50:43] (03PS1) 10Jcrespo: dbbackups: Disable read only backups and reenable regular rw es backups [puppet] - 10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892) [07:50:51] samwilson / Kizule please verify your changes [07:50:57] (03CR) 10Volans: [C:03+1] "Nice LGTM, would be nice to complete the test coverage. Not a blocker." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi) [07:50:57] kostajh: Mine is good to go. [07:51:07] +1 [07:51:16] (03CR) 10Jcrespo: [C:04-2] "Backups have not completed yet, wait." 
[puppet] - 10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [07:51:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163693 (owner: 10Kosta Harlan) [07:51:46] Oh yeah we see the error, thanks for reverting [07:52:42] (03CR) 10Cmelo: Release the CampaignEvents extension to all Wikipedias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) (owner: 10Cmelo) [07:52:49] (03CR) 10Cmelo: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) (owner: 10Cmelo) [07:52:56] samwilson: are we OK to proceed? [07:53:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:53:22] kostajh: yep! 
[07:53:33] !log kharlan@deploy1003 zoranzoki21, kharlan, samwilson: Continuing with sync [07:55:04] (03CR) 10Volans: [C:03+2] redfish: add support for iDRAC 10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162986 (https://phabricator.wikimedia.org/T392851) (owner: 10Volans) [07:55:22] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.6f [07:55:25] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.70 [07:58:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:58:30] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T0800) [08:00:45] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163630|Revert "InitialiseSettings: Enable TemplateDiscovery on almost all wikis"]], [[gerrit:1163365|Enable block feature for AbuseFilter on all small Serbian wikiprojects (T392363)]] (duration: 12m 37s) [08:00:51] T392363: Enable block feature for AbuseFilter on all small Serbian wikiprojects - https://phabricator.wikimedia.org/T392363 [08:01:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163693 (owner: 10Kosta Harlan) [08:01:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging 
- https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [08:01:58] (03Merged) 10jenkins-bot: Revert "Activate feature to resolve wikibase link labels in pilot wiki changelists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163693 (owner: 10Kosta Harlan) [08:02:06] expected? [08:02:12] !incidents [08:02:12] 6427 (UNACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet esams) [08:02:13] vgutierrez: yes, reverting a change [08:02:18] !ack 6427 [08:02:18] 6427 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet esams) [08:02:23] thx kostajh [08:02:25] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1163693|Revert "Activate feature to resolve wikibase link labels in pilot wiki changelists"]] [08:02:41] <_joe_> not expected but we know why [08:02:58] _joe_: expected as in "we know what's going on" :) [08:03:11] * hnowlan here [08:03:18] ah :) [08:03:31] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:09] (03Merged) 10jenkins-bot: redfish: add support for iDRAC 10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162986 (https://phabricator.wikimedia.org/T392851) (owner: 10Volans) [08:04:12] yeah, sorry. for the next time: how should I alert SRE that we know the cause and are in process of reverting a patch? 
[08:04:38] pinging us here or -sre should be enough [08:04:38] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1163693|Revert "Activate feature to resolve wikibase link labels in pilot wiki changelists"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:05:00] ack [08:05:04] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162908 (owner: 10PipelineBot) [08:05:22] I'm around if you need help (oncall) [08:05:28] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [08:05:29] !log kharlan@deploy1003 kharlan: Continuing with sync [08:05:31] (03CR) 10Volans: [C:03+2] Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [08:05:39] this should resolve shortly [08:05:49] k8s-willing [08:06:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [08:06:54] <_joe_> it's going down fast [08:06:57] <_joe_> heh [08:07:04] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162908 (owner: 10PipelineBot) [08:08:56] (03PS1) 10Jelto: cleanup prerm script update-alternatives command [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1163695 (https://phabricator.wikimedia.org/T387548) [08:08:58] I started a discussion in -developer-experience on Slack about including mediawiki-debug messages in the spiderpig.wikimedia.org UI [08:09:07] was that the same alert going twice or is there a difference? 
[08:09:38] jynus: different PoPs [08:10:09] there is another logspam issue fwiw with Extension:Cite (https://logstash.wikimedia.org/goto/21ee9b1086f65aa9d536247d2d159a5c) but not related to this deployment window [08:10:17] thanks, it wasn't clear on the msg to me [08:10:46] yeah.. we should include the site [08:10:58] (03PS1) 10Hashar: Check if details marker is set before accessing it [extensions/Cite] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163696 (https://phabricator.wikimedia.org/T397760) [08:11:05] T397760 is the other logspam issue [08:11:06] T397760: PHP Warning: Undefined array key "details" - https://phabricator.wikimedia.org/T397760 [08:11:42] o/ [08:11:51] RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [08:12:05] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.70 [08:12:08] I am happy to backport that log spam patch now if that can help, but I don't think it is related to whatever is ongoing right now [08:12:08] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.71 [08:12:44] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163693|Revert "Activate feature to resolve wikibase link labels in pilot wiki changelists"]] (duration: 10m 18s) [08:12:59] it's not [08:13:19] the error messages related to https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1163372 should be resolved [08:13:24] one more config patch to go in this window [08:13:29] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:13:31] jouncebot: nowandnext [08:13:31] For the next 1 hour(s) and 46 
minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T0800) [08:13:31] In 1 hour(s) and 46 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1000) [08:14:11] +1 for deployment [08:14:12] (03PS1) 10Vgutierrez: ATSBackendErrorsHigh: Report the impacted site on summary [alerts] - 10https://gerrit.wikimedia.org/r/1163698 [08:14:13] (03CR) 10Thiemo Kreuz (WMDE): [C:03+2] Check if details marker is set before accessing it [extensions/Cite] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163696 (https://phabricator.wikimedia.org/T397760) (owner: 10Hashar) [08:14:13] I will try not to break everything this time [08:14:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204) (owner: 10Kosta Harlan) [08:14:32] the train will be run later tonight by Jeena (she is on US west coast) [08:15:15] breaking stuff is fine, as long as you fix it :] [08:15:19] (03Merged) 10jenkins-bot: Pass SecurityLogContext to logger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204) (owner: 10Kosta Harlan) [08:15:41] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1163633|Pass SecurityLogContext to logger (T395204)]] [08:15:46] T395204: MediaWiki should log request information (IP, user agent, referrer, HTTP method, etc) in a more uniform and predictable way - https://phabricator.wikimedia.org/T395204 [08:15:52] (03Merged) 10jenkins-bot: Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [08:17:45] (03PS3) 10Cathal Mooney: Netbox hosts: ensure reposync repos are set up to match cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163436 
(https://phabricator.wikimedia.org/T362985) [08:17:53] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1163633|Pass SecurityLogContext to logger (T395204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:18:38] (03CR) 10Volans: [C:03+2] Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi) [08:19:45] (03CR) 10Suzanne Wood: [C:03+1] "What happened was:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [08:20:32] !log kharlan@deploy1003 kharlan: Continuing with sync [08:21:46] (03PS4) 10Cathal Mooney: Netbox hosts: ensure reposync repos are set up to match cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) [08:21:48] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [08:22:05] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10946089 (10Fabfur) User added to the phabricator "nda" group [08:22:09] (03PS3) 10Muehlenhoff: debmonitor_dev: Update bind address for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1163634 (https://phabricator.wikimedia.org/T397696) [08:23:02] (03PS1) 10Hnowlan: mobileapps: remove memory limit for canary release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163702 (https://phabricator.wikimedia.org/T397750) [08:23:26] (03PS2) 10Herron: admin: add ldap_only entry for derhexer [puppet] - 10https://gerrit.wikimedia.org/r/1160216 (https://phabricator.wikimedia.org/T397099) [08:24:29] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [08:24:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:24:55] PROBLEM - Backup freshness on backup1014 is CRITICAL: All failures: 2 (backup1013, ...), Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:24:58] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160216 (https://phabricator.wikimedia.org/T397099) (owner: 10Herron) [08:25:31] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [08:25:48] (03CR) 10Btullis: [C:03+1] hdfs: set an-worker1176 to analytics-fex recipe [puppet] - 10https://gerrit.wikimedia.org/r/1163635 (https://phabricator.wikimedia.org/T390176) (owner: 10Stevemunene) [08:25:58] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.71 [08:26:01] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.72 [08:26:16] (03CR) 10Btullis: [C:03+1] hdfs: readd group 9 and 10 hosts back to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1163691 (https://phabricator.wikimedia.org/T390176) (owner: 10Stevemunene) [08:26:42] Sorry all! Forgot that a crucial part of the change for 1163372 is still on this week's train and not deployed to all pilot wikis where the feature was activated. 
[08:26:50] (03CR) 10Stevemunene: [C:03+2] hdfs: set an-worker1176 to analytics-fex recipe [puppet] - 10https://gerrit.wikimedia.org/r/1163635 (https://phabricator.wikimedia.org/T390176) (owner: 10Stevemunene) [08:27:00] (03CR) 10Fabfur: [C:03+2] admin: add ldap_only entry for derhexer [puppet] - 10https://gerrit.wikimedia.org/r/1160216 (https://phabricator.wikimedia.org/T397099) (owner: 10Herron) [08:27:25] (03PS4) 10Ayounsi: reimage: add MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 [08:27:33] (03CR) 10Muehlenhoff: kubernetes: add a new kubernetes section (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [08:28:01] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163633|Pass SecurityLogContext to logger (T395204)]] (duration: 12m 19s) [08:28:06] T395204: MediaWiki should log request information (IP, user agent, referrer, HTTP method, etc) in a more uniform and predictable way - https://phabricator.wikimedia.org/T395204 [08:28:13] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1090583 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [08:28:22] (03PS5) 10Cathal Mooney: Netbox hosts: ensure reposync repos are set up to match cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) [08:28:23] (03CR) 10Muehlenhoff: [C:03+2] debmonitor_dev: Update bind address for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1163634 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [08:28:49] joelyrookewmde: it's ok, thanks for commenting on the task and hopefully the next deployment is smoother :) [08:28:59] (03CR) 10Hashar: "No worries @suzanne.wood@wikimedia.de, can you copy paste this comment on the Phabricator task T388685 please? 
That will help discovery la" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [08:29:00] fabfur: I'll merge your data.yaml patch along [08:29:15] (03CR) 10CI reject: [V:04-1] Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi) [08:29:27] (03Merged) 10jenkins-bot: Check if details marker is set before accessing it [extensions/Cite] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163696 (https://phabricator.wikimedia.org/T397760) (owner: 10Hashar) [08:29:40] joelyrookewmde: it is perfectly fine, no worries. Ideally that should have been caught by a test that ensures the config setting works with both deployed versions but we do not have such a testing system :] [08:29:50] joelyrookewmde: that got caught and rolled back. It is fine! [08:30:05] !log UTC morning deploys done [08:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:29] (03PS2) 10Hnowlan: mobileapps: remove memory limit for canary release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163702 (https://phabricator.wikimedia.org/T397750) [08:30:50] I am deploying the Cite patch [08:31:42] (03CR) 10Ayounsi: "Addressed all the comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 (owner: 10Ayounsi) [08:31:47] hmm or maybe Thiemo is on it [08:32:24] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1163696|Check if details marker is set before accessing it (T397760)]] [08:32:27] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [08:32:29] T397760: PHP Warning: Undefined array key "details" - https://phabricator.wikimedia.org/T397760 [08:32:51] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [08:33:00] 
10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946138 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cu... [08:34:33] (03CR) 10CI reject: [V:04-1] reimage: add MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 (owner: 10Ayounsi) [08:34:35] !log hashar@deploy1003 hashar: Backport for [[gerrit:1163696|Check if details marker is set before accessing it (T397760)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:35:25] !log hashar@deploy1003 hashar: Continuing with sync [08:36:04] (03PS5) 10Ayounsi: reimage: add MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 [08:36:09] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2046.codfw.wmnet [08:36:11] 06SRE, 10LDAP-Access-Requests: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10946159 (10Fabfur) Hello, the user has been added to the "nda" ldap group, can you please try and confirm you can now access the needed resources? 
[08:36:38] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1047.eqiad.wmnet [08:38:15] (03CR) 10Joely Rooke WMDE: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [08:40:55] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.72 [08:40:58] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.73 [08:41:51] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2046.codfw.wmnet [08:42:12] (03PS1) 10Joely Rooke WMDE: Revert^2 "Activate feature to resolve wikibase link labels in pilot wiki changelists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163704 [08:42:15] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163696|Check if details marker is set before accessing it (T397760)]] (duration: 09m 51s) [08:42:18] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1047.eqiad.wmnet [08:42:20] T397760: PHP Warning: Undefined array key "details" - https://phabricator.wikimedia.org/T397760 [08:44:07] (03CR) 10Joely Rooke WMDE: "Scheduling this for backport in afternoon window of Thursday, 26th June 2025, after all groups have been pushed to 1.45.0-wmf.7 (contains " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163704 (owner: 10Joely Rooke WMDE) [08:44:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163704 (owner: 10Joely Rooke WMDE) [08:49:40] (03CR) 10FNegri: [C:03+1] p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) (owner: 
10David Caro) [08:51:20] (03PS6) 10Ayounsi: Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 [08:52:05] (03CR) 10David Caro: [V:03+1 C:03+2] p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) (owner: 10David Caro) [08:53:11] !log cgoubert@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad [08:54:23] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.73 [08:54:26] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.74 [08:55:27] (03CR) 10Clément Goubert: [C:03+1] mobileapps: remove memory limit for canary release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163702 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [08:57:39] !log stevemunene@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1176.eqiad.wmnet with OS bullseye [08:57:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946260 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1... [08:59:58] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[09:01:00] (03PS6) 10Cathal Mooney: Netbox hosts: ensure reposync repos are set up to match cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) [09:01:20] (03CR) 10Hnowlan: [C:03+2] mobileapps: remove memory limit for canary release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163702 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [09:03:04] (03Merged) 10jenkins-bot: mobileapps: remove memory limit for canary release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163702 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [09:03:40] (03CR) 10Volans: [C:03+2] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi) [09:03:45] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1046.eqiad.wmnet [09:04:38] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [09:04:52] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2047.codfw.wmnet [09:05:48] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [09:06:04] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [09:06:50] (03CR) 10Cathal Mooney: "An alias didn't do the trick, it would just pick the empty array from hieradata/common/profile/spicerack/reposync.yaml. 
No great options " [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [09:09:11] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.74 [09:09:14] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.75 [09:09:39] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1046.eqiad.wmnet [09:10:35] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2047.codfw.wmnet [09:12:57] (03Merged) 10jenkins-bot: Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi) [09:15:05] (03CR) 10Fabfur: [C:03+1] "lgtm!" [alerts] - 10https://gerrit.wikimedia.org/r/1163698 (owner: 10Vgutierrez) [09:16:37] (03CR) 10Vgutierrez: [C:03+2] ATSBackendErrorsHigh: Report the impacted site on summary [alerts] - 10https://gerrit.wikimedia.org/r/1163698 (owner: 10Vgutierrez) [09:19:45] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [09:22:14] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.75 [09:22:17] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.76 [09:24:16] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [09:24:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - 
https://phabricator.wikimedia.org/T390176#10946333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cu... [09:25:27] (03PS1) 10Elukey: admin_ng: disable tag->sha256-digest resolution for knative on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163711 (https://phabricator.wikimedia.org/T397696) [09:25:29] (03PS1) 10Elukey: admin_ng: disable tag->sha256 for all ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163712 (https://phabricator.wikimedia.org/T397696) [09:25:30] (03PS1) 10Elukey: aux/dse: remove the usage of sha256 digest image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163713 (https://phabricator.wikimedia.org/T397696) [09:25:39] (03PS1) 10Volans: CHANGELOG: add changelogs for release v11.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163714 [09:26:08] (03CR) 10Jelto: [C:03+1] gerrit: read-only plugin orchestration in failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: 10Arnaudb) [09:27:16] (03CR) 10Arnaudb: [C:03+2] gerrit: read-only plugin orchestration in failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: 10Arnaudb) [09:27:50] (03CR) 10Volans: [C:03+1] "LGTM, thx" [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [09:28:38] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v11.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163714 (owner: 10Volans) [09:28:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:29:45] RESOLVED: 
SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [09:30:07] (03CR) 10Cathal Mooney: [C:03+2] Netbox hosts: ensure reposync repos are set up to match cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [09:32:31] (03PS2) 10Slyngshede: P:dns::auth::netbox Netbox DNS zones file sync [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) [09:33:31] (03CR) 10JMeybohm: [C:03+1] cleanup prerm script update-alternatives command [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1163695 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [09:33:37] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6062/co" [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede) [09:33:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:34:26] 14 [09:34:28] iff [09:34:42] today is not my day [09:34:45] jouncebot: nowandnext [09:34:45] For the next 0 hour(s) and 25 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T0800) [09:34:45] In 0 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1000) [09:34:57] if 14 was a password, probably time to change it anyway :) [09:34:58] * Amir1 gives coffee to elukey <3 [09:35:27] <3 [09:35:44] (03PS1) 10Vgutierrez: liberica: Don't start liberica-cp on system boot [puppet] - 10https://gerrit.wikimedia.org/r/1163715 (https://phabricator.wikimedia.org/T396398) [09:35:50] (03CR) 10Ladsgroup: [C:03+2] Clean up EventBus and jobs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163323 (https://phabricator.wikimedia.org/T397367) (owner: 10Ladsgroup) [09:36:14] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163715 (https://phabricator.wikimedia.org/T396398) (owner: 10Vgutierrez) [09:36:16] (03CR) 10CI reject: [V:04-1] liberica: Don't start liberica-cp on system boot [puppet] - 10https://gerrit.wikimedia.org/r/1163715 (https://phabricator.wikimedia.org/T396398) (owner: 10Vgutierrez) [09:36:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163323 (https://phabricator.wikimedia.org/T397367) (owner: 10Ladsgroup) [09:36:37] (03Merged) 10jenkins-bot: Clean up EventBus and jobs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163323 (https://phabricator.wikimedia.org/T397367) (owner: 10Ladsgroup) [09:36:56] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox-records Generate and push DNS records from Netbox data [09:37:01] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1163323|Clean up EventBus and jobs config (T397367)]] [09:37:04] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox-records (exit_code=0) Generate and push DNS records from Netbox data [09:37:06] T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367 [09:37:30] (03PS2) 10Vgutierrez: liberica: Don't start liberica-cp on system boot [puppet] - 
10https://gerrit.wikimedia.org/r/1163715 (https://phabricator.wikimedia.org/T396398) [09:37:46] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.76 [09:37:48] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.77 [09:38:15] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v11.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163714 (owner: 10Volans) [09:39:09] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1163323|Clean up EventBus and jobs config (T397367)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:39:56] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [09:40:42] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163715 (https://phabricator.wikimedia.org/T396398) (owner: 10Vgutierrez) [09:41:07] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:42:07] (03PS3) 10Slyngshede: P:dns::auth::netbox Netbox DNS zones file sync [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) [09:42:20] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[09:43:02] (03CR) 10Klausman: [C:03+1] admin_ng: disable tag->sha256 for all ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163712 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [09:43:28] (03CR) 10Klausman: [C:03+1] admin_ng: disable tag->sha256-digest resolution for knative on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163711 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [09:44:10] (03CR) 10Cathal Mooney: "LGTM, some small nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede) [09:44:48] !log stevemunene@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1176.eqiad.wmnet with OS bullseye [09:45:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946376 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1... [09:45:10] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [09:45:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cu... 
[09:46:37] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163323|Clean up EventBus and jobs config (T397367)]] (duration: 09m 36s) [09:46:43] T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367 [09:49:14] (03PS4) 10Slyngshede: P:dns::auth::netbox Netbox DNS zones file sync [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) [09:51:03] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.77 [09:51:05] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.78 [09:51:16] (03PS5) 10Slyngshede: P:dns::auth::netbox Netbox DNS zones file sync [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) [09:52:05] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6064/co" [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede) [09:53:38] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2008.codfw.wmnet with reason: Maintenance and reboot [09:54:39] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede) [09:54:43] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:55:48] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6065/console" [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede) [09:56:16] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:dns::auth::netbox Netbox DNS zones file sync [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede) [09:57:43] (03PS1) 10Volans: Upstream release v11.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1163716 [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1000) [10:00:06] (03CR) 10Volans: [C:03+2] Upstream release v11.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1163716 (owner: 10Volans) [10:01:48] (03PS1) 10Slyngshede: P:dns::auth::netbox_dns_records fix branch name [puppet] - 10https://gerrit.wikimedia.org/r/1163717 (https://phabricator.wikimedia.org/T362985) [10:02:37] ACKNOWLEDGEMENT - Backup freshness on backup1014 is CRITICAL: All failures: 2 (backup1013, ...), Fresh: 140 jobs Jcrespo ongoing backups, expected - The acknowledgement expires at: 2025-06-27 10:02:16. 
https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:04:12] (03CR) 10Slyngshede: [C:03+2] P:dns::auth::netbox_dns_records fix branch name [puppet] - 10https://gerrit.wikimedia.org/r/1163717 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede) [10:04:48] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.78 [10:04:50] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.79 [10:05:18] !log stevemunene@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1176.eqiad.wmnet with OS bullseye [10:05:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1... 
[10:07:36] (03PS1) 10Jakob: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163719 [10:09:38] (03CR) 10Stevemunene: [C:03+2] hdfs: readd group 9 and 10 hosts back to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1163691 (https://phabricator.wikimedia.org/T390176) (owner: 10Stevemunene) [10:10:41] (03Merged) 10jenkins-bot: Upstream release v11.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1163716 (owner: 10Volans) [10:11:02] the deployment is stuck in syncing to apaches (bare metals) [10:11:09] (03PS1) 10Filippo Giunchedi: tox: add python3.13 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163723 [10:11:16] for 35 minutes now [10:11:43] Amir1: huh [10:11:55] I think it might be actually my connection [10:11:57] one second [10:12:23] yup, my connection dropped and it wasn't moving forward, I reconnected and screen says it's finished [10:12:29] sigh, sorry for the false alarm [10:12:36] (03CR) 10Filippo Giunchedi: [C:03+2] icinga: Add frban1002 to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1163430 (https://phabricator.wikimedia.org/T395951) (owner: 10Dwisehaupt) [10:12:36] :D [10:13:33] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:13:46] !log dropping table job in group0 (T397367) [10:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:50] T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367 [10:14:56] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1154.eqiad.wmnet [10:15:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946573 (10ops-monitoring-bot) Host an-worker1154.eqiad.wmnet rebooted by stevemunene@cumin1002 w... 
[10:15:49] (03PS2) 10Filippo Giunchedi: tox: add python3.13 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163723 (https://phabricator.wikimedia.org/T395449) [10:16:40] jouncebot: nowandnext [10:16:40] For the next 0 hour(s) and 43 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1000) [10:16:40] In 0 hour(s) and 43 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1100) [10:16:51] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [10:17:03] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [10:18:11] PROBLEM - Host wikikube-worker1069 is DOWN: PING CRITICAL - Packet loss = 100% [10:19:02] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=1) Checking container DBs of wikipedia-commons-local-thumb.79 [10:19:05] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.7a [10:19:16] jouncebot: nowandnext [10:19:16] For the next 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1000) [10:19:16] In 0 hour(s) and 40 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1100) [10:19:33] FIRING: KubernetesCalicoDown: wikikube-worker1069.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1069.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:19:37] I wouldn’t mind doing a Wikibase backport if that’s okay with everyone else (esp. 
hnowlan ig ^^) [10:20:21] (03CR) 10Suzanne Wood: [C:03+1] Revert^2 "Activate feature to resolve wikibase link labels in pilot wiki changelists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163704 (owner: 10Joely Rooke WMDE) [10:21:18] Lucas_WMDE: no objections from me [10:21:26] (03PS1) 10Lucas Werkmeister (WMDE): Clicking the search button goes to Special:Search [extensions/Wikibase] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163727 (https://phabricator.wikimedia.org/T397506) [10:21:27] RECOVERY - Hadoop DataNode on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [10:21:35] (03PS1) 10Lucas Werkmeister (WMDE): Clicking the search button goes to Special:Search [extensions/Wikibase] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163728 (https://phabricator.wikimedia.org/T397506) [10:21:50] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1154.eqiad.wmnet [10:21:54] alright, I’ll backport ^ those two in a few minutes if I don’t hear any objections :) [10:22:03] (03PS1) 10Majavah: P:toolforge::prometheus: Add scrape rules for Loki/Alloy [puppet] - 10https://gerrit.wikimedia.org/r/1163729 (https://phabricator.wikimedia.org/T386480) [10:22:17] (actually, on second thought, I’ll just start the backport now. 
that still leaves like at least 10 minutes for someone to object during the gate-and-submit build anyway :D [10:22:18] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1149.eqiad.wmnet [10:22:19] ) [10:22:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db1252 weight to 300 - see T385141', diff saved to https://phabricator.wikimedia.org/P78677 and previous config saved to /var/cache/conftool/dbconfig/20250625-102225-fceratto.json [10:22:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946593 (10ops-monitoring-bot) Host an-worker1149.eqiad.wmnet rebooted by st... [10:22:31] T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141 [10:23:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163727 (https://phabricator.wikimedia.org/T397506) (owner: 10Lucas Werkmeister (WMDE)) [10:23:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163728 (https://phabricator.wikimedia.org/T397506) (owner: 10Lucas Werkmeister (WMDE)) [10:23:23] hmm, https://spiderpig.wikimedia.org/jobs/249 didn’t show me the Yes/No buttons to confirm until I reloaded the page [10:23:32] let’s see if it happens again or if it was just a hiccup [10:23:54] (I could see the “Backport the changes?” prompt in the terminal but the interactive part at the top of the page was missing) [10:25:22] (03CR) 10CI reject: [V:04-1] tox: add python3.13 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163723 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [10:26:12] (03PS2) 10Majavah: 
P:toolforge::prometheus: Add scrape rules for Loki/Alloy [puppet] - 10https://gerrit.wikimedia.org/r/1163729 (https://phabricator.wikimedia.org/T386480) [10:27:34] (03CR) 10Btullis: [C:04-1] "We discussed this in #wikimedia-k8s-sig and on the dse side at least, we're not comfortable with this change. The use of checksums is to m" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163713 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [10:28:11] (03PS1) 10Filippo Giunchedi: icinga: add _status for type annotations [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163732 (https://phabricator.wikimedia.org/T395449) [10:30:51] !log uploaded spicerack_11.1.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia [10:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:29] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1149.eqiad.wmnet [10:32:36] (03PS2) 10Clément Goubert: P::mediawiki::maintenance: rsync to deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1163731 (https://phabricator.wikimedia.org/T397017) [10:32:40] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163731 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [10:32:53] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.7a [10:32:55] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.7b [10:34:39] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1150.eqiad.wmnet [10:34:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946631 (10ops-monitoring-bot) Host 
an-worker1150.eqiad.wmnet rebooted by stevemunene@cumin1002 wi... [10:37:10] !log Ran fixStuckGlobalRename.php for T397807 [10:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:16] T397807: Unblock stuck global rename of ReadMore - https://phabricator.wikimedia.org/T397807 [10:38:35] (03CR) 10CI reject: [V:04-1] icinga: add _status for type annotations [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163732 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [10:39:10] (03CR) 10Jelto: [C:03+2] cleanup prerm script update-alternatives command [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1163695 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [10:39:58] (03Merged) 10jenkins-bot: Clicking the search button goes to Special:Search [extensions/Wikibase] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163727 (https://phabricator.wikimedia.org/T397506) (owner: 10Lucas Werkmeister (WMDE)) [10:40:00] (03Merged) 10jenkins-bot: Clicking the search button goes to Special:Search [extensions/Wikibase] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163728 (https://phabricator.wikimedia.org/T397506) (owner: 10Lucas Werkmeister (WMDE)) [10:40:30] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1163727|Clicking the search button goes to Special:Search (T397506)]], [[gerrit:1163728|Clicking the search button goes to Special:Search (T397506)]] [10:40:36] T397506: ScopedTypeaheadSearch - clicking the search button redirects to the main page - https://phabricator.wikimedia.org/T397506 [10:41:31] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1150.eqiad.wmnet [10:41:50] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1151.eqiad.wmnet [10:42:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard 
drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946643 (10ops-monitoring-bot) Host an-worker1151.eqiad.wmnet rebooted by stevemunene@cumin1002 wi... [10:42:40] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1163727|Clicking the search button goes to Special:Search (T397506)]], [[gerrit:1163728|Clicking the search button goes to Special:Search (T397506)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:43:39] works \o/ [10:43:42] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync [10:43:51] and SpiderPig showed me the prompt correctly as well [10:44:42] (03CR) 10Hnowlan: [C:03+1] P::mediawiki::maintenance: rsync to deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1163731 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [10:44:52] (03CR) 10Clément Goubert: [C:03+2] P::mediawiki::maintenance: rsync to deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1163731 (https://phabricator.wikimedia.org/T397017) (owner: 10Clément Goubert) [10:46:18] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.7b [10:46:21] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.7c [10:47:25] !log import kubernetes 1.31.4-6 to apt host - T387548 [10:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:30] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [10:47:44] (03PS1) 10Tchanders: WIP temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163738 [10:48:28] (03CR) 10Tchanders: [C:04-2] "Date and set of wikis to be confirmed. 
Needs comms approval." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163738 (owner: 10Tchanders) [10:48:35] (03CR) 10CI reject: [V:04-1] WIP temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163738 (owner: 10Tchanders) [10:49:04] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1151.eqiad.wmnet [10:51:20] !log root@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup2008.codfw.wmnet: Renew puppet certificate - root@cumin1002 [10:52:22] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163727|Clicking the search button goes to Special:Search (T397506)]], [[gerrit:1163728|Clicking the search button goes to Special:Search (T397506)]] (duration: 11m 52s) [10:52:28] T397506: ScopedTypeaheadSearch - clicking the search button redirects to the main page - https://phabricator.wikimedia.org/T397506 [10:52:48] (03PS1) 10Klausman: hiera/k8s: Add missing :prod suffix to machinetranslation S3 credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1163739 (https://phabricator.wikimedia.org/T335491) [10:52:49] * Lucas_WMDE done deploying [10:53:08] (03CR) 10Klausman: [V:03+2 C:03+2] hiera/k8s: Add missing :prod suffix to machinetranslation S3 credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1163739 (https://phabricator.wikimedia.org/T335491) (owner: 10Klausman) [10:53:44] (03PS2) 10Vgutierrez: hiera: Unify edge uniques settings [puppet] - 10https://gerrit.wikimedia.org/r/1151711 (https://phabricator.wikimedia.org/T391411) [10:56:04] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151711 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [11:00:05] mvolz: May I have your attention please! Services – Citoid / Zotero. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1100) [11:00:55] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.7c [11:00:58] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.7d [11:05:04] (03CR) 10Vgutierrez: [C:03+2] hiera: Unify edge uniques settings [puppet] - 10https://gerrit.wikimedia.org/r/1151711 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [11:05:56] (03PS1) 10Muehlenhoff: Depend on libjs-bootstrap [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) [11:08:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.327s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:08:38] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:08:58] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:09:42] (03PS1) 10Hnowlan: mobileapps: add num_worker param, default setting to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163743 (https://phabricator.wikimedia.org/T397750) [11:13:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.327s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:14:34] (03CR) 10Clément Goubert: [C:03+1] mobileapps: add num_worker param, default setting to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163743 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [11:14:40] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.7d [11:14:42] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.7e [11:16:17] (03CR) 10Hnowlan: [C:03+2] mobileapps: add num_worker param, default setting to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163743 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [11:17:56] (03Merged) 10jenkins-bot: mobileapps: add num_worker param, default setting to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163743 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [11:20:25] (03PS1) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) [11:20:44] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [11:20:47] (03PS2) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) [11:21:04] !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-worker-eqiad [11:21:11] (03PS1) 10Clément Goubert: mw-parsoid: Scale down replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163746 [11:21:16] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [11:21:53] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:22:23] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:23:37] !log Manual 
powercycle of wikikube-worker1069.eqiad.wmnet [11:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:14] (03PS1) 10Vgutierrez: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) [11:24:47] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:25:14] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:26:45] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [11:28:40] 10ops-eqiad, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829 (10Clement_Goubert) 03NEW [11:29:07] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [11:29:33] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.7e [11:29:36] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.7f [11:30:20] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [11:30:30] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [11:31:45] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:35:28] !log homer "cr*eqiad*" commit 'wikikube-worker1069 failed' - T397829 [11:35:33] (03PS27) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [11:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:34] T397829: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829 [11:35:56] (03PS2) 10Klausman: services/machinetranslation: add network policy to allow 
access to Thanos/Swift S3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162900 [11:37:32] !log cgoubert@cumin1003 START - Cookbook sre.hosts.remove-downtime for 14 hosts [11:37:38] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 14 hosts [11:37:51] !log cgoubert@cumin1003 START - Cookbook sre.hosts.remove-downtime for 15 hosts [11:37:57] !log root@cumin1002 DONE (ERROR) - Cookbook sre.puppet.renew-cert (exit_code=97) for backup1008.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [11:37:58] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 15 hosts [11:38:24] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1008.eqiad.wmnet with reason: Maintenance and reboot [11:38:38] !log cgoubert@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wikikube-worker1069.eqiad.wmnet with reason: hw failure [11:40:08] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1061-1068,1070-1075].eqiad.wmnet [11:40:11] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1061-1068,1070-1075].eqiad.wmnet [11:40:45] (03PS3) 10Filippo Giunchedi: tox: add python3.13 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163723 (https://phabricator.wikimedia.org/T395449) [11:40:45] (03PS2) 10Filippo Giunchedi: icinga: add _status for type annotations [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163732 (https://phabricator.wikimedia.org/T395449) [11:41:52] (03CR) 10Klausman: [C:03+2] services/machinetranslation: add network policy to allow access to Thanos/Swift S3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162900 (owner: 10Klausman) [11:42:02] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 
10Cathal Mooney) [11:43:26] (03Merged) 10jenkins-bot: services/machinetranslation: add network policy to allow access to Thanos/Swift S3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162900 (owner: 10Klausman) [11:43:42] !log cgoubert@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1076-1168,1240-1289,1291-1327].eqiad.wmnet,wikikube-worker-exp1001.eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [11:44:22] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.7f [11:44:25] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.80 [11:44:53] !log klausman@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply [11:44:58] !log klausman@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [11:45:12] !log klausman@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [11:45:27] !log klausman@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [11:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:46:20] !log klausman@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [11:46:31] 06SRE, 10SRE-Access-Requests: Update katelevan's ssh key - https://phabricator.wikimedia.org/T397832 (10Nahid) 03NEW [11:46:34] !log klausman@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [11:47:43] (03CR) 10Jcrespo: "It looks like a really bad idea to hardcode the events for the query killer on the code." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [11:48:08] 06SRE, 10SRE-Access-Requests: Update katelevan's ssh key - https://phabricator.wikimedia.org/T397832#10946799 (10KLevan) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDsriWHRqsnwYuPFQvTiXHa1KNwrFYvRRnq1QQpEkpdmCxBbq+EQTKL4S9oTi8XjjCyDVt1lwswPQUTe2iBgMWrmGL3Ez+b9G1RY4MWWTw1IWP0ExSsOEQDZK8hzYbKA82eNpfW7N+jY8qv3WyPuVG6q4... [11:48:33] (03CR) 10Dima koushha: [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163719 (owner: 10Jakob) [11:48:50] 06SRE, 10SRE-Access-Requests: Update katelevan's ssh key - https://phabricator.wikimedia.org/T397832#10946802 (10Nahid) [11:48:52] !log klausman@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [11:48:59] !log klausman@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [11:49:05] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [11:49:25] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [11:49:29] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [11:50:39] (03CR) 10Jakob: [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163719 (owner: 10Jakob) [11:51:20] (03CR) 10Jcrespo: Add switchover cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [11:51:37] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [11:52:21] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163719 (owner: 10Jakob) [11:52:44] (03PS28) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 
10https://gerrit.wikimedia.org/r/1163318 [11:52:45] (03CR) 10Jcrespo: [C:04-1] Add switchover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [11:53:36] !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [11:53:49] !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [11:54:07] !log jakob@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [11:54:10] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:54:23] !log jakob@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [11:54:30] (03CR) 10Jcrespo: [C:04-1] "As I said on IRC, much of this should go into the battle-tested db-switchover. Then the cookbook can handle the different steps separatell" [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [11:54:41] !log jakob@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [11:54:57] !log jakob@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [11:55:37] (03Abandoned) 10Federico Ceratto: Add switchover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [11:56:26] (03PS2) 10Muehlenhoff: Depend on libjs-bootstrap [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) [11:56:39] (03PS3) 10Muehlenhoff: Depend on libjs-bootstrap4 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) [11:57:32] (03PS29) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [11:57:53] !log mvernon@cumin1002 END (PASS) - Cookbook 
sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.80 [11:57:56] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.81 [11:58:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:59:26] (03CR) 10Hnowlan: [C:03+1] mw-parsoid: Scale down replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163746 (owner: 10Clément Goubert) [12:01:28] (03CR) 10Clément Goubert: [C:03+2] mw-parsoid: Scale down replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163746 (owner: 10Clément Goubert) [12:03:09] (03Merged) 10jenkins-bot: mw-parsoid: Scale down replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163746 (owner: 10Clément Goubert) [12:03:31] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:03:46] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [12:03:51] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [12:03:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:03:59] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply 
[12:04:03] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [12:04:19] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [12:04:21] (03CR) 10Federico Ceratto: "I'm summarizing here the discussion on irc with Jaime and Amir on wed 25 jun:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [12:04:26] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [12:04:35] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [12:07:46] (03Restored) 10Federico Ceratto: Add switchover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [12:11:03] 10SRE-tools, 10Spicerack: Flaky icinga unit tests - https://phabricator.wikimedia.org/T397833 (10fgiunchedi) 03NEW [12:11:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.696s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:11:33] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Flaky spicerack icinga unit tests - https://phabricator.wikimedia.org/T397833#10946842 (10fgiunchedi) [12:13:29] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:14:10] FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1080:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues 
- https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:14:13] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.81 [12:14:16] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.82 [12:16:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.696s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:22:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.188s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:24:10] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1092:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:24:26] (03PS2) 10Ladsgroup: tables-catalog: add PageAssessments [puppet] - 10https://gerrit.wikimedia.org/r/1161578 (https://phabricator.wikimedia.org/T393792) (owner: 10MusikAnimal) [12:24:31] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1152.eqiad.wmnet [12:24:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - 
https://phabricator.wikimedia.org/T390178#10946860 (10ops-monitoring-bot) Host an-worker1152.eqiad.wmnet rebooted by stevemunene@cumin1002 wi... [12:24:50] (03CR) 10Ladsgroup: tables-catalog: add PageAssessments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1161578 (https://phabricator.wikimedia.org/T393792) (owner: 10MusikAnimal) [12:26:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:27:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.188s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:28:31] FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:28:45] (03PS3) 10Ladsgroup: tables-catalog: add PageAssessments [puppet] - 10https://gerrit.wikimedia.org/r/1161578 (https://phabricator.wikimedia.org/T393792) (owner: 10MusikAnimal) [12:28:47] (03CR) 10Ladsgroup: [C:03+2] tables-catalog: add PageAssessments [puppet] - 10https://gerrit.wikimedia.org/r/1161578 (https://phabricator.wikimedia.org/T393792) (owner: 10MusikAnimal) [12:28:49] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: add PageAssessments [puppet] - 10https://gerrit.wikimedia.org/r/1161578 (https://phabricator.wikimedia.org/T393792) (owner: 10MusikAnimal) [12:29:30] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs 
(exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.82 [12:29:33] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.83 [12:29:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:30:34] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:30:52] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:31:53] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1152.eqiad.wmnet [12:32:15] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1153.eqiad.wmnet [12:32:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946882 (10ops-monitoring-bot) Host an-worker1153.eqiad.wmnet rebooted by stevemunene@cumin1002 wi... [12:34:17] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10946886 (10Volans) No way, it doesn't work yet, but I need to understand why: `lang=python >>> import xml.etree.ElementTree as ET >>> from xml.dom import minidom >>> sc... 
[12:34:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:37:40] (03PS30) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [12:38:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:39:06] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1153.eqiad.wmnet [12:39:31] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1175.eqiad.wmnet [12:39:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946892 (10ops-monitoring-bot) Host an-worker1175.eqiad.wmnet rebooted by stevemunene@cumin1002 wi... 
[12:40:04] (03PS31) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [12:42:25] (03PS23) 10Arnaudb: gerrit: lock, preflight checks, hieradata lookups, verbosity [cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) [12:42:26] (03PS8) 10Arnaudb: gerrit: git backup tree consistency checker [cookbooks] - 10https://gerrit.wikimedia.org/r/1144565 (https://phabricator.wikimedia.org/T393034) [12:42:27] (03PS6) 10Arnaudb: gerrit: grepping for misconfigurations [cookbooks] - 10https://gerrit.wikimedia.org/r/1143102 (https://phabricator.wikimedia.org/T393034) [12:42:28] (03PS8) 10Arnaudb: gerrit: rsync --checksum local backup safety net [cookbooks] - 10https://gerrit.wikimedia.org/r/1142793 (https://phabricator.wikimedia.org/T393034) [12:42:40] (03CR) 10Arnaudb: [C:03+2] gerrit: lock, preflight checks, hieradata lookups, verbosity [cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [12:42:50] (03CR) 10Arnaudb: [C:03+2] gerrit: git backup tree consistency checker [cookbooks] - 10https://gerrit.wikimedia.org/r/1144565 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [12:42:51] (03CR) 10Arnaudb: [C:03+2] gerrit: grepping for misconfigurations [cookbooks] - 10https://gerrit.wikimedia.org/r/1143102 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [12:42:52] (03CR) 10Arnaudb: [C:03+2] gerrit: rsync --checksum local backup safety net [cookbooks] - 10https://gerrit.wikimedia.org/r/1142793 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [12:43:04] (03PS10) 10Arnaudb: gerrit: probe DNS on both hosts before doing stuff [cookbooks] - 10https://gerrit.wikimedia.org/r/1141862 (https://phabricator.wikimedia.org/T393034) [12:43:05] (03CR) 10Arnaudb: [C:03+2] gerrit: probe DNS on both hosts before doing stuff [cookbooks] - 10https://gerrit.wikimedia.org/r/1141862 
(https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [12:43:06] (03PS6) 10Ayounsi: reimage: add MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 [12:43:38] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.83 [12:43:41] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.84 [12:43:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:45:01] (03CR) 10Elukey: [C:03+2] admin_ng: disable tag->sha256-digest resolution for knative on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163711 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey) [12:46:39] !log root@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup1008.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [12:47:01] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [12:47:05] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1175.eqiad.wmnet [12:47:47] jynus: if that was you (sre.puppet.renew-cert on backup1008) please try to avoid to run cookbooks with double sudo ;) (yes I will make a patch for it at some point) [12:49:09] (03Merged) 10jenkins-bot: gerrit: probe DNS on both hosts before doing stuff [cookbooks] - 10https://gerrit.wikimedia.org/r/1141862 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [12:49:29] volans: indeed, sorry [12:49:56] 10ops-eqiad, 06SRE, 06DC-Ops, 
10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946917 (10Stevemunene) The hosts have rejoined the cluster and the cluster is healthy {F62459029}... [12:50:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946918 (10Stevemunene) [12:50:35] (03Merged) 10jenkins-bot: gerrit: rsync --checksum local backup safety net [cookbooks] - 10https://gerrit.wikimedia.org/r/1142793 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [12:50:37] (03Merged) 10jenkins-bot: gerrit: grepping for misconfigurations [cookbooks] - 10https://gerrit.wikimedia.org/r/1143102 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [12:50:38] (03Merged) 10jenkins-bot: gerrit: git backup tree consistency checker [cookbooks] - 10https://gerrit.wikimedia.org/r/1144565 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [12:50:41] (03Merged) 10jenkins-bot: gerrit: lock, preflight checks, hieradata lookups, verbosity [cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [12:51:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946929 (10Stevemunene) `an-worker1149` was not upgraded as we did not have enough disks for the n... 
[12:51:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946933 (10Stevemunene) [12:51:54] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#10946934 (10Jhancock.wm) [12:52:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946935 (10Stevemunene) 05Open→03Resolved [12:53:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946939 (10Stevemunene) an-worker1154 is back in the cluster, still working on an-worker1176 T390... [12:53:32] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#10946943 (10Jhancock.wm) I need to change the preseed.yaml file so that sretest2005, sretest2006, sretest2009, and sretest2010 (just to cover some other servers in one go) have the same partman as sret... 
[12:56:00] (03Merged) 10jenkins-bot: gerrit: read-only plugin orchestration in failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: 10Arnaudb) [12:58:25] (03PS1) 10Cathal Mooney: Authdns: add profile to role to clone new repo with netbox dns RRs [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) [12:58:54] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.84 [12:58:57] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.85 [12:59:15] (03PS4) 10Alexandros Kosiaris: calico default-deny: Switch other clusters to follow wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161535 [12:59:38] (03PS2) 10Cathal Mooney: Authdns: add profile to role to clone new repo with netbox dns RRs [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) [12:59:47] (03CR) 10Alexandros Kosiaris: [C:03+2] calico default-deny: Switch other clusters to follow wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161535 (owner: 10Alexandros Kosiaris) [13:00:05] Urbanecm and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1300). [13:00:05] aude: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:01:56] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [13:06:19] I can do the backport [13:06:47] (03Merged) 10jenkins-bot: calico default-deny: Switch other clusters to follow wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161535 (owner: 10Alexandros Kosiaris) [13:08:09] (03PS2) 10Vgutierrez: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) [13:08:15] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T397164 [13:08:21] T397164: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T397164 [13:08:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2165 with weight 0 T397164', diff saved to https://phabricator.wikimedia.org/P78679 and previous config saved to /var/cache/conftool/dbconfig/20250625-130835-fceratto.json [13:10:43] (03PS1) 10Esanders: ArticleTarget: Avoid using chained promises with different return values [extensions/VisualEditor] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163768 (https://phabricator.wikimedia.org/T397818) [13:11:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163469 (https://phabricator.wikimedia.org/T397755) (owner: 10Aude) [13:12:24] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:12:33] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.85 [13:12:36] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.86 
[13:12:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/VisualEditor] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163768 (https://phabricator.wikimedia.org/T397818) (owner: 10Esanders) [13:18:50] (03PS1) 10JMeybohm: pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 [puppet] - 10https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) [13:20:02] (03Merged) 10jenkins-bot: Fix missing title on charts and add tests [extensions/Chart] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163469 (https://phabricator.wikimedia.org/T397755) (owner: 10Aude) [13:20:05] (03PS1) 10Federico Ceratto: Switchover s8 master (db2161 -> db2165) [puppet] - 10https://gerrit.wikimedia.org/r/1163770 (https://phabricator.wikimedia.org/T397164) [13:20:05] (03CR) 10Federico Ceratto: "s8 DC master flip as discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1163770 (https://phabricator.wikimedia.org/T397164) (owner: 10Federico Ceratto) [13:20:14] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6066/co" [puppet] - 10https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:20:30] !log aude@deploy1003 Started scap sync-world: Backport for [[gerrit:1163469|Fix missing title on charts and add tests (T397755)]] [13:20:35] T397755: Title is missing on charts (on beta cluster) - https://phabricator.wikimedia.org/T397755 [13:22:46] !log aude@deploy1003 aude: Backport for [[gerrit:1163469|Fix missing title on charts and add tests (T397755)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:22:59] (03CR) 10Ssingh: [C:03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1163715 (https://phabricator.wikimedia.org/T396398) (owner: 10Vgutierrez) [13:24:29] !log aude@deploy1003 aude: Continuing with sync [13:24:44] I can self-deploy my backport next [13:24:49] ok [13:24:54] almost done with mine [13:25:29] 👍 [13:26:21] !log disabled puppet on 'P{O:configcluster}' hosts - T352245 [13:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:30] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [13:26:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:26:51] (03PS1) 10Hnowlan: mobileapps: set num_workers to 0, triple replicas in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163772 (https://phabricator.wikimedia.org/T397750) [13:26:53] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [13:26:56] (03CR) 10Scott French: [C:03+2] P:etcd::tlsproxy: add support for PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [13:27:41] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.86 [13:27:44] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.87 [13:28:00] (03CR) 10Clément Goubert: [C:03+1] mobileapps: set num_workers to 0, triple replicas in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163772 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [13:28:31] FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:29:50] (03PS1) 10Jelto: gitlab: disable second sshd on test instance [puppet] - 10https://gerrit.wikimedia.org/r/1163774 (https://phabricator.wikimedia.org/T396622) [13:30:37] (03PS1) 10Genoveva Galarza: wikifunctions: Upgrade orchestrator from 2025-06-18-130945 to 2025-06-24-204920 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163775 (https://phabricator.wikimedia.org/T391208) [13:30:48] (03PS1) 10Stevemunene: hdfs: Add new worker hosts to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/1163777 (https://phabricator.wikimedia.org/T397615) [13:30:50] (03PS1) 10Stevemunene: hdfs: Assign the right role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1163778 (https://phabricator.wikimedia.org/T397615) [13:31:26] !log aude@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163469|Fix missing title on charts and add 
tests (T397755)]] (duration: 10m 56s) [13:31:32] T397755: Title is missing on charts (on beta cluster) - https://phabricator.wikimedia.org/T397755 [13:32:02] edsanders I'm done [13:32:05] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6067/console" [puppet] - 10https://gerrit.wikimedia.org/r/1163774 (https://phabricator.wikimedia.org/T396622) (owner: 10Jelto) [13:32:38] (03PS3) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) [13:32:55] (03Abandoned) 10Federico Ceratto: Switchover s8 master (db2161 -> db2165) [puppet] - 10https://gerrit.wikimedia.org/r/1163770 (https://phabricator.wikimedia.org/T397164) (owner: 10Federico Ceratto) [13:33:45] (03CR) 10Scott French: [C:03+2] hieradata: pilot cfssl/pki for nginx on conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/1090583 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [13:34:34] (03CR) 10Arnaudb: [C:03+1] gitlab: disable second sshd on test instance [puppet] - 10https://gerrit.wikimedia.org/r/1163774 (https://phabricator.wikimedia.org/T396622) (owner: 10Jelto) [13:34:55] (03PS4) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) [13:34:57] edsanders: would it be possible for you to wait ~ 5 minutes or so before starting your backport? 
[13:38:14] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: disable second sshd on test instance [puppet] - 10https://gerrit.wikimedia.org/r/1163774 (https://phabricator.wikimedia.org/T396622) (owner: 10Jelto) [13:38:21] (03PS3) 10Vgutierrez: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) [13:39:30] !log migrated etcd tlsproxy to cfssl on conf2006 - T352245 [13:39:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [13:39:44] Deployment mobileapps-canary in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-canary - ... [13:39:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [13:40:11] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1160101 (https://phabricator.wikimedia.org/T397164) (owner: 10Gerrit maintenance bot) [13:40:32] PROBLEM - etcd tlsproxy SSL conf2006.codfw.wmnet:4001 on conf2006 is CRITICAL: SSL CRITICAL - Certificate etcd-v3.codfw.wmnet valid until 2025-07-23 13:33:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cergen [13:40:59] that's swfrench-wmf working :D [13:41:00] jelto: are you doing a puppet merge? 
[13:41:17] federico: yes merge is in progress, one sec [13:41:21] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [13:41:27] mobileapps alert is me, will be fixed when safe [13:41:32] done [13:41:41] thanks [13:41:42] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [13:41:52] vgutierrez: heh, yeah it seems to be a race between the new cert showing up and when the icinga check was updated for the new expiry :) [13:42:28] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.87 [13:42:31] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.88 [13:43:48] I didn't get a page though [13:45:07] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619#10947176 (10ssingh) I am going to tackle this for the DNS hosts at least and then we can revisit a generic solution. 
[13:45:15] (03CR) 10Federico Ceratto: [C:03+2] "(Discussed on IRC with Amir and approved)" [puppet] - 10https://gerrit.wikimedia.org/r/1160101 (https://phabricator.wikimedia.org/T397164) (owner: 10Gerrit maintenance bot) [13:46:31] !log Starting s8 codfw failover from db2161 to db2165 - T397164 [13:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:37] T397164: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T397164 [13:47:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2165 to s8 primary T397164', diff saved to https://phabricator.wikimedia.org/P78681 and previous config saved to /var/cache/conftool/dbconfig/20250625-134758-fceratto.json [13:49:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163768 (https://phabricator.wikimedia.org/T397818) (owner: 10Esanders) [13:49:55] !log restarting confd in ulsfo - T352245 [13:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:00] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [13:50:55] (03PS1) 10Volans: Revert "redfish: add support for iDRAC 10" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163783 [13:52:26] (03PS1) 10Genoveva Galarza: wikifunctions: Update evaluators from 2025-06-17-205547 to 2025-06-23-151702 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163784 (https://phabricator.wikimedia.org/T391208) [13:54:43] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:54:48] (03PS5) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 
10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) [13:55:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [13:55:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T395241)', diff saved to https://phabricator.wikimedia.org/P78682 and previous config saved to /var/cache/conftool/dbconfig/20250625-135523-fceratto.json [13:56:52] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.88 [13:56:55] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.89 [13:58:55] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [13:59:03] (03Merged) 10jenkins-bot: ArticleTarget: Avoid using chained promises with different return values [extensions/VisualEditor] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163768 (https://phabricator.wikimedia.org/T397818) (owner: 10Esanders) [13:59:31] !log esanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1163768|ArticleTarget: Avoid using chained promises with different return values (T397818)]] [13:59:34] RECOVERY - etcd tlsproxy SSL conf2006.codfw.wmnet:4001 on conf2006 is OK: SSL OK - Certificate etcd-v3.codfw.wmnet valid until 2025-07-23 13:33:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/PKI [13:59:36] T397818: "Invalid response from server" when switching to VE source mode - https://phabricator.wikimedia.org/T397818 [13:59:41] (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-06-17-205547 to 2025-06-23-151702 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163786 (https://phabricator.wikimedia.org/T391208) [13:59:46] (03PS1) 10Jforrester: wikifunctions: Update 
orchestrator from 2025-06-18-130945 to 2025-06-24-204920 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163787 (https://phabricator.wikimedia.org/T391208) [14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1400) [14:00:36] (03CR) 10Clément Goubert: [C:03+1] k8s.wipe-cluster: Run puppet in batches of 50 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163401 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [14:01:56] (03PS6) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) [14:01:57] !log esanders@deploy1003 esanders: Backport for [[gerrit:1163768|ArticleTarget: Avoid using chained promises with different return values (T397818)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:02:00] (03Abandoned) 10Volans: Revert "redfish: add support for iDRAC 10" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163783 (owner: 10Volans) [14:02:50] (03CR) 10Clément Goubert: [C:03+1] sre.wipe-cluster: Ask user to confirm target k8s version [cookbooks] - 10https://gerrit.wikimedia.org/r/1163402 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [14:02:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829#10947289 (10Jclark-ctr) Confirmed: Service Request 211933253 [14:03:31] (03Abandoned) 10Jforrester: wikifunctions: Update evaluators from 2025-06-17-205547 to 2025-06-23-151702 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163786 (https://phabricator.wikimedia.org/T391208) (owner: 10Jforrester) [14:03:35] (03Abandoned) 10Jforrester: wikifunctions: Update orchestrator from 2025-06-18-130945 to 2025-06-24-204920 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1163787 (https://phabricator.wikimedia.org/T391208) (owner: 10Jforrester) [14:04:25] (03PS1) 10Volans: redfish: actually support iDRAC 10 for SCP [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163788 (https://phabricator.wikimedia.org/T392851) [14:04:38] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10947300 (10Jclark-ctr) [14:04:39] !log esanders@deploy1003 esanders: Continuing with sync [14:04:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T395241)', diff saved to https://phabricator.wikimedia.org/P78684 and previous config saved to /var/cache/conftool/dbconfig/20250625-140446-fceratto.json [14:05:05] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10947301 (10Volans) By trial and error with Luca we found that the Target parameter wants a list now. Sent new fix. 
[14:05:22] !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:05:29] 06SRE, 06collaboration-services, 10Observability-Alerting, 13Patch-For-Review, 10SRE Observability (FY2025/2026-Q1): create a new place for prometheus/alertmanager checks not tied to physical machines - https://phabricator.wikimedia.org/T397264#10947307 (10lmata) [14:05:53] !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:07:45] !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:08:23] !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:08:37] !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:09:34] !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:09:39] 06SRE, 06Infrastructure-Foundations, 10SRE Observability (FY2025/2026-Q1): librenms-syslog leaks memory - https://phabricator.wikimedia.org/T397427#10947319 (10lmata) [14:10:04] (03PS7) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) [14:10:58] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Update evaluators from 2025-06-17-205547 to 2025-06-23-151702 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163784 (https://phabricator.wikimedia.org/T391208) (owner: 10Genoveva Galarza) [14:11:11] !log esanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163768|ArticleTarget: Avoid using chained promises with different return values (T397818)]] (duration: 11m 40s) [14:11:17] T397818: "Invalid response from server" when switching to VE source mode - https://phabricator.wikimedia.org/T397818 [14:11:28] PROBLEM - Host wikikube-worker1243 is DOWN: PING CRITICAL - Packet loss = 100% [14:12:05] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking 
container DBs of wikipedia-commons-local-thumb.89 [14:12:08] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.8a [14:12:37] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-06-17-205547 to 2025-06-23-151702 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163784 (https://phabricator.wikimedia.org/T391208) (owner: 10Genoveva Galarza) [14:13:45] (03PS8) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) [14:13:56] !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:14:33] FIRING: KubernetesCalicoDown: wikikube-worker1243.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1243.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:14:34] !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:15:11] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1045.eqiad.wmnet [14:15:27] !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:16:08] !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:16:18] !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:16:26] (03PS9) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) [14:17:08] !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:17:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db2161 T397164', diff saved to https://phabricator.wikimedia.org/P78685 and previous config saved to 
/var/cache/conftool/dbconfig/20250625-141729-ladsgroup.json [14:17:35] T397164: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T397164 [14:17:42] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2048.codfw.wmnet [14:17:43] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-06-18-130945 to 2025-06-24-204920 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163775 (https://phabricator.wikimedia.org/T391208) (owner: 10Genoveva Galarza) [14:19:22] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-06-18-130945 to 2025-06-24-204920 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163775 (https://phabricator.wikimedia.org/T391208) (owner: 10Genoveva Galarza) [14:19:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P78686 and previous config saved to /var/cache/conftool/dbconfig/20250625-141953-fceratto.json [14:20:29] !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:20:48] !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:21:01] (03PS3) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) [14:21:09] !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:21:15] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1045.eqiad.wmnet [14:21:34] !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:21:44] !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:21:55] (03CR) 10CI reject: [V:04-1] Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [14:22:07] !log 
gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:22:29] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [14:23:25] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2048.codfw.wmnet [14:25:08] (03CR) 10Elukey: [C:03+1] redfish: actually support iDRAC 10 for SCP [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163788 (https://phabricator.wikimedia.org/T392851) (owner: 10Volans) [14:25:33] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.8a [14:25:35] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.8b [14:27:08] (03CR) 10Volans: [C:03+2] redfish: actually support iDRAC 10 for SCP [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163788 (https://phabricator.wikimedia.org/T392851) (owner: 10Volans) [14:27:57] (03CR) 10Filippo Giunchedi: [C:03+1] centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [14:28:57] swfrench-wmf: would you mind if I snuck in a mobileapps deploy? [14:29:44] hnowlan: go for it! I'm in a holding pattern for moment and will likely revert conf2006 shortly :) [14:29:57] ah, okay! 
[14:30:10] (03CR) 10Hnowlan: [C:03+2] mobileapps: set num_workers to 0, triple replicas in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163772 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [14:30:22] (03PS4) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) [14:30:54] (03PS1) 10Scott French: Revert "hieradata: pilot cfssl/pki for nginx on conf2006" [puppet] - 10https://gerrit.wikimedia.org/r/1163798 (https://phabricator.wikimedia.org/T352245) [14:31:24] (03CR) 10CI reject: [V:04-1] Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [14:31:47] (03CR) 10Herron: [C:03+1] centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [14:31:50] (03Merged) 10jenkins-bot: mobileapps: set num_workers to 0, triple replicas in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163772 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [14:32:29] (03CR) 10Scott French: [C:03+2] Revert "hieradata: pilot cfssl/pki for nginx on conf2006" [puppet] - 10https://gerrit.wikimedia.org/r/1163798 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [14:33:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1400) [14:33:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1430) [14:33:22] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:34:37] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:35:10] (03PS1) 10Ayounsi: reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 
10https://gerrit.wikimedia.org/r/1163800 [14:36:01] (03Merged) 10jenkins-bot: redfish: actually support iDRAC 10 for SCP [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163788 (https://phabricator.wikimedia.org/T392851) (owner: 10Volans) [14:37:08] (03PS1) 10JHathaway: Add vendor exclusion to DHCPConfMac [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163801 [14:38:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 23.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:39:38] That's probably the mobileapps redeploy cc hnowlan ^ [14:39:43] !log reverted etcd tlsproxy to cergen certs on conf2006 - T352245 [14:39:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [14:39:44] Deployment mobileapps-canary in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-canary - ... 
[14:39:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [14:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:48] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [14:39:59] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1044.eqiad.wmnet [14:40:01] We should wait a bit see how it stabilizes, and maybe up replica count [14:40:03] claime: erk, looking [14:40:14] (03PS1) 10Clare Ming: xLab: Deploy v0.7.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163803 (https://phabricator.wikimedia.org/T396151) [14:40:20] hnowlan: rps is going back down already [14:40:22] already dropping but yeah, probably worth making a change once things level out [14:40:29] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.8b [14:40:32] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.8c [14:41:10] RPS is still climbing on mobileapps so we'll see [14:41:16] ack [14:42:11] (03CR) 10CI reject: [V:04-1] reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [14:43:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 23.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:43:20] (03PS4) 10Vgutierrez: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) [14:45:22] (03CR) 10Santiago Faci: [C:03+2] 
xLab: Deploy v0.7.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163803 (https://phabricator.wikimedia.org/T396151) (owner: 10Clare Ming) [14:45:50] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1044.eqiad.wmnet [14:46:07] (03PS2) 10Ayounsi: reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 [14:46:37] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2049.codfw.wmnet [14:46:49] (03Merged) 10jenkins-bot: xLab: Deploy v0.7.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163803 (https://phabricator.wikimedia.org/T396151) (owner: 10Clare Ming) [14:46:50] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1043.eqiad.wmnet [14:46:51] (03CR) 10Ssingh: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [14:47:30] (03CR) 10CI reject: [V:04-1] Add vendor exclusion to DHCPConfMac [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163801 (owner: 10JHathaway) [14:47:33] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [14:47:41] (03PS3) 10Effie Mouzeli: site.pp: make wikikube-worker-exp2001 a k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1160238 (https://phabricator.wikimedia.org/T276994) [14:47:53] (03CR) 10Effie Mouzeli: [C:03+2] site.pp: make wikikube-worker-exp2001 a k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1160238 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [14:48:06] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [14:48:40] !log incrementally restarting confds in codfw, ulsfo, eqsin - T352245 [14:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:46] T352245: Migrate the etcd main 
cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [14:50:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:52:28] (03CR) 10CI reject: [V:04-1] reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [14:52:34] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2049.codfw.wmnet [14:52:53] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1043.eqiad.wmnet [14:54:14] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.8c [14:54:16] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.8d [14:57:36] (03PS5) 10Vgutierrez: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) [14:58:14] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:00:49] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1042.eqiad.wmnet [15:04:32] (03CR) 10Ssingh: "Looking good, thanks for working on it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [15:05:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:51] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1042.eqiad.wmnet [15:06:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Add db2161', diff saved to https://phabricator.wikimedia.org/P78689 and previous config saved to /var/cache/conftool/dbconfig/20250625-150657-fceratto.json [15:07:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:08:34] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.8d [15:08:37] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.8e [15:10:14] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2050.codfw.wmnet [15:10:50] (03CR) 10Cathal Mooney: "Ok, let me do that elsewhere and rebase and see if I can mangle it that way." 
[puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [15:12:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2161', diff saved to https://phabricator.wikimedia.org/P78690 and previous config saved to /var/cache/conftool/dbconfig/20250625-151210-fceratto.json [15:12:45] (03PS1) 10Aude: Update the chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163812 [15:14:34] (03PS1) 10Cathal Mooney: Sretest: remove temporary additions testing dns repo stuff [puppet] - 10https://gerrit.wikimedia.org/r/1163813 (https://phabricator.wikimedia.org/T362985) [15:14:49] !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{wikikube-worker[1076-1168,1240-1289,1291-1327].eqiad.wmnet,wikikube-worker-exp1001.eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [15:15:20] (03PS2) 10Jgiannelos: mobileapps: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163351 [15:15:39] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163351 (owner: 10Jgiannelos) [15:15:40] (03PS5) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) [15:16:06] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2050.codfw.wmnet [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:48] (03CR) 10CI reject: [V:04-1] Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [15:16:51] 
(03CR) 10Ssingh: [C:03+1] Sretest: remove temporary additions testing dns repo stuff [puppet] - 10https://gerrit.wikimedia.org/r/1163813 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [15:17:24] (03Merged) 10jenkins-bot: mobileapps: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163351 (owner: 10Jgiannelos) [15:19:44] !log cgoubert@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1252-1289,1291-1327].eqiad.wmnet,wikikube-worker-exp1001.eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [15:20:13] (03PS1) 10Abijeet Patro: Mobile editor: restore VE toolbar position [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163814 (https://phabricator.wikimedia.org/T397840) [15:20:52] (03PS2) 10Volans: kubernetes: add a new kubernetes section [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) [15:20:53] (03PS2) 10Volans: kubernetes: add API to update data [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696) [15:21:31] (03CR) 10Cathal Mooney: [C:03+2] Sretest: remove temporary additions testing dns repo stuff [puppet] - 10https://gerrit.wikimedia.org/r/1163813 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [15:22:06] (03PS6) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:22:39] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:23:05] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:23:07] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.8e [15:23:10] !log mvernon@cumin1002 START - Cookbook 
sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.8f [15:23:23] 06SRE, 10SRE-Access-Requests: Remove volunteer access from analytics-privatedata-users group - https://phabricator.wikimedia.org/T397850 (10mmartorana) 03NEW [15:23:38] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:24:10] (03PS10) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) [15:24:17] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1179.eqiad.wmnet with OS bullseye [15:24:37] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:24:39] (03PS6) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) [15:24:56] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2161 - Depooling to then set weight [15:25:03] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2161 - Depooling to then set weight [15:25:04] (03CR) 10Cwhite: [C:03+1] centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [15:25:37] (03CR) 10CI reject: [V:04-1] Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [15:25:58] !log cgoubert@cumin1003 START - Cookbook sre.hosts.remove-downtime for 14 hosts [15:26:04] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 14 hosts [15:26:56] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1166-1168,1240-1242,1244-1251].eqiad.wmnet [15:26:59] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node 
(exit_code=0) pool for host wikikube-worker[1166-1168,1240-1242,1244-1251].eqiad.wmnet [15:30:05] (03CR) 10Cathal Mooney: "Ok thanks, sorry it's a long way from production ready, submitting it a bit earlier than I would to test with test-cookbook. Great to get" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [15:30:21] (03CR) 10Ssingh: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [15:30:47] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:31:44] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [15:32:03] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:33:14] (03PS7) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:33:31] (03CR) 10Ssingh: "Updates Hosts: in commit message to fail fast to debug; will revert later." 
[puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:33:45] 10ops-eqiad, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane failure for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T397851 (10Clement_Goubert) 03NEW [15:34:08] (03PS7) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) [15:34:16] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:34:42] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:34:43] (03PS3) 10Ladsgroup: tables-catalog: Fix visibility of four tables based on maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1155336 (https://phabricator.wikimedia.org/T363581) [15:34:44] !log cgoubert@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wikikube-worker1243.eqiad.wmnet with reason: hw failure [15:34:56] (03CR) 10CI reject: [V:04-1] Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [15:34:59] (03PS8) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:34:59] !log homer "cr*eqiad*" commit 'wikikube-worker1243 failed' [15:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:43] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:36:06] (03CR) 10Ssingh: [C:03+1] "PCC looks happy so I am a mere mortal." 
[puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [15:36:27] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc2002.codfw.wmnet with reason: Decom [15:37:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:37:57] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.8f [15:37:59] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.90 [15:38:31] FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:38:40] (03PS9) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:38:45] !log akosiaris@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [15:38:56] (03PS1) 10AOkoth: doc: decom doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130) [15:39:01] !log akosiaris@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[15:39:14] (03PS8) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) [15:39:59] (03PS2) 10AOkoth: doc: decom doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130) [15:40:49] stevemunene@cumin1002 reimage (PID 209642) is awaiting input [15:41:19] (03CR) 10Cathal Mooney: [C:03+2] Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [15:42:33] !log akosiaris@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [15:42:48] !log cdobbins@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-esams and A:cp - 9.2.11 upgrade (T397456) [15:42:54] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 [15:43:09] (03PS4) 10Ladsgroup: tables-catalog: Fix visibility of four tables based on maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1155336 (https://phabricator.wikimedia.org/T363581) [15:43:12] !log akosiaris@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[15:43:14] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Fix visibility of four tables based on maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1155336 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [15:43:31] FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:43:58] !log deploy GlobalNetworkPolicy targeting kube-dns by service on aux-k8s, dse-k8s, ml-serve, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1161535 [15:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:50] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397852 (10phaultfinder) 03NEW [15:45:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1167.eqiad.wmnet with reason: Maintenance [15:45:09] (03PS1) 10Muehlenhoff: Update contract end date for toluayo [puppet] - 10https://gerrit.wikimedia.org/r/1163817 [15:45:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:45:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T391056)', diff saved to https://phabricator.wikimedia.org/P78692 and previous config saved to /var/cache/conftool/dbconfig/20250625-154529-fceratto.json [15:45:35] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:46:37] !log
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T391056)', diff saved to https://phabricator.wikimedia.org/P78693 and previous config saved to /var/cache/conftool/dbconfig/20250625-154637-fceratto.json [15:46:40] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2051.codfw.wmnet [15:47:11] !log run puppet on dns3003 to clone new repo with netbox generated dns records [15:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:21] 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#10947812 (10elukey) [15:47:22] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-codfw and A:cp - 9.2.11 upgrade (T390912) [15:47:28] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [15:48:05] (03CR) 10AOkoth: "I've silenced the alerting for this host so merging should not result in any noise." [puppet] - 10https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [15:48:31] FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:49:11] !log akosiaris@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:49:26] !log akosiaris@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[15:49:33] FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:50:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:51:05] !log akosiaris@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:51:07] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.90 [15:51:09] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.91 [15:51:33] !log akosiaris@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:52:01] (03CR) 10Muehlenhoff: [C:03+2] Update contract end date for toluayo [puppet] - 10https://gerrit.wikimedia.org/r/1163817 (owner: 10Muehlenhoff) [15:52:04] (03CR) 10Btullis: [C:03+1] hdfs: Add new worker hosts to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/1163777 (https://phabricator.wikimedia.org/T397615) (owner: 10Stevemunene) [15:52:26] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2051.codfw.wmnet [15:52:27] !log akosiaris@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. 
[15:52:36] (03PS10) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:52:37] (03CR) 10Elukey: [C:03+2] admin: allow dcops to use perccli and storcli via sudo [puppet] - 10https://gerrit.wikimedia.org/r/1161382 (https://phabricator.wikimedia.org/T395939) (owner: 10Elukey) [15:52:39] !log akosiaris@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:53:12] !log akosiaris@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [15:53:39] !log akosiaris@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [15:53:53] (03CR) 10Btullis: hdfs: Assign the right role to new hadoop workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163778 (https://phabricator.wikimedia.org/T397615) (owner: 10Stevemunene) [15:54:31] (03PS11) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:58:31] FIRING: [6x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:59:58] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397852#10947851 (10phaultfinder) [16:01:38] (03PS9) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) [16:02:26] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1179.eqiad.wmnet with OS bullseye [16:02:58] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1167 gradually with 4 steps - Pooling in [16:03:02] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159599 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra) [16:03:31] FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:04:53] (03CR) 10Elukey: [C:03+1] images: add a very simple API for the image detail [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163310 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:05:18] (03CR) 10Elukey: [C:03+1] "Thanks and sorry, my bad!" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163368 (https://phabricator.wikimedia.org/T368744) (owner: 10Volans) [16:06:00] (03CR) 10Volans: [C:04-1] "Minor typo inline, LGTM otherwise" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [16:06:03] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.91 [16:06:06] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.92 [16:06:41] (03CR) 10Volans: [C:03+2] images: add a very simple API for the image detail [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163310 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:06:51] (03CR) 10Volans: [C:03+2] src_packages: add migration for OS model [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163368 (https://phabricator.wikimedia.org/T368744) (owner: 10Volans) [16:06:53] (03PS4) 10Muehlenhoff: Depend on libjs-bootstrap4 [software/debmonitor] (debian) - 
10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) [16:07:14] (03CR) 10Muehlenhoff: Depend on libjs-bootstrap4 (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [16:07:26] (03Merged) 10jenkins-bot: images: add a very simple API for the image detail [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163310 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:07:42] (03Merged) 10jenkins-bot: src_packages: add migration for OS model [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163368 (https://phabricator.wikimedia.org/T368744) (owner: 10Volans) [16:07:44] (03PS10) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) [16:10:27] (03PS1) 10Hnowlan: Revert "mobileapps: Deploy node20 upgrade to prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163823 [16:10:56] (03PS11) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) [16:11:48] (03CR) 10Clément Goubert: [C:03+1] Revert "mobileapps: Deploy node20 upgrade to prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163823 (owner: 10Hnowlan) [16:12:51] (03CR) 10Hnowlan: [C:03+2] Revert "mobileapps: Deploy node20 upgrade to prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163823 (owner: 10Hnowlan) [16:13:29] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:13:31] FIRING: ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mobileapps:4102 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:13:45] ^ working on this [16:14:05] (03CR) 10Volans: [C:03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [16:14:24] (03PS12) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [16:14:32] (03Merged) 10jenkins-bot: Revert "mobileapps: Deploy node20 upgrade to prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163823 (owner: 10Hnowlan) [16:14:33] RESOLVED: ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mobileapps:4102 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:14:35] (03CR) 10DLynch: [C:04-1] Deploy EditCheck's multi-check mode everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161937 (https://phabricator.wikimedia.org/T395519) (owner: 10Esanders) [16:15:01] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [16:15:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:15:58] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [16:16:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:16:39] FIRING: [2x] CoreBGPDown: Core 
BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:16:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:17:46] !log cmooney@dns3003 START - running authdns-update [16:18:45] !log cmooney@dns3003 END - running authdns-update [16:19:39] (03PS13) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [16:20:21] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.92 [16:20:24] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.93 [16:20:51] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:20:53] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6075/co" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [16:21:02] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB 
puppetdb (host:localhost) 30305632 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:21:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:22:02] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 6403032 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:22:38] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#10947906 (10elukey) Deployed! @Jclark-ctr please test and report back if anything is missing :) Puppet is currently rolling out the change, so give it one hour to pro... [16:22:59] (03CR) 10Ssingh: [V:03+1] "Success on cp hosts in eqiad; re-adding all previous Hosts and running PCC again." 
[puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [16:23:53] (03PS14) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [16:25:33] (03CR) 10Volans: [C:03+1] "Actually question inline" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [16:27:40] (03PS1) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 [16:27:43] (03PS12) 10Volans: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [16:29:08] (03CR) 10Volans: "question inline" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [16:29:17] (03CR) 10CI reject: [V:04-1] WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey) [16:29:25] (03PS2) 10DLynch: Deploy EditCheck's multi-check mode everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161937 (https://phabricator.wikimedia.org/T395519) (owner: 10Esanders) [16:29:36] (03CR) 10DLynch: [C:03+1] Deploy EditCheck's multi-check mode everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161937 (https://phabricator.wikimedia.org/T395519) (owner: 10Esanders) [16:31:13] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6076/c" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [16:33:19] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1167 gradually 
with 4 steps - Pooling in [16:33:21] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:33:47] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.93 [16:33:50] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.94 [16:43:06] (03PS3) 10Volans: kubernetes: add API to update data [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696) [16:43:12] (03PS1) 10Hnowlan: mobileapps: use guaranteed QoS resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163828 (https://phabricator.wikimedia.org/T397750) [16:44:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:48:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:48:30] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.94 [16:48:33] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.95 [16:54:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:56:08] !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{wikikube-worker[1252-1289,1291-1327].eqiad.wmnet,wikikube-worker-exp1001.eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1700) [17:00:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:01:56] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.95 [17:01:59] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.96 [17:09:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139470 (https://phabricator.wikimedia.org/T359815) (owner: 10Esanders) [17:10:21] jouncebot: nowandnext [17:10:21] For the next 0 hour(s) and 49 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1700) [17:10:21] In 0 hour(s) and 49 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1800) [17:10:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for 
deployment in the [Wednesday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161937 (https://phabricator.wikimedia.org/T395519) (owner: 10Esanders) [17:13:08] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-codfw and A:cp - 9.2.11 upgrade (T390912) [17:13:14] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [17:13:17] (03PS1) 10Ahmon Dancy: logspam.pl: Consolidate ThreadRevision unserialize() errors [puppet] - 10https://gerrit.wikimedia.org/r/1163833 (https://phabricator.wikimedia.org/T259111) [17:14:34] (03CR) 10Scott French: "Thanks for the reviews!" [dns] - 10https://gerrit.wikimedia.org/r/1163396 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [17:14:45] (03CR) 10Scott French: [C:03+2] wmnet: remove swift-r[ow] DYNA records and mock resources (1/3) [dns] - 10https://gerrit.wikimedia.org/r/1163396 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [17:15:06] !log swfrench@dns1004 START - running authdns-update [17:16:10] !log swfrench@dns1004 END - running authdns-update [17:16:53] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.96 [17:16:56] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.97 [17:18:32] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-esams and A:cp - 9.2.11 upgrade (T397456) [17:18:39] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 [17:21:35] (03CR) 10Scott French: [C:03+2] hieradata: remove swift-r[ow] from service catalog (2/3) [puppet] - 10https://gerrit.wikimedia.org/r/1163397 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [17:22:44] FIRING: 
KubernetesDeploymentUnavailableReplicas: ... [17:22:44] Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [17:22:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [17:24:52] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:25:02] (03CR) 10Scott French: [C:03+1] mobileapps: use guaranteed QoS resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163828 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [17:27:03] (03CR) 10Hnowlan: [C:03+2] mobileapps: use guaranteed QoS resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163828 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [17:27:40] (03CR) 10Ahmon Dancy: [C:03+1] "This is ready to go." 
[puppet] - 10https://gerrit.wikimedia.org/r/1155318 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy) [17:27:41] !log cdobbins@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-eqiad and A:cp - 9.2.11 upgrade (T397456) [17:27:48] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 [17:28:45] (03Merged) 10jenkins-bot: mobileapps: use guaranteed QoS resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163828 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [17:29:37] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:30:31] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:31:23] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.97 [17:31:25] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.98 [17:32:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [17:32:44] Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... 
[17:32:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [17:35:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:41:07] (03CR) 10Ssingh: [V:03+1] "Nice cleanup, much needed. I verified my own changes and left a few simple comments." [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [17:43:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10948154 (10VRiley-WMF) Unracked lvs1017 and installing the card now [17:45:32] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.98 [17:45:35] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.99 [17:46:55] jouncebot nowandnext [17:46:55] For the next 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1700) [17:46:56] In 0 hour(s) and 13 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1800) [17:49:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10948171 (10VRiley-WMF) [17:50:58] I have a ContentTranslation UBN to deploy. It will affect a lot of users in group 1. Could I do it before the train? [17:51:11] Go for it [17:52:24] dancy thanks. Waiting for CI. Will keep you updated. 
[17:52:35] (03PS1) 10Ssingh: hiera: cache/{text,upload}: use aliases for SANs [puppet] - 10https://gerrit.wikimedia.org/r/1163837 [17:53:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:54:43] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:54:50] (03PS1) 10Sbisson: CX3 Build 1.0.0+20250625 [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163838 (https://phabricator.wikimedia.org/T397840) [17:56:19] (03CR) 10Scott French: [C:03+2] conftool-data: remove swift-r[ow] discovery entities (3/3) [puppet] - 10https://gerrit.wikimedia.org/r/1163398 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [17:56:24] (03CR) 10Ssingh: "I will run PCC on this after the parent CR is merged, otherwise the SNR is terrible." [puppet] - 10https://gerrit.wikimedia.org/r/1163837 (owner: 10Ssingh) [17:56:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:56:39] (03PS1) 10JHathaway: Add vendor exclusion to DHCPConfMac [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163839 [17:56:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... 
[17:56:44] Deployment mw-experimental.eqiad.pinkllama in mw-experimental at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-experimental&var-deployment=mw-experimental.eqiad.pinkllama - ... [17:56:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [17:56:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:58:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:59:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163838 (https://phabricator.wikimedia.org/T397840) (owner: 10Sbisson) [18:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1800) [18:01:09] FYI: I'm squeezing a ContentTranslation fix on wmf.7 before the train [18:01:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:01:12] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=1) Checking container DBs of wikipedia-commons-local-thumb.99 [18:01:14] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.9a [18:01:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:01:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:05:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:07:44] stephanebisson: Please give me a ping when it's ready for me [18:09:08] jeena will do [18:10:12] (03PS1) 10Ssingh: P:cache::haproxy: properly indent profile (NOOP) [puppet] - 10https://gerrit.wikimedia.org/r/1163842 [18:10:17] Thanks! 
[18:10:45] (03Merged) 10jenkins-bot: CX3 Build 1.0.0+20250625 [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163838 (https://phabricator.wikimedia.org/T397840) (owner: 10Sbisson) [18:10:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:11:11] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1163838|CX3 Build 1.0.0+20250625 (T397840)]] [18:11:20] T397840: SX Mobile editor has no toolbar on test wikipedia - https://phabricator.wikimedia.org/T397840 [18:12:43] (03PS2) 10BryanDavis: [BETA HACK] Changes to profile::puppetserver::volatile [puppet] - 10https://gerrit.wikimedia.org/r/1137013 (owner: 10Krinkle) [18:13:31] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1163838|CX3 Build 1.0.0+20250625 (T397840)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:14:24] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.9a [18:14:24] (03PS1) 10Ssingh: nagios_common and P:cache::haproxy: s/ats/haproxy for SSL checks [puppet] - 10https://gerrit.wikimedia.org/r/1163843 [18:14:26] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.9b [18:14:30] (03CR) 10BryanDavis: "PS2 is a manual rebase on Ic34f8304f9a4aa77e6ae1897cd2c0a3160363985. 
This will be reapplied on deployment-puppetserver-1 to resolve T39771" [puppet] - 10https://gerrit.wikimedia.org/r/1137013 (owner: 10Krinkle) [18:15:17] (03CR) 10Ssingh: [V:03+1] cache::haproxy: Simplify cert configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [18:15:41] (03PS1) 10Ladsgroup: table-catalog: Fix private status of a couple of tables [puppet] - 10https://gerrit.wikimedia.org/r/1163844 [18:15:53] !log sbisson@deploy1003 sbisson: Continuing with sync [18:15:56] I'd like to fix wrong stuff on https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests . Who do I need to bribe? [18:16:48] depends on what stuff you are looking to fix but you can join the clinic duty channel and then decide. [18:18:03] thanks [18:18:31] (03PS7) 10Ladsgroup: mariadb: Load list of private tables from the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) [18:18:45] (03CR) 10Ladsgroup: [C:03+2] table-catalog: Fix private status of a couple of tables [puppet] - 10https://gerrit.wikimedia.org/r/1163844 (owner: 10Ladsgroup) [18:19:15] (03CR) 10Scott French: [C:03+2] hieradata: remove swift-r[ow] SAN entries (cleanup) [puppet] - 10https://gerrit.wikimedia.org/r/1163407 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [18:19:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:21:31] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163838|CX3 Build 1.0.0+20250625 (T397840)]] (duration: 10m 20s) [18:21:37] T397840: SX Mobile editor has no toolbar on test wikipedia - 
https://phabricator.wikimedia.org/T397840 [18:22:30] jeena your turn [18:27:51] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.9b [18:27:54] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.9c [18:28:05] (03PS1) 10Sbisson: CX instrumentation: Fix translation providers in desktop editor events [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163845 (https://phabricator.wikimedia.org/T395493) [18:29:17] (03PS1) 10Ladsgroup: [WIP] Use table catalog for fullViews [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) [18:29:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:30:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10948297 (10VRiley-WMF) [18:30:36] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [18:32:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10948306 (10VRiley-WMF) Inserted new NIC. Moved the server to the new location (E2, U39, Port 39), ran the netbox script, and everything went through smoothly. @BCornwall it should be ready for the... 
[18:32:44] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163849 (https://phabricator.wikimedia.org/T392177) [18:32:45] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163849 (https://phabricator.wikimedia.org/T392177) (owner: 10TrainBranchBot) [18:32:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:33:34] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163849 (https://phabricator.wikimedia.org/T392177) (owner: 10TrainBranchBot) [18:41:14] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.7 refs T392177 [18:41:21] T392177: 1.45.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T392177 [18:41:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [18:42:32] (03PS2) 10Ladsgroup: [WIP] Use table catalog for fullViews [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) [18:42:46] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.9c [18:42:48] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.9d [18:44:17] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163846 
(https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [18:47:50] (03PS1) 10AOkoth: os_updates: manage stylesheet with puppet [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) [18:47:59] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163839 (owner: 10JHathaway) [18:48:14] (03CR) 10Ladsgroup: "I'm very confused, locally the result is now ordered but not in PCC" [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [18:48:17] (03PS1) 10Dwisehaupt: icinga: decommission frack hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163851 (https://phabricator.wikimedia.org/T397868) [18:48:22] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [18:48:56] (03CR) 10Dwisehaupt: [C:04-1] "Marking -1 until machines are powered off and ready for decom." 
[puppet] - 10https://gerrit.wikimedia.org/r/1163851 (https://phabricator.wikimedia.org/T397868) (owner: 10Dwisehaupt) [18:49:12] (03PS1) 10Eevans: cassandra-dev200[23]: setup for (no reuse) reimaging [puppet] - 10https://gerrit.wikimedia.org/r/1163852 (https://phabricator.wikimedia.org/T391544) [18:49:14] (03PS1) 10Eevans: cassandra-dev2002: updated data_file_directories list [puppet] - 10https://gerrit.wikimedia.org/r/1163853 (https://phabricator.wikimedia.org/T391544) [18:49:15] (03PS1) 10Eevans: cassandra-dev2003: updated data_file_directories list [puppet] - 10https://gerrit.wikimedia.org/r/1163854 (https://phabricator.wikimedia.org/T391544) [18:50:59] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-eqiad and A:cp - 9.2.11 upgrade (T397456) [18:51:04] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 [18:51:44] (03PS2) 10AOkoth: os_updates: manage stylesheet with puppet [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) [18:52:22] (03CR) 10Eevans: [C:03+2] cassandra-dev200[23]: setup for (no reuse) reimaging [puppet] - 10https://gerrit.wikimedia.org/r/1163852 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [18:55:19] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1163850/6079/" [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [18:56:17] (03PS1) 10Ssingh: P:bird and C:bird::anycast: support exporting Prom metrics [puppet] - 10https://gerrit.wikimedia.org/r/1163858 (https://phabricator.wikimedia.org/T374619) [18:56:20] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.9d [18:56:23] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.9e [18:57:28] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 
2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6080/console" [puppet] - 10https://gerrit.wikimedia.org/r/1163858 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [18:57:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:58:03] (03PS3) 10JHathaway: reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [18:58:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:59:35] (03PS1) 10Ssingh: hiera: enable exporting prom metrics from doh1001 for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) [19:00:04] (03CR) 10CI reject: [V:04-1] hiera: enable exporting prom metrics from doh1001 for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [19:00:46] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh) [19:00:57] (03PS2) 10Ssingh: hiera: enable exporting prom metrics from doh1001 for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) [19:04:20] (03CR) 10CI 
reject: [V:04-1] reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [19:06:40] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host cassandra-dev2002.codfw.wmnet with OS bullseye [19:06:53] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10948431 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host cassandra-dev2002.... [19:08:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:10:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:12:07] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.9e [19:12:09] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.9f [19:20:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:21:22] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host 
sretest1003.eqiad.wmnet with OS bookworm [19:21:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:23:04] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage [19:25:46] (03PS3) 10Scott French: hieradata: remove mw-wikifunctions discovery services [puppet] - 10https://gerrit.wikimedia.org/r/1163856 (https://phabricator.wikimedia.org/T384944) [19:26:00] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.9f [19:26:03] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a0 [19:26:24] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage [19:26:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:26:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:28:48] (03CR) 10Scott French: "I happened to notice this while working on the swift-r[ow] turndown earlier today." 
[puppet] - 10https://gerrit.wikimedia.org/r/1163856 (https://phabricator.wikimedia.org/T384944) (owner: 10Scott French) [19:34:59] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1003.eqiad.wmnet with OS bookworm [19:38:41] (03PS1) 10Andrew Bogott: keystone policy: allow object_storage role to create/delete ec2 creds [puppet] - 10https://gerrit.wikimedia.org/r/1163864 (https://phabricator.wikimedia.org/T396594) [19:40:26] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a0 [19:40:29] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a1 [19:41:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:44:28] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2002.codfw.wmnet with OS bullseye [19:44:46] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10948499 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host cassandra-dev2002.codf... 
[19:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:46:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:48:10] (03CR) 10Eevans: [C:03+2] cassandra-dev2002: updated data_file_directories list [puppet] - 10https://gerrit.wikimedia.org/r/1163853 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [19:49:52] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm [19:50:59] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2004.codfw.wmnet with OS bookworm [19:54:14] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a1 [19:54:17] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a2 [19:57:27] (03CR) 10Ebomani: [C:03+1] "Looks good to me! Tested and verified that for the new (non-legacy) Patchdemo related changes we get redirect links in the 'Checks' tab to" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1163289 (https://phabricator.wikimedia.org/T391866) (owner: 10Jeena Huneidi) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T2000). [20:00:05] arlolra and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:19] o/ [20:00:47] (03PS4) 10JHathaway: reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [20:01:09] hi - if a deployer is needed, i can deploy [20:01:17] here [20:01:26] I can handle my deploy [20:01:31] I'm fine doing mine, too. [20:01:42] Kemayo: I'll get started? [20:01:49] arlolra: Go for it, you're first in the list. [20:01:55] Ok [20:02:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159599 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra) [20:03:22] (03Merged) 10jenkins-bot: Undeploy VipsScaler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159599 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra) [20:03:31] FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:47] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1159599|Undeploy VipsScaler (T290759)]] [20:03:53] T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759 [20:04:33] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm [20:04:55] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2004.codfw.wmnet with OS bookworm [20:06:21] 10SRE-swift-storage, 06serviceops, 07Datacenter-Switchover: Turn down unused swift-r[ow] discovery services - https://phabricator.wikimedia.org/T376237#10948585 (10Scott_French) 05Open→03Resolved This is done now. Thanks for the reviews, all! 
[20:06:33] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm [20:06:54] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2004.codfw.wmnet with OS bookworm [20:07:53] (03CR) 10CI reject: [V:04-1] reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi) [20:08:55] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a2 [20:08:58] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a3 [20:09:41] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm [20:09:43] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2004.codfw.wmnet with OS bookworm [20:10:19] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 603.88 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:10:52] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm [20:10:57] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2004.codfw.wmnet with OS bookworm [20:13:29] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:13:54] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm [20:16:11] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host cassandra-dev2003.codfw.wmnet with OS bullseye [20:16:30] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore 
storage capacity - https://phabricator.wikimedia.org/T391544#10948605 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host cassandra-dev2003.... [20:16:59] (03PS2) 10Jcrespo: dbbackups: Disable read only backups and reenable regular rw es backups [puppet] - 10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892) [20:18:58] Hmm, it seems to be "Building container images" for an inordinate amount of time [20:19:09] cjming: Kemayo: any ideas? [20:19:51] arlolra: I haven't seen a stall on that particular one before, sorry. [20:19:57] me neither [20:20:24] (03PS3) 10BryanDavis: [BETA HACK] Changes to profile::puppetserver::volatile [puppet] - 10https://gerrit.wikimedia.org/r/1137013 (owner: 10Krinkle) [20:20:24] arlolra: Looks like localisation files were rebuilt: `537 languages rebuilt out of 537` [20:20:36] Aha, the sympathetic magic of asking about it has caused it to progress. [20:20:42] :) [20:20:50] That results in several gigabytes of data being generated, which takes a long time to containerize and sync. [20:21:26] dancy: thanks. Was that from /var/lib/spiderpig/scap-image-build-and-push-log ? 
[20:21:43] I looked in the job log: https://spiderpig.wikimedia.org/jobs/253 [20:22:26] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a3 [20:22:28] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a4 [20:22:30] Ok [20:22:53] (03CR) 10CI reject: [V:04-1] [BETA HACK] Changes to profile::puppetserver::volatile [puppet] - 10https://gerrit.wikimedia.org/r/1137013 (owner: 10Krinkle) [20:22:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:22:57] (03CR) 10Volans: "Alternative approach suggestion inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163732 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [20:23:23] I would like to deploy an update to the chart-renderer service https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1163812 (maybe when backports are done, though the service deploy is unrelated) [20:23:29] arlolra: The usual cause of this is a direct change to a localisation json file. But in this case the change is indirect, due to the removal of an extension and its associated l10n files. [20:24:46] I see [20:25:00] aude: A parallel deployment should be fine. We're just doing a lot of waiting at the moment. 
[20:25:10] ok thanks [20:25:27] (03CR) 10Aude: [C:03+2] Update the chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163812 (owner: 10Aude) [20:27:05] (03Merged) 10jenkins-bot: Update the chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163812 (owner: 10Aude) [20:27:41] !log aude@deploy1003 helmfile [staging] START helmfile.d/services/chart-renderer: apply [20:27:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:28:18] !log aude@deploy1003 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [20:28:58] (03CR) 10Eevans: [C:03+2] cassandra-dev2003: updated data_file_directories list [puppet] - 10https://gerrit.wikimedia.org/r/1163854 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [20:30:28] !log aude@deploy1003 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [20:31:03] !log aude@deploy1003 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [20:31:35] !log aude@deploy1003 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [20:32:07] !log aude@deploy1003 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [20:32:17] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage [20:32:45] (03PS1) 10Jgreen: Add payments-a-eqiad.wikimedia.org A/PTR records. [dns] - 10https://gerrit.wikimedia.org/r/1163876 (https://phabricator.wikimedia.org/T397865) [20:33:04] I'm done. 
looks good and will be around to monitor [20:34:41] !log arlolra@deploy1003 arlolra: Backport for [[gerrit:1159599|Undeploy VipsScaler (T290759)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:34:47] T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759 [20:36:14] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage [20:36:30] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage [20:37:05] Huh, I hadn't noticed before that people who aren't the user who started the spiderpig run also get the option to answer the "continue with sync?" question. :D [20:37:10] !log arlolra@deploy1003 arlolra: Continuing with sync [20:37:13] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a4 [20:37:16] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a5 [20:38:39] Kemayo: The commit owner is notified too [20:40:03] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage [20:45:35] dancy: I'm not that either, thus my surprise as a completely-unrelated user. [20:46:01] 07Puppet, 10Beta-Cluster-Infrastructure: /usr/local/bin/puppetserver-deploy-code emits scary looking error messages during a `git rebase` operation - https://phabricator.wikimedia.org/T397877 (10bd808) 03NEW [20:46:20] Oh interesting. Can you point me to the notification you're talking about? [20:47:09] Not a notification. When you're looking at https://spiderpig.wikimedia.org/ you see the "continue with sync? [yes] [no]" prompt inside the job-history on the currently-running job. [20:47:52] ooh, gotcha. 
Any user can respond to an interaction. That's right. That's a deliberate behavior. [20:48:33] I figured it made sense as a way to avoid everything getting stuck because someone wandered away, it just caught me by surprise for a second. :D [20:48:41] Nod. [20:50:33] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a5 [20:50:36] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a6 [20:51:25] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159599|Undeploy VipsScaler (T290759)]] (duration: 47m 37s) [20:51:31] T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759 [20:51:43] Kemayo: sorry to have used up so much of the window [20:51:54] arlolra: It was more than I expected, but no worries. [20:52:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139470 (https://phabricator.wikimedia.org/T359815) (owner: 10Esanders) [20:52:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161937 (https://phabricator.wikimedia.org/T395519) (owner: 10Esanders) [20:53:00] (03Merged) 10jenkins-bot: Enable VE in Project (Wikipedia/Վիքիպեդիա) namespace at hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139470 (https://phabricator.wikimedia.org/T359815) (owner: 10Esanders) [20:53:05] (03Merged) 10jenkins-bot: Deploy EditCheck's multi-check mode everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161937 (https://phabricator.wikimedia.org/T395519) (owner: 10Esanders) [20:53:29] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1139470|Enable VE in Project (Wikipedia/Վիքիպեդիա) namespace at hywiki (T359815)]], [[gerrit:1161937|Deploy EditCheck's 
multi-check mode everywhere (T395519)]] [20:53:37] T359815: Enable Visual Editor on Wikipedia namespace on Armenian Wikipedia - https://phabricator.wikimedia.org/T359815 [20:53:37] T395519: [Multi-Check] Deploy Multi-Check (References) to all Wikipedias - https://phabricator.wikimedia.org/T395519 [20:56:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:58:07] !log kemayo@deploy1003 kemayo, esanders: Backport for [[gerrit:1139470|Enable VE in Project (Wikipedia/Վիքիպեդիա) namespace at hywiki (T359815)]], [[gerrit:1161937|Deploy EditCheck's multi-check mode everywhere (T395519)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:59:01] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2004.codfw.wmnet with OS bookworm [20:59:36] !log kemayo@deploy1003 kemayo, esanders: Continuing with sync [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T2100) [21:01:19] (03CR) 10Dwisehaupt: [C:03+1] Add payments-a-eqiad.wikimedia.org A/PTR records. 
[dns] - 10https://gerrit.wikimedia.org/r/1163876 (https://phabricator.wikimedia.org/T397865) (owner: 10Jgreen) [21:01:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:02:23] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2003.codfw.wmnet with OS bullseye [21:02:38] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10948772 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host cassandra-dev2003.codf... [21:03:46] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a6 [21:03:49] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a7 [21:06:15] (03CR) 10Jgreen: [C:03+2] Add payments-a-eqiad.wikimedia.org A/PTR records. 
[dns] - 10https://gerrit.wikimedia.org/r/1163876 (https://phabricator.wikimedia.org/T397865) (owner: 10Jgreen) [21:06:30] !log jgreen@dns1004 START - running authdns-update [21:06:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:07:08] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139470|Enable VE in Project (Wikipedia/Վիքիպեդիա) namespace at hywiki (T359815)]], [[gerrit:1161937|Deploy EditCheck's multi-check mode everywhere (T395519)]] (duration: 13m 38s) [21:07:14] T359815: Enable Visual Editor on Wikipedia namespace on Armenian Wikipedia - https://phabricator.wikimedia.org/T359815 [21:07:15] T395519: [Multi-Check] Deploy Multi-Check (References) to all Wikipedias - https://phabricator.wikimedia.org/T395519 [21:07:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:07:33] !log jgreen@dns1004 END - running authdns-update [21:12:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:13:20] (03PS1) 10BryanDavis: puppetserver: check for rebase in puppetserver-deploy-code [puppet] - 10https://gerrit.wikimedia.org/r/1163883 (https://phabricator.wikimedia.org/T397877) [21:14:51] FIRING: 
SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:16:54] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a7 [21:16:57] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a8 [21:21:33] 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: /usr/local/bin/puppetserver-deploy-code emits scary looking error messages during a `git rebase` operation - https://phabricator.wikimedia.org/T397877#10948815 (10bd808) `lang=shell-session bd808@deployment-puppetserver-1:~$ sudo -i puppet agent -t... [21:24:18] (03PS1) 10Andrew Bogott: Cloudcephosd200[456]-dev: make ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1163884 (https://phabricator.wikimedia.org/T397237) [21:24:59] (03CR) 10Andrew Bogott: [C:03+2] Cloudcephosd200[456]-dev: make ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1163884 (https://phabricator.wikimedia.org/T397237) (owner: 10Andrew Bogott) [21:26:09] (03CR) 10BryanDavis: [V:03+1] "Cherry-picked to deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud and tested for desired behavior. 
See T397877#10948815 fo" [puppet] - 10https://gerrit.wikimedia.org/r/1163883 (https://phabricator.wikimedia.org/T397877) (owner: 10BryanDavis) [21:26:59] 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: /usr/local/bin/puppetserver-deploy-code emits scary looking error messages during a `git rebase` operation - https://phabricator.wikimedia.org/T397877#10948820 (10bd808) 05Open→03In progress p:05Triage→03Medium a:03bd808 [21:27:43] (03CR) 10Andrea Denisse: [C:03+2] centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [21:31:22] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a8 [21:31:25] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a9 [21:31:26] (03PS1) 10Andrew Bogott: Add hiera for new cloudcephosd nodes in codfw1 [puppet] - 10https://gerrit.wikimedia.org/r/1163885 (https://phabricator.wikimedia.org/T397237) [21:32:06] (03CR) 10Andrew Bogott: [C:03+2] Add hiera for new cloudcephosd nodes in codfw1 [puppet] - 10https://gerrit.wikimedia.org/r/1163885 (https://phabricator.wikimedia.org/T397237) (owner: 10Andrew Bogott) [21:33:32] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1163886 [21:34:05] (03CR) 10Ahmon Dancy: [C:03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1163886 (owner: 10Ahmon Dancy) [21:34:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:35:01] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1163886 (owner: 10Ahmon Dancy) [21:35:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:36:10] (03PS1) 10Ahmon Dancy: DevServices.php: Add placeholder for search-chi-dnsdisc [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1163888 [21:36:23] (03CR) 10Ahmon Dancy: [C:03+2] DevServices.php: Add placeholder for search-chi-dnsdisc [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1163888 (owner: 10Ahmon Dancy) [21:37:29] (03Merged) 10jenkins-bot: DevServices.php: Add placeholder for search-chi-dnsdisc [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1163888 (owner: 10Ahmon Dancy) [21:37:47] (03PS1) 10Andrew Bogott: Cloudcephosd200[567]-dev: puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1163889 (https://phabricator.wikimedia.org/T397237) [21:38:27] (03CR) 10Andrew Bogott: [C:03+2] Cloudcephosd200[567]-dev: puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1163889 (https://phabricator.wikimedia.org/T397237) (owner: 10Andrew Bogott) [21:41:32] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2005-dev.codfw.wmnet with OS bullseye [21:47:19] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a9 [21:47:22] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.aa [21:54:43] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes 
(http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:55:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:56:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:57:56] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2005-dev.codfw.wmnet with reason: host reimage [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T2200) [22:03:28] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bookworm [22:03:37] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.aa [22:03:37] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2005-dev.codfw.wmnet with reason: host reimage [22:03:40] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ab [22:16:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad 
- https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:18:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:18:46] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ab [22:18:48] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ac [22:20:09] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [22:21:02] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2005-dev.codfw.wmnet with OS bullseye [22:23:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:24:04] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [22:24:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:25:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2014:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:27:04] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2006-dev.codfw.wmnet with OS bullseye [22:27:06] !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [22:29:47] RESOLVED: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:29:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:32:50] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ac [22:32:52] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ad [22:37:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:40:55] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bookworm [22:41:59] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mobileapps-production in mobileapps at eqiad has 
persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [22:42:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:43:44] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2006-dev.codfw.wmnet with reason: host reimage [22:43:52] !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage [22:45:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:45:53] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:46:03] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:46:38] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ad [22:46:41] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ae [22:47:24] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2006-dev.codfw.wmnet with reason: host reimage [22:51:21] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage [22:52:53] RECOVERY - mailman list info ssl expiry on 
lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:53:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:54:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:54:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54082 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:58:40] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10948977 (Andrew) Currently we only have one NIC connected for each of these. Ports are scarce in that rack, so the plan (in too much detail) is:...
[23:00:15] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10948980 (Andrew) Resolved→Open [23:02:16] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ae [23:02:19] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.af [23:03:28] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2006-dev.codfw.wmnet with OS bullseye [23:03:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [23:06:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [23:08:47] !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [23:16:12] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.af [23:16:15] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b0 [23:21:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 -
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [23:22:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [23:27:15] (PS1) Andrea Denisse: Revert "centrallog: Add a temporary rsyslog debug config file" [puppet] - https://gerrit.wikimedia.org/r/1163899 [23:29:20] (CR) CI reject: [V:-1] Revert "centrallog: Add a temporary rsyslog debug config file" [puppet] - https://gerrit.wikimedia.org/r/1163899 (owner: Andrea Denisse) [23:31:41] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.b0 [23:31:44] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b1 [23:37:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [23:38:31] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1163900 [23:38:31] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1163900 (owner: TrainBranchBot) [23:38:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 -
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [23:42:29] (PS2) Andrea Denisse: Revert "centrallog: Add a temporary rsyslog debug config file" [puppet] - https://gerrit.wikimedia.org/r/1163899 [23:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [23:46:07] (CR) Andrea Denisse: [C:+2] Revert "centrallog: Add a temporary rsyslog debug config file" [puppet] - https://gerrit.wikimedia.org/r/1163899 (owner: Andrea Denisse) [23:46:52] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.b1 [23:46:54] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b2 [23:48:27] (PS1) Andrea Denisse: Revert^2 "centrallog: Add a temporary rsyslog debug config file" [puppet] - https://gerrit.wikimedia.org/r/1163901 [23:49:39] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1163900 (owner: TrainBranchBot) [23:58:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [23:59:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures