[00:02:55] <logmsgbot>	 jhancock@cumin1003 provision (PID 3053871) is awaiting input
[00:03:30] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:04:12] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.4e
[00:04:15] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.4f
[00:04:57] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaMessages] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1163403 (owner: Stang)
[00:05:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:07:13] <logmsgbot>	 jhancock@cumin1003 provision (PID 3056449) is awaiting input
[00:08:28] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1163492
[00:08:28] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1163492 (owner: TrainBranchBot)
[00:10:41] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2005-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:10:55] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2006-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:11:34] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2007-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:13:29] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[00:15:23] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2006-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:15:34] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.4f
[00:15:36] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.50
[00:18:36] <logmsgbot>	 jhancock@cumin1003 provision (PID 3056821) is awaiting input
[00:25:32] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2005-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:25:43] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2007-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:29:15] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1163492 (owner: TrainBranchBot)
[00:29:24] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.50
[00:29:27] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.51
[00:29:40] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2005-dev.codfw.wmnet with OS bullseye
[00:29:49] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945465 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host cloudcephosd2005-dev.codfw.wmnet with OS...
[00:42:41] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2006-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:42:46] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd2006-dev']
[00:42:55] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd2006-dev']
[00:43:19] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2006-dev.codfw.wmnet with OS bullseye
[00:43:29] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945492 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host cloudcephosd2006-dev.codfw.wmnet with OS...
[00:43:56] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2007-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:44:23] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[00:44:31] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945493 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host cloudcephosd2007-dev.codfw.wmnet with OS...
[00:44:41] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.51
[00:44:44] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.52
[00:45:19] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2005-dev.codfw.wmnet with reason: host reimage
[00:46:40] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/0a99d5a1b686396d5c351ea7dc4d928f57630c612633dd8fdbc18679486af8a0/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[00:48:45] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2005-dev.codfw.wmnet with reason: host reimage
[00:59:24] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2006-dev.codfw.wmnet with reason: host reimage
[01:00:04] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage
[01:02:06] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.52
[01:02:09] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.53
[01:02:30] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2006-dev.codfw.wmnet with reason: host reimage
[01:05:05] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage
[01:06:41] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:12:14] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[01:15:18] <logmsgbot>	 jhancock@cumin1003 reimage (PID 3059175) is awaiting input
[01:15:57] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.53
[01:16:00] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.54
[01:22:34] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[01:25:38] <logmsgbot>	 jhancock@cumin1003 reimage (PID 3059765) is awaiting input
[01:27:50] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[01:29:47] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.54
[01:29:50] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.55
[01:30:54] <logmsgbot>	 jhancock@cumin1003 reimage (PID 3059700) is awaiting input
[01:31:33] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[01:31:34] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2005-dev.codfw.wmnet with OS bullseye
[01:31:36] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[01:31:36] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2006-dev.codfw.wmnet with OS bullseye
[01:31:38] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[01:31:39] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[01:31:42] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945597 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host cloudcephosd2005-dev.codfw.wmnet with OS bul...
[01:31:45] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945598 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host cloudcephosd2006-dev.codfw.wmnet with OS bul...
[01:31:46] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945599 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host cloudcephosd2007-dev.codfw.wmnet with OS bul...
[01:35:26] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945602 (Jhancock.wm) Open→Resolved
[01:35:47] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10945605 (Jhancock.wm) @Andrew done!
[01:37:06] <wikibugs>	 ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10945606 (Jhancock.wm) @volans give it a shot on cp2044. if you have any issues with it, lmk
[01:43:17] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.55
[01:43:20] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.56
[01:43:49] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[01:44:11] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[01:58:04] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.56
[01:58:07] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.57
[02:12:09] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.57
[02:12:12] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.58
[02:26:21] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3633 MB (3% inode=98%): /tmp 3633 MB (3% inode=98%): /var/tmp 3633 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[02:26:40] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.58
[02:26:43] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.59
[02:33:53] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 120211312 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:34:53] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 4256016 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:41:08] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.59
[02:41:11] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.5a
[02:44:28] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:54:54] <wikibugs>	 (PS1) Krinkle: beta: Switch excimer-ui-url service from wmflabs.org to wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1163502 (https://phabricator.wikimedia.org/T289318)
[02:55:36] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.5a
[02:55:39] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.5b
[02:58:53] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 158335184 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:59:53] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 6059584 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:02:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:07:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:09:00] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.5b
[03:09:03] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.5c
[03:23:12] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.5c
[03:23:15] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.5d
[03:36:13] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.5d
[03:36:16] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.5e
[03:45:41] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[03:46:21] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3625 MB (3% inode=98%): /tmp 3625 MB (3% inode=98%): /var/tmp 3625 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[03:51:57] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.5e
[03:52:00] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.5f
[03:54:44] <wikibugs>	 (PS7) Scott French: P:etcd::tlsproxy: add support for PKI certs [puppet] - https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245)
[03:54:44] <wikibugs>	 (PS4) Scott French: hieradata: pilot cfssl/pki for nginx on conf2006 [puppet] - https://gerrit.wikimedia.org/r/1090583 (https://phabricator.wikimedia.org/T352245)
[03:54:44] <wikibugs>	 (PS4) Scott French: hieradata: use cfssl/pki for nginx on all codfw configcluster hosts [puppet] - https://gerrit.wikimedia.org/r/1090585 (https://phabricator.wikimedia.org/T352245)
[03:54:45] <wikibugs>	 (PS5) Scott French: hieradata: use cfssl/pki for nginx on all configcluster hosts [puppet] - https://gerrit.wikimedia.org/r/1090586 (https://phabricator.wikimedia.org/T352245)
[03:57:15] <wikibugs>	 (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) (owner: Scott French)
[03:57:26] <wikibugs>	 (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1090583 (https://phabricator.wikimedia.org/T352245) (owner: Scott French)
[04:03:30] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:06:14] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.5f
[04:06:17] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.60
[04:08:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:13:29] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:13:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:20:44] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.60
[04:20:47] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.61
[04:34:22] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.61
[04:34:24] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.62
[04:49:23] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.62
[04:49:25] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.63
[05:05:17] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.63
[05:05:20] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.64
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:18:30] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.64
[05:18:32] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.65
[05:19:44] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[05:33:12] <wikibugs>	 (Abandoned) Stang: Fix missing Chinese translation related to temporary accounts [extensions/WikimediaMessages] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1163403 (owner: Stang)
[05:33:57] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.65
[05:34:00] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.66
[05:47:32] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.66
[05:47:35] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.67
[05:54:28] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T0600)
[06:02:32] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.67
[06:02:35] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.68
[06:06:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:12:45] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937#10945822 (Morale99) For anyone dealing with fiber connections or testing networks, using a good quality [[ https://www.firefold.com/collections/fibe...
[06:16:28] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.68
[06:16:31] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.69
[06:21:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:30:37] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.69
[06:30:40] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6a
[06:42:43] <wikibugs>	 (PS1) Arnaudb: mailman: alert on out queue being too full [alerts] - https://gerrit.wikimedia.org/r/1163628 (https://phabricator.wikimedia.org/T397715)
[06:44:37] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.6a
[06:44:40] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6b
[06:51:42] <wikibugs>	 (CR) Jelto: "looks mostly good, two comments in-line" [cookbooks] - https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: Arnaudb)
[06:53:04] <wikibugs>	 (PS1) Muehlenhoff: Record extended contract date for rkhan [puppet] - https://gerrit.wikimedia.org/r/1163629
[06:54:48] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Record extended contract date for rkhan [puppet] - https://gerrit.wikimedia.org/r/1163629 (owner: Muehlenhoff)
[06:57:18] <wikibugs>	 (PS1) Samwilson: Revert "InitialiseSettings: Enable TemplateDiscovery on almost all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1163630
[06:57:57] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1163630 (owner: Samwilson)
[06:59:20] <wikibugs>	 (CR) Jelto: "Thanks for adding the alert! Two suggestions in line" [alerts] - https://gerrit.wikimedia.org/r/1163628 (https://phabricator.wikimedia.org/T397715) (owner: Arnaudb)
[06:59:42] <logmsgbot>	 !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=1) Checking container DBs of wikipedia-commons-local-thumb.6b
[06:59:44] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6c
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T0700). Please do the needful.
[07:00:04] <jouncebot>	 suzannewoodWMDE2, isaranto, Kizule, and samwilson: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:13] <isaranto>	 o/
[07:00:18] <suzannewoodWMDE2>	 I am here
[07:00:22] <Kizule>	 Same
[07:01:16] <samwilson>	 I also am here
[07:05:23] <isaranto>	 is anyone deploying? is it ok if I start my patch?
[07:08:56] <samwilson>	 isaranto: I'm not sure who's deploying today. Amir1, Urbanecm, or awight are any of you around?
[07:10:35] <wikibugs>	 (PS1) Kosta Harlan: Pass SecurityLogContext to logger [mediawiki-config] - https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204)
[07:10:52] <wikibugs>	 (PS5) Arnaudb: gerrit: read-only plugin orchestration in failover [cookbooks] - https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440)
[07:12:45] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.6c
[07:12:48] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6d
[07:12:56] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204) (owner: Kosta Harlan)
[07:13:10] <kostajh>	 isaranto: are you deploying? 
[07:13:35] <isaranto>	 no but I can start!
[07:14:01] <kostajh>	 sounds good to me 
[07:14:12] <isaranto>	 starting!
[07:14:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by isaranto@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1163405 (https://phabricator.wikimedia.org/T395824) (owner: Ilias Sarantopoulos)
[07:15:33] <wikibugs>	 (Merged) jenkins-bot: ores-extension: enable revertrisk filter in UI for third batch [mediawiki-config] - https://gerrit.wikimedia.org/r/1163405 (https://phabricator.wikimedia.org/T395824) (owner: Ilias Sarantopoulos)
[07:16:12] <logmsgbot>	 !log isaranto@deploy1003 Started scap sync-world: Backport for [[gerrit:1163405|ores-extension: enable revertrisk filter in UI for third batch (T395824)]]
[07:16:17] <stashbot>	 T395824: [batch #3] Enable revertrisk filters in recent changes in multiple wikis  - https://phabricator.wikimedia.org/T395824
[07:16:20] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+1] Pass SecurityLogContext to logger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204) (owner: 10Kosta Harlan)
[07:17:35] <wikibugs>	 (03PS1) 10Muehlenhoff: debmonitor_dev: Update bind address for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1163634 (https://phabricator.wikimedia.org/T397696)
[07:18:34] <logmsgbot>	 !log isaranto@deploy1003 isaranto: Backport for [[gerrit:1163405|ores-extension: enable revertrisk filter in UI for third batch (T395824)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:18:41] <isaranto>	 testing!
[07:19:26] <wikibugs>	 (03PS1) 10Stevemunene: hdfs: set an-worker1176 to analytics-fex recipe [puppet] - 10https://gerrit.wikimedia.org/r/1163635 (https://phabricator.wikimedia.org/T390176)
[07:19:39] <kostajh>	 isaranto: are you able to do the other config patches in the window as well? 
[07:19:40] <wikibugs>	 (03CR) 10CI reject: [V:04-1] debmonitor_dev: Update bind address for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1163634 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[07:20:37] <wikibugs>	 (03PS2) 10Muehlenhoff: debmonitor_dev: Update bind address for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1163634 (https://phabricator.wikimedia.org/T397696)
[07:21:33] <wikibugs>	 (03CR) 10Kosta Harlan: Activate feature to resolve wikibase link labels in pilot wiki changelists (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE)
[07:22:23] <kostajh>	 suzannewoodWMDE2: I have a question for you on the config patch above ^
[07:22:32] <suzannewoodWMDE2>	 ok!
[07:23:06] <logmsgbot>	 !log isaranto@deploy1003 isaranto: Continuing with sync
[07:23:35] <wikibugs>	 (03CR) 10Majavah: [C:03+1] Clean up EventBus and jobs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163323 (https://phabricator.wikimedia.org/T397367) (owner: 10Ladsgroup)
[07:23:42] <isaranto>	 sry I was QAing
[07:23:51] <wikibugs>	 (03Abandoned) 10Cathal Mooney: Netbox hosts: add netbox-dns reposync repo so it is available [puppet] - 10https://gerrit.wikimedia.org/r/1163382 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[07:23:58] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Enable temporarily read only backups for refresh [puppet] - 10https://gerrit.wikimedia.org/r/1163645 (https://phabricator.wikimedia.org/T387892)
[07:24:33] <wikibugs>	 (03CR) 10Suzanne Wood: Activate feature to resolve wikibase link labels in pilot wiki changelists (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE)
[07:24:50] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163634 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[07:24:59] <isaranto>	 kostajh: I can deploy other patches as well, but I'll have to go in 30'
[07:25:11] <isaranto>	 I'm taking a look at the other patches atm
[07:25:28] <wikibugs>	 (03PS2) 10Joely Rooke WMDE: Activate feature to resolve wikibase link labels in pilot wiki changelists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685)
[07:25:56] <wikibugs>	 (03PS4) 10Arnaudb: mailman: alert on out queue being too full [alerts] - 10https://gerrit.wikimedia.org/r/1163628 (https://phabricator.wikimedia.org/T397715)
[07:26:09] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Remove external cloud sync from Puppet 5 frontends [puppet] - 10https://gerrit.wikimedia.org/r/1163399 (owner: 10Muehlenhoff)
[07:26:11] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Enable temporarily read only backups for refresh [puppet] - 10https://gerrit.wikimedia.org/r/1163645 (https://phabricator.wikimedia.org/T387892)
[07:26:12] <wikibugs>	 (03PS2) 10Kosta Harlan: Pass SecurityLogContext to logger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204)
[07:26:19] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.6d
[07:26:21] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6e
[07:26:23] <wikibugs>	 (03CR) 10Suzanne Wood: [C:03+1] Activate feature to resolve wikibase link labels in pilot wiki changelists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE)
[07:26:52] <kostajh>	 isaranto: cool. I need a few more minutes on mine
[07:26:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163368 (https://phabricator.wikimedia.org/T368744) (owner: 10Volans)
[07:27:08] <wikibugs>	 (03CR) 10Kosta Harlan: Activate feature to resolve wikibase link labels in pilot wiki changelists (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE)
[07:27:09] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163645 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[07:27:10] <isaranto>	 tbh I'd prefer not to. the rest of the patches haven't been reviewed
[07:27:47] <kostajh>	 ok, I don't mind doing them
[07:28:05] <isaranto>	 thank you kostajh  <3
[07:28:15] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1149.eqiad.wmnet
[07:28:22] <kostajh>	 suzannewoodWMDE2: are you able to verify your change when it's deployed? 
[07:28:29] <kostajh>	 same question for samwilson 
[07:28:52] <wikibugs>	 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work, 13Patch-For-Review: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10945907 (10elukey) Done! The patch is ready to go in my opinion, thanks!
[07:29:15] <suzannewoodWMDE2>	 Thanks! : ) We've addressed your comment so 1163372 is ready. Yes we can verify when it's deployed
[07:29:18] <samwilson>	 kostajh: yep, I can verify
[07:29:19] <wikibugs>	 (03PS2) 10Kosta Harlan: Pass SecurityLogContext to logger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204)
[07:29:29] <wikibugs>	 (03PS3) 10Jcrespo: dbbackups: Enable temporarily read only backups for refresh [puppet] - 10https://gerrit.wikimedia.org/r/1163645 (https://phabricator.wikimedia.org/T387892)
[07:30:12] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163645 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[07:30:20] <logmsgbot>	 !log isaranto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163405|ores-extension: enable revertrisk filter in UI for third batch (T395824)]] (duration: 14m 08s)
[07:30:26] <stashbot>	 T395824: [batch #3] Enable revertrisk filters in recent changes in multiple wikis  - https://phabricator.wikimedia.org/T395824
[07:30:29] <isaranto>	 done!
[07:30:45] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1149.eqiad.wmnet
[07:31:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE)
[07:31:21] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks great, one question/doubt inline" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[07:31:56] <isaranto>	 kostajh: you can go ahead
[07:32:04] <wikibugs>	 (03Merged) 10jenkins-bot: Activate feature to resolve wikibase link labels in pilot wiki changelists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE)
[07:32:25] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+1] Pass SecurityLogContext to logger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204) (owner: 10Kosta Harlan)
[07:32:27] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1163372|Activate feature to resolve wikibase link labels in pilot wiki changelists (T388685)]]
[07:32:35] <stashbot>	 T388685: Show labels for properties and items on Wikipedia watchlist summaries - https://phabricator.wikimedia.org/T388685
[07:33:30] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] dbbackups: Enable temporarily read only backups for refresh [puppet] - 10https://gerrit.wikimedia.org/r/1163645 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[07:34:43] <logmsgbot>	 !log kharlan@deploy1003 joelyrookewmde, kharlan: Backport for [[gerrit:1163372|Activate feature to resolve wikibase link labels in pilot wiki changelists (T388685)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:36:06] <kostajh>	 suzannewoodWMDE2: please verify on mwdebug 
[07:36:27] <suzannewoodWMDE2>	 It works!
[07:37:21] <wikibugs>	 (03PS1) 10Slyngshede: P:dns::auth::netbox  Netbox DNS zones file sync [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985)
[07:38:03] <logmsgbot>	 !log kharlan@deploy1003 joelyrookewmde, kharlan: Continuing with sync
[07:38:07] <kostajh>	 cool :) 
[07:38:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[07:39:57] <logmsgbot>	 !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=1) Checking container DBs of wikipedia-commons-local-thumb.6e
[07:40:00] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.6f
[07:40:56] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1175.eqiad.wmnet
[07:42:28] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1163628 (https://phabricator.wikimedia.org/T397715) (owner: 10Arnaudb)
[07:42:41] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1175.eqiad.wmnet
[07:42:45] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mailman: alert on out queue being too full [alerts] - 10https://gerrit.wikimedia.org/r/1163628 (https://phabricator.wikimedia.org/T397715) (owner: 10Arnaudb)
[07:44:00] <wikibugs>	 (03Merged) 10jenkins-bot: mailman: alert on out queue being too full [alerts] - 10https://gerrit.wikimedia.org/r/1163628 (https://phabricator.wikimedia.org/T397715) (owner: 10Arnaudb)
[07:44:12] <wikibugs>	 (03CR) 10Jelto: "lgtm now, thank you" [cookbooks] - 10https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: 10Arnaudb)
[07:45:09] <wikibugs>	 (03PS1) 10Stevemunene: hdfs: readd group 9 and 10 hosts back to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1163691 (https://phabricator.wikimedia.org/T390176)
[07:45:30] <wikibugs>	 (03CR) 10Volans: kubernetes: add a new kubernetes section (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[07:45:30] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163372|Activate feature to resolve wikibase link labels in pilot wiki changelists (T388685)]] (duration: 13m 03s)
[07:45:36] <stashbot>	 T388685: Show labels for properties and items on Wikipedia watchlist summaries - https://phabricator.wikimedia.org/T388685
[07:45:41] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:46:11] <kostajh>	 alright, on to samwilson's patch 
[07:46:42] <suzannewoodWMDE2>	 Thanks!
[07:46:44] <kostajh>	 Kizule: I'll sync yours at the same time as well 
[07:46:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163630 (owner: 10Samwilson)
[07:46:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163365 (https://phabricator.wikimedia.org/T392363) (owner: 10Zoranzoki21)
[07:47:41] <Kizule>	 Oh, I'm still here.
[07:47:43] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "InitialiseSettings: Enable TemplateDiscovery on almost all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163630 (owner: 10Samwilson)
[07:47:46] <wikibugs>	 (03Merged) 10jenkins-bot: Enable block feature for AbuseFilter on all small Serbian wikiprojects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163365 (https://phabricator.wikimedia.org/T392363) (owner: 10Zoranzoki21)
[07:48:04] <wikibugs>	 (03PS1) 10Stevemunene: hdfs: set an-worker1176 to reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/1163692 (https://phabricator.wikimedia.org/T390176)
[07:48:08] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1163630|Revert "InitialiseSettings: Enable TemplateDiscovery on almost all wikis"]], [[gerrit:1163365|Enable block feature for AbuseFilter on all small Serbian wikiprojects (T392363)]]
[07:48:13] <stashbot>	 T392363: Enable block feature for AbuseFilter on all small Serbian wikiprojects - https://phabricator.wikimedia.org/T392363
[07:48:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:48:23] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10945954 (10Stevemunene) `an-worker1175` had the drives in an UGood state  `...
[07:48:52] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10945955 (10Stevemunene)
[07:48:59] <kostajh>	 suzannewoodWMDE2: I think we need to roll back your patch 
[07:49:56] <kostajh>	 suzannewoodWMDE2: https://logstash.wikimedia.org/goto/b55135916319237318f0f77abeed4093
[07:49:57] <suzannewoodWMDE2>	 Ok, what's the problem?
[07:50:03] <kostajh>	 I should have checked the logs during deploy, my fault.
[07:50:25] <logmsgbot>	 !log kharlan@deploy1003 zoranzoki21, kharlan, samwilson: Backport for [[gerrit:1163630|Revert "InitialiseSettings: Enable TemplateDiscovery on almost all wikis"]], [[gerrit:1163365|Enable block feature for AbuseFilter on all small Serbian wikiprojects (T392363)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:50:42] <wikibugs>	 (03PS1) 10Kosta Harlan: Revert "Activate feature to resolve wikibase link labels in pilot wiki changelists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163693
[07:50:43] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Disable read only backups and reenable regular rw es backups [puppet] - 10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892)
[07:50:51] <kostajh>	 samwilson / Kizule please verify your changes 
[07:50:57] <wikibugs>	 (03CR) 10Volans: [C:03+1] "Nice LGTM, would be nice to complete the test coverage. Not a blocker." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi)
[07:50:57] <Kizule>	 kostajh: Mine is good to go.
[07:51:07] <kostajh>	 +1
[07:51:16] <wikibugs>	 (03CR) 10Jcrespo: [C:04-2] "Backups have not completed yet, wait." [puppet] - 10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[07:51:16] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163693 (owner: 10Kosta Harlan)
[07:51:46] <suzannewoodWMDE2>	 Oh yeah we see the error, thanks for reverting
[07:52:42] <wikibugs>	 (03CR) 10Cmelo: Release the CampaignEvents extension to all Wikipedias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) (owner: 10Cmelo)
[07:52:49] <wikibugs>	 (03CR) 10Cmelo: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) (owner: 10Cmelo)
[07:52:56] <kostajh>	 samwilson: are we OK to proceed? 
[07:53:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:53:22] <samwilson>	 kostajh: yep!
[07:53:33] <logmsgbot>	 !log kharlan@deploy1003 zoranzoki21, kharlan, samwilson: Continuing with sync
[07:55:04] <wikibugs>	 (03CR) 10Volans: [C:03+2] redfish: add support for iDRAC 10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162986 (https://phabricator.wikimedia.org/T392851) (owner: 10Volans)
[07:55:22] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.6f
[07:55:25] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.70
[07:58:15] <jinxer-wm>	 FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:58:30] <jinxer-wm>	 FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:00:05] <jouncebot>	 jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T0800)
[08:00:45] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163630|Revert "InitialiseSettings: Enable TemplateDiscovery on almost all wikis"]], [[gerrit:1163365|Enable block feature for AbuseFilter on all small Serbian wikiprojects (T392363)]] (duration: 12m 37s)
[08:00:51] <stashbot>	 T392363: Enable block feature for AbuseFilter on all small Serbian wikiprojects - https://phabricator.wikimedia.org/T392363
[08:01:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163693 (owner: 10Kosta Harlan)
[08:01:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[08:01:58] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Activate feature to resolve wikibase link labels in pilot wiki changelists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163693 (owner: 10Kosta Harlan)
[08:02:06] <vgutierrez>	 expected?
[08:02:12] <vgutierrez>	 !incidents
[08:02:12] <sirenbot>	 6427 (UNACKED)  ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet esams)
[08:02:13] <kostajh>	 vgutierrez: yes, reverting a change 
[08:02:18] <vgutierrez>	 !ack 6427
[08:02:18] <sirenbot>	 6427 (ACKED)  ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet esams)
[08:02:23] <vgutierrez>	 thx kostajh 
[08:02:25] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1163693|Revert "Activate feature to resolve wikibase link labels in pilot wiki changelists"]]
[08:02:41] <_joe_>	 not expected but we know why
[08:02:58] <vgutierrez>	 _joe_: expected as in "we know what's going on" :)
[08:03:11] * hnowlan here
[08:03:18] <hnowlan>	 ah :) 
[08:03:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:04:09] <wikibugs>	 (03Merged) 10jenkins-bot: redfish: add support for iDRAC 10 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162986 (https://phabricator.wikimedia.org/T392851) (owner: 10Volans)
[08:04:12] <kostajh>	 yeah, sorry. for the next time: how should I alert SRE that we know the cause and are in the process of reverting a patch?
[08:04:38] <vgutierrez>	 pinging us here or -sre should be enough
[08:04:38] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1163693|Revert "Activate feature to resolve wikibase link labels in pilot wiki changelists"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:05:00] <kostajh>	 ack
[08:05:04] <wikibugs>	 (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162908 (owner: 10PipelineBot)
[08:05:22] <Amir1>	 I'm around if you need help (oncall)
[08:05:28] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi)
[08:05:29] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with sync
[08:05:31] <wikibugs>	 (03CR) 10Volans: [C:03+2] Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi)
[08:05:39] <kostajh>	 this should resolve shortly 
[08:05:49] <kostajh>	 k8s-willing
[08:06:51] <jinxer-wm>	 FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[08:06:54] <_joe_>	 it's going down fast
[08:06:57] <_joe_>	 heh
[08:07:04] <wikibugs>	 (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162908 (owner: 10PipelineBot)
[08:08:56] <wikibugs>	 (03PS1) 10Jelto: cleanup prerm script update-alternatives command [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1163695 (https://phabricator.wikimedia.org/T387548)
[08:08:58] <kostajh>	 I started a discussion in -developer-experience on Slack about including mediawiki-debug messages in the spiderpig.wikimedia.org UI 
[08:09:07] <jynus>	 was that the same alert going twice or is there a difference?
[08:09:38] <vgutierrez>	 jynus: different PoPs
[08:10:09] <kostajh>	 there is another logspam issue fwiw with Extension:Cite (https://logstash.wikimedia.org/goto/21ee9b1086f65aa9d536247d2d159a5c) but not related to this deployment window 
[08:10:17] <jynus>	 thanks, it wasn't clear on the msg to me
[08:10:46] <vgutierrez>	 yeah.. we should include the site
[08:10:58] <wikibugs>	 (03PS1) 10Hashar: Check if details marker is set before accessing it [extensions/Cite] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163696 (https://phabricator.wikimedia.org/T397760)
[08:11:05] <kostajh>	 T397760 is the other logspam issue 
[08:11:06] <stashbot>	 T397760: PHP Warning: Undefined array key "details" - https://phabricator.wikimedia.org/T397760
[08:11:42] <hashar>	 o/
[08:11:51] <jinxer-wm>	 RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[08:12:05] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.70
[08:12:08] <hashar>	 I am happy to backport that log spam patch now if that can help, but I don't think it is related to whatever is ongoing right now
[08:12:08] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.71
[08:12:44] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163693|Revert "Activate feature to resolve wikibase link labels in pilot wiki changelists"]] (duration: 10m 18s)
[08:12:59] <kostajh>	 it's not
[08:13:19] <kostajh>	 the error messages related to https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1163372 should be resolved
[08:13:24] <kostajh>	 one more config patch to go in this window 
[08:13:29] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:13:31] <kostajh>	 jouncebot: nowandnext
[08:13:31] <jouncebot>	 For the next 1 hour(s) and 46 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T0800)
[08:13:31] <jouncebot>	 In 1 hour(s) and 46 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1000)
[08:14:11] <hashar>	 +1 for deployment
[08:14:12] <wikibugs>	 (03PS1) 10Vgutierrez: ATSBackendErrorsHigh: Report the impacted site on summary [alerts] - 10https://gerrit.wikimedia.org/r/1163698
[08:14:13] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C:03+2] Check if details marker is set before accessing it [extensions/Cite] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163696 (https://phabricator.wikimedia.org/T397760) (owner: 10Hashar)
[08:14:13] <kostajh>	 I will try not to break everything this time
[08:14:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204) (owner: 10Kosta Harlan)
[08:14:32] <hashar>	 the train will be run later tonight by Jeena (she is on US west coast)
[08:15:15] <hashar>	 breaking stuff is fine, as long as you fix it :]
[08:15:19] <wikibugs>	 (03Merged) 10jenkins-bot: Pass SecurityLogContext to logger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163633 (https://phabricator.wikimedia.org/T395204) (owner: 10Kosta Harlan)
[08:15:41] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1163633|Pass SecurityLogContext to logger (T395204)]]
[08:15:46] <stashbot>	 T395204: MediaWiki should log request information (IP, user agent, referrer, HTTP method, etc) in a more uniform and predictable way - https://phabricator.wikimedia.org/T395204
[08:15:52] <wikibugs>	 (03Merged) 10jenkins-bot: Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi)
[08:17:45] <wikibugs>	 (03PS3) 10Cathal Mooney: Netbox hosts: ensure reposync repos are set up to match cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985)
[08:17:53] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1163633|Pass SecurityLogContext to logger (T395204)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:18:38] <wikibugs>	 (03CR) 10Volans: [C:03+2] Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi)
[08:19:45] <wikibugs>	 (03CR) 10Suzanne Wood: [C:03+1] "What happened was:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE)
[08:20:32] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with sync
[08:21:46] <wikibugs>	 (03PS4) 10Cathal Mooney: Netbox hosts: ensure reposync repos are set up to match cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985)
[08:21:48] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[08:22:05] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10946089 (10Fabfur) User added to the phabricator "nda" group
[08:22:09] <wikibugs>	 (03PS3) 10Muehlenhoff: debmonitor_dev: Update bind address for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1163634 (https://phabricator.wikimedia.org/T397696)
[08:23:02] <wikibugs>	 (03PS1) 10Hnowlan: mobileapps: remove memory limit for canary release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163702 (https://phabricator.wikimedia.org/T397750)
[08:23:26] <wikibugs>	 (03PS2) 10Herron: admin: add ldap_only entry for derhexer [puppet] - 10https://gerrit.wikimedia.org/r/1160216 (https://phabricator.wikimedia.org/T397099)
[08:24:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French)
[08:24:45] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:24:55] <icinga-wm>	 PROBLEM - Backup freshness on backup1014 is CRITICAL: All failures: 2 (backup1013, ...), Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:24:58] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1160216 (https://phabricator.wikimedia.org/T397099) (owner: 10Herron)
[08:25:31] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[08:25:48] <wikibugs>	 (03CR) 10Btullis: [C:03+1] hdfs: set an-worker1176 to analytics-fex recipe [puppet] - 10https://gerrit.wikimedia.org/r/1163635 (https://phabricator.wikimedia.org/T390176) (owner: 10Stevemunene)
[08:25:58] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.71
[08:26:01] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.72
[08:26:16] <wikibugs>	 (03CR) 10Btullis: [C:03+1] hdfs: readd group 9 and 10 hosts back to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1163691 (https://phabricator.wikimedia.org/T390176) (owner: 10Stevemunene)
[08:26:42] <joelyrookewmde>	 Sorry all! Forgot that a crucial part of the change for 1163372 is still on this week's train and not deployed to all pilot wikis where the feature was activated.
[08:26:50] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] hdfs: set an-worker1176 to analytics-fex recipe [puppet] - 10https://gerrit.wikimedia.org/r/1163635 (https://phabricator.wikimedia.org/T390176) (owner: 10Stevemunene)
[08:27:00] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] admin: add ldap_only entry for derhexer [puppet] - 10https://gerrit.wikimedia.org/r/1160216 (https://phabricator.wikimedia.org/T397099) (owner: 10Herron)
[08:27:25] <wikibugs>	 (03PS4) 10Ayounsi: reimage: add MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360
[08:27:33] <wikibugs>	 (03CR) 10Muehlenhoff: kubernetes: add a new kubernetes section (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[08:28:01] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163633|Pass SecurityLogContext to logger (T395204)]] (duration: 12m 19s)
[08:28:06] <stashbot>	 T395204: MediaWiki should log request information (IP, user agent, referrer, HTTP method, etc) in a more uniform and predictable way - https://phabricator.wikimedia.org/T395204
[08:28:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1090583 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French)
[08:28:22] <wikibugs>	 (03PS5) 10Cathal Mooney: Netbox hosts: ensure reposync repos are set up to match cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985)
[08:28:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] debmonitor_dev: Update bind address for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1163634 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[08:28:49] <kostajh>	 joelyrookewmde: it's ok, thanks for commenting on the task and hopefully the next deployment is smoother :)
[08:28:59] <wikibugs>	 (03CR) 10Hashar: "No worries @suzanne.wood@wikimedia.de, can you copy paste this comment on the Phabricator task T388685 please? That will help discovery la" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE)
[08:29:00] <moritzm>	 fabfur: I'll merge your data.yaml patch along
[08:29:15] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi)
[08:29:27] <wikibugs>	 (03Merged) 10jenkins-bot: Check if details marker is set before accessing it [extensions/Cite] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163696 (https://phabricator.wikimedia.org/T397760) (owner: 10Hashar)
[08:29:40] <hashar>	 joelyrookewmde: it is perfectly fine no worries. Ideally that should have been caught by a test that ensures the config setting works with both deployed versions but we do not have such testing system :]
[08:29:50] <hashar>	 joelyrookewmde: that got caught and rolled back. It is fine!
[08:30:05] <kostajh>	 !log UTC morning deploys done
[08:30:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:29] <wikibugs>	 (03PS2) 10Hnowlan: mobileapps: remove memory limit for canary release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163702 (https://phabricator.wikimedia.org/T397750)
[08:30:50] <hashar>	 I am deploying the Cite patch
[08:31:42] <wikibugs>	 (03CR) 10Ayounsi: "Addressed all the comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 (owner: 10Ayounsi)
[08:31:47] <hashar>	 hmm or maybe Thiemo is on it
[08:32:24] <logmsgbot>	 !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1163696|Check if details marker is set before accessing it (T397760)]]
[08:32:27] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[08:32:29] <stashbot>	 T397760: PHP Warning: Undefined array key "details" - https://phabricator.wikimedia.org/T397760
[08:32:51] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye
[08:33:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946138 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cu...
[08:34:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] reimage: add MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 (owner: 10Ayounsi)
[08:34:35] <logmsgbot>	 !log hashar@deploy1003 hashar: Backport for [[gerrit:1163696|Check if details marker is set before accessing it (T397760)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:35:25] <logmsgbot>	 !log hashar@deploy1003 hashar: Continuing with sync
[08:36:04] <wikibugs>	 (03PS5) 10Ayounsi: reimage: add MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360
[08:36:09] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2046.codfw.wmnet
[08:36:11] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10946159 (10Fabfur) Hello, the user has been added to the "nda" ldap group, can you please try and  confirm you can now access the needed resources?
[08:36:38] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1047.eqiad.wmnet
[08:38:15] <wikibugs>	 (03CR) 10Joely Rooke WMDE: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE)
[08:40:55] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.72
[08:40:58] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.73
[08:41:51] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2046.codfw.wmnet
[08:42:12] <wikibugs>	 (03PS1) 10Joely Rooke WMDE: Revert^2 "Activate feature to resolve wikibase link labels in pilot wiki changelists" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163704
[08:42:15] <logmsgbot>	 !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163696|Check if details marker is set before accessing it (T397760)]] (duration: 09m 51s)
[08:42:18] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1047.eqiad.wmnet
[08:42:20] <stashbot>	 T397760: PHP Warning: Undefined array key "details" - https://phabricator.wikimedia.org/T397760
[08:44:07] <wikibugs>	 (03CR) 10Joely Rooke WMDE: "Scheduling this for backport in afternoon window of Thursday, 26th June 2025, after all groups have been pushed to 1.45.0-wmf.7 (contains " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163704 (owner: 10Joely Rooke WMDE)
[08:44:43] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163704 (owner: 10Joely Rooke WMDE)
[08:49:40] <wikibugs>	 (03CR) 10FNegri: [C:03+1] p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) (owner: 10David Caro)
[08:51:20] <wikibugs>	 (03PS6) 10Ayounsi: Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314
[08:52:05] <wikibugs>	 (03CR) 10David Caro: [V:03+1 C:03+2] p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) (owner: 10David Caro)
[08:53:11] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad
[08:54:23] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.73
[08:54:26] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.74
[08:55:27] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mobileapps: remove memory limit for canary release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163702 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[08:57:39] <logmsgbot>	 !log stevemunene@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1176.eqiad.wmnet with OS bullseye
[08:57:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946260 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1...
[08:59:58] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[09:01:00] <wikibugs>	 (03PS6) 10Cathal Mooney: Netbox hosts: ensure reposync repos are set up to match cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985)
[09:01:20] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mobileapps: remove memory limit for canary release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163702 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[09:03:04] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: remove memory limit for canary release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163702 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[09:03:40] <wikibugs>	 (03CR) 10Volans: [C:03+2] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi)
[09:03:45] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1046.eqiad.wmnet
[09:04:38] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[09:04:52] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2047.codfw.wmnet
[09:05:48] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[09:06:04] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[09:06:50] <wikibugs>	 (03CR) 10Cathal Mooney: "An alias didn't do the trick, it would just pick the empty array from hieradata/common/profile/spicerack/reposync.yaml.  No great options " [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[09:09:11] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.74
[09:09:14] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.75
[09:09:39] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1046.eqiad.wmnet
[09:10:35] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2047.codfw.wmnet
[09:12:57] <wikibugs>	 (03Merged) 10jenkins-bot: Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi)
[09:15:05] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "lgtm!" [alerts] - 10https://gerrit.wikimedia.org/r/1163698 (owner: 10Vgutierrez)
[09:16:37] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] ATSBackendErrorsHigh: Report the impacted site on summary [alerts] - 10https://gerrit.wikimedia.org/r/1163698 (owner: 10Vgutierrez)
[09:19:45] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[09:22:14] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.75
[09:22:17] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.76
[09:24:16] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye
[09:24:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946333 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cu...
[09:25:27] <wikibugs>	 (03PS1) 10Elukey: admin_ng: disable tag->sha256-digest resolution for knative on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163711 (https://phabricator.wikimedia.org/T397696)
[09:25:29] <wikibugs>	 (03PS1) 10Elukey: admin_ng: disable tag->sha256 for all ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163712 (https://phabricator.wikimedia.org/T397696)
[09:25:30] <wikibugs>	 (03PS1) 10Elukey: aux/dse: remove the usage of sha256 digest image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163713 (https://phabricator.wikimedia.org/T397696)
[09:25:39] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v11.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163714
[09:26:08] <wikibugs>	 (03CR) 10Jelto: [C:03+1] gerrit: read-only plugin orchestration in failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: 10Arnaudb)
[09:27:16] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: read-only plugin orchestration in failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: 10Arnaudb)
[09:27:50] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, thx" [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[09:28:38] <wikibugs>	 (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v11.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163714 (owner: 10Volans)
[09:28:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:29:45] <jinxer-wm>	 RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[09:30:07] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Netbox hosts: ensure reposync repos are set up to match cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[09:32:31] <wikibugs>	 (03PS2) 10Slyngshede: P:dns::auth::netbox  Netbox DNS zones file sync [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985)
[09:33:31] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] cleanup prerm script update-alternatives command [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1163695 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto)
[09:33:37] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6062/co" [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede)
[09:33:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:34:26] <elukey>	 14
[09:34:28] <elukey>	 iff
[09:34:42] <elukey>	 today is not my day
[09:34:45] <Amir1>	 jouncebot: nowandnext
[09:34:45] <jouncebot>	 For the next 0 hour(s) and 25 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T0800)
[09:34:45] <jouncebot>	 In 0 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1000)
[09:34:57] <codders>	 if 14 was a password, probably time to change it anyway :)
[09:34:58] * Amir1 gives coffee to elukey <3
[09:35:27] <elukey>	 <3
[09:35:44] <wikibugs>	 (03PS1) 10Vgutierrez: liberica: Don't start liberica-cp on system boot [puppet] - 10https://gerrit.wikimedia.org/r/1163715 (https://phabricator.wikimedia.org/T396398)
[09:35:50] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Clean up EventBus and jobs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163323 (https://phabricator.wikimedia.org/T397367) (owner: 10Ladsgroup)
[09:36:14] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163715 (https://phabricator.wikimedia.org/T396398) (owner: 10Vgutierrez)
[09:36:16] <wikibugs>	 (03CR) 10CI reject: [V:04-1] liberica: Don't start liberica-cp on system boot [puppet] - 10https://gerrit.wikimedia.org/r/1163715 (https://phabricator.wikimedia.org/T396398) (owner: 10Vgutierrez)
[09:36:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163323 (https://phabricator.wikimedia.org/T397367) (owner: 10Ladsgroup)
[09:36:37] <wikibugs>	 (03Merged) 10jenkins-bot: Clean up EventBus and jobs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163323 (https://phabricator.wikimedia.org/T397367) (owner: 10Ladsgroup)
[09:36:56] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox-records Generate and push DNS records from Netbox data
[09:37:01] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1163323|Clean up EventBus and jobs config (T397367)]]
[09:37:04] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox-records (exit_code=0) Generate and push DNS records from Netbox data
[09:37:06] <stashbot>	 T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367
[09:37:30] <wikibugs>	 (03PS2) 10Vgutierrez: liberica: Don't start liberica-cp on system boot [puppet] - 10https://gerrit.wikimedia.org/r/1163715 (https://phabricator.wikimedia.org/T396398)
[09:37:46] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.76
[09:37:48] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.77
[09:38:15] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v11.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163714 (owner: 10Volans)
[09:39:09] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1163323|Clean up EventBus and jobs config (T397367)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:39:56] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[09:40:42] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163715 (https://phabricator.wikimedia.org/T396398) (owner: 10Vgutierrez)
[09:41:07] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[09:42:07] <wikibugs>	 (03PS3) 10Slyngshede: P:dns::auth::netbox  Netbox DNS zones file sync [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985)
[09:42:20] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[09:43:02] <wikibugs>	 (03CR) 10Klausman: [C:03+1] admin_ng: disable tag->sha256 for all ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163712 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[09:43:28] <wikibugs>	 (03CR) 10Klausman: [C:03+1] admin_ng: disable tag->sha256-digest resolution for knative on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163711 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[09:44:10] <wikibugs>	 (03CR) 10Cathal Mooney: "LGTM, some small nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede)
[09:44:48] <logmsgbot>	 !log stevemunene@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1176.eqiad.wmnet with OS bullseye
[09:45:01] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946376 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1...
[09:45:10] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye
[09:45:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cu...
[09:46:37] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163323|Clean up EventBus and jobs config (T397367)]] (duration: 09m 36s)
[09:46:43] <stashbot>	 T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367
[09:49:14] <wikibugs>	 (03PS4) 10Slyngshede: P:dns::auth::netbox  Netbox DNS zones file sync [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985)
[09:51:03] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.77
[09:51:05] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.78
[09:51:16] <wikibugs>	 (03PS5) 10Slyngshede: P:dns::auth::netbox  Netbox DNS zones file sync [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985)
[09:52:05] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6064/co" [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede)
[09:53:38] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2008.codfw.wmnet with reason: Maintenance and reboot
[09:54:39] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede)
[09:54:43] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:55:48] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6065/console" [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede)
[09:56:16] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1 C:03+2] P:dns::auth::netbox  Netbox DNS zones file sync [puppet] - 10https://gerrit.wikimedia.org/r/1163690 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede)
[09:57:43] <wikibugs>	 (03PS1) 10Volans: Upstream release v11.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1163716
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1000)
[10:00:06] <wikibugs>	 (03CR) 10Volans: [C:03+2] Upstream release v11.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1163716 (owner: 10Volans)
[10:01:48] <wikibugs>	 (03PS1) 10Slyngshede: P:dns::auth::netbox_dns_records fix branch name [puppet] - 10https://gerrit.wikimedia.org/r/1163717 (https://phabricator.wikimedia.org/T362985)
[10:02:37] <icinga-wm>	 ACKNOWLEDGEMENT - Backup freshness on backup1014 is CRITICAL: All failures: 2 (backup1013, ...), Fresh: 140 jobs Jcrespo ongoing backups, expected - The acknowledgement expires at: 2025-06-27 10:02:16. https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[10:04:12] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] P:dns::auth::netbox_dns_records fix branch name [puppet] - 10https://gerrit.wikimedia.org/r/1163717 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede)
[10:04:48] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.78
[10:04:50] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.79
[10:05:18] <logmsgbot>	 !log stevemunene@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1176.eqiad.wmnet with OS bullseye
[10:05:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1...
[10:07:36] <wikibugs>	 (03PS1) 10Jakob: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163719
[10:09:38] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] hdfs: readd group 9 and 10 hosts back to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1163691 (https://phabricator.wikimedia.org/T390176) (owner: 10Stevemunene)
[10:10:41] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v11.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1163716 (owner: 10Volans)
[10:11:02] <Amir1>	 the deployment is stuck in syncing to apaches (bare metals)
[10:11:09] <wikibugs>	 (03PS1) 10Filippo Giunchedi: tox: add python3.13 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163723
[10:11:16] <Amir1>	 for 35 minutes now
[10:11:43] <claime>	 Amir1: huh
[10:11:55] <Amir1>	 I think it might be actually my connection
[10:11:57] <Amir1>	 one second
[10:12:23] <Amir1>	 yup, my connection dropped and it wasn't moving forward, I reconnected and screen says it's finished
[10:12:29] <Amir1>	 sigh, sorry for the false alarm
[10:12:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] icinga: Add frban1002 to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1163430 (https://phabricator.wikimedia.org/T395951) (owner: 10Dwisehaupt)
[10:12:36] <claime>	 :D
[10:13:33] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:13:46] <Amir1>	 !log dropping table job in group0 (T397367)
[10:13:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:50] <stashbot>	 T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367
[10:14:56] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1154.eqiad.wmnet
[10:15:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946573 (10ops-monitoring-bot) Host an-worker1154.eqiad.wmnet rebooted by stevemunene@cumin1002 w...
[10:15:49] <wikibugs>	 (03PS2) 10Filippo Giunchedi: tox: add python3.13 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163723 (https://phabricator.wikimedia.org/T395449)
[10:16:40] <hnowlan>	 jouncebot: nowandnext
[10:16:40] <jouncebot>	 For the next 0 hour(s) and 43 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1000)
[10:16:40] <jouncebot>	 In 0 hour(s) and 43 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1100)
[10:16:51] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[10:17:03] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[10:18:11] <icinga-wm>	 PROBLEM - Host wikikube-worker1069 is DOWN: PING CRITICAL - Packet loss = 100%
[10:19:02] <logmsgbot>	 !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=1) Checking container DBs of wikipedia-commons-local-thumb.79
[10:19:05] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.7a
[10:19:16] <Lucas_WMDE>	 jouncebot: nowandnext
[10:19:16] <jouncebot>	 For the next 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1000)
[10:19:16] <jouncebot>	 In 0 hour(s) and 40 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1100)
[10:19:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker1069.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1069.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:19:37] <Lucas_WMDE>	 I wouldn’t mind doing a Wikibase backport if that’s okay with everyone else (esp. hnowlan ig ^^)
[10:20:21] <wikibugs>	 (CR) Suzanne Wood: [C:+1] Revert^2 "Activate feature to resolve wikibase link labels in pilot wiki changelists" [mediawiki-config] - https://gerrit.wikimedia.org/r/1163704 (owner: Joely Rooke WMDE)
[10:21:18] <hnowlan>	 Lucas_WMDE: no objections from me 
[10:21:26] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Clicking the search button goes to Special:Search [extensions/Wikibase] (wmf/1.45.0-wmf.7) - https://gerrit.wikimedia.org/r/1163727 (https://phabricator.wikimedia.org/T397506)
[10:21:27] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[10:21:35] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Clicking the search button goes to Special:Search [extensions/Wikibase] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1163728 (https://phabricator.wikimedia.org/T397506)
[10:21:50] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1154.eqiad.wmnet
[10:21:54] <Lucas_WMDE>	 alright, I’ll backport ^ those two in a few minutes if I don’t hear any objections :)
[10:22:03] <wikibugs>	 (PS1) Majavah: P:toolforge::prometheus: Add scrape rules for Loki/Alloy [puppet] - https://gerrit.wikimedia.org/r/1163729 (https://phabricator.wikimedia.org/T386480)
[10:22:17] <Lucas_WMDE>	 (actually, on second thought, I’ll just start the backport now. that still leaves like at least 10 minutes for someone to object during the gate-and-submit build anyway :D
[10:22:18] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1149.eqiad.wmnet
[10:22:19] <Lucas_WMDE>	 )
[10:22:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db1252 weight to 300 - see T385141', diff saved to https://phabricator.wikimedia.org/P78677 and previous config saved to /var/cache/conftool/dbconfig/20250625-102225-fceratto.json
[10:22:29] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946593 (ops-monitoring-bot) Host an-worker1149.eqiad.wmnet rebooted by st...
[10:22:31] <stashbot>	 T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141
[10:23:00] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.45.0-wmf.7) - https://gerrit.wikimedia.org/r/1163727 (https://phabricator.wikimedia.org/T397506) (owner: Lucas Werkmeister (WMDE))
[10:23:00] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/Wikibase] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1163728 (https://phabricator.wikimedia.org/T397506) (owner: Lucas Werkmeister (WMDE))
[10:23:23] <Lucas_WMDE>	 hmm, https://spiderpig.wikimedia.org/jobs/249 didn’t show me the Yes/No buttons to confirm until I reloaded the page
[10:23:32] <Lucas_WMDE>	 let’s see if it happens again or if it was just a hiccup
[10:23:54] <Lucas_WMDE>	 (I could see the “Backport the changes?” prompt in the terminal but the interactive part at the top of the page was missing)
[10:25:22] <wikibugs>	 (CR) CI reject: [V:-1] tox: add python3.13 [software/spicerack] - https://gerrit.wikimedia.org/r/1163723 (https://phabricator.wikimedia.org/T395449) (owner: Filippo Giunchedi)
[10:26:12] <wikibugs>	 (PS2) Majavah: P:toolforge::prometheus: Add scrape rules for Loki/Alloy [puppet] - https://gerrit.wikimedia.org/r/1163729 (https://phabricator.wikimedia.org/T386480)
[10:27:34] <wikibugs>	 (CR) Btullis: [C:-1] "We discussed this in #wikimedia-k8s-sig and on the dse side at least, we're not comfortable with this change. The use of checksums is to m" [deployment-charts] - https://gerrit.wikimedia.org/r/1163713 (https://phabricator.wikimedia.org/T397696) (owner: Elukey)
[10:28:11] <wikibugs>	 (PS1) Filippo Giunchedi: icinga: add _status for type annotations [software/spicerack] - https://gerrit.wikimedia.org/r/1163732 (https://phabricator.wikimedia.org/T395449)
[10:30:51] <volans>	 !log uploaded spicerack_11.1.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia
[10:30:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:29] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1149.eqiad.wmnet
[10:32:36] <wikibugs>	 (PS2) Clément Goubert: P::mediawiki::maintenance: rsync to deployment server [puppet] - https://gerrit.wikimedia.org/r/1163731 (https://phabricator.wikimedia.org/T397017)
[10:32:40] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1163731 (https://phabricator.wikimedia.org/T397017) (owner: Clément Goubert)
[10:32:53] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.7a
[10:32:55] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.7b
[10:34:39] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1150.eqiad.wmnet
[10:34:51] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946631 (ops-monitoring-bot) Host an-worker1150.eqiad.wmnet rebooted by stevemunene@cumin1002 wi...
[10:37:10] <xSavitar>	 !log Ran fixStuckGlobalRename.php for T397807
[10:37:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:16] <stashbot>	 T397807: Unblock stuck global rename of ReadMore - https://phabricator.wikimedia.org/T397807
[10:38:35] <wikibugs>	 (CR) CI reject: [V:-1] icinga: add _status for type annotations [software/spicerack] - https://gerrit.wikimedia.org/r/1163732 (https://phabricator.wikimedia.org/T395449) (owner: Filippo Giunchedi)
[10:39:10] <wikibugs>	 (CR) Jelto: [C:+2] cleanup prerm script update-alternatives command [debs/kubernetes] (v1.31) - https://gerrit.wikimedia.org/r/1163695 (https://phabricator.wikimedia.org/T387548) (owner: Jelto)
[10:39:58] <wikibugs>	 (Merged) jenkins-bot: Clicking the search button goes to Special:Search [extensions/Wikibase] (wmf/1.45.0-wmf.7) - https://gerrit.wikimedia.org/r/1163727 (https://phabricator.wikimedia.org/T397506) (owner: Lucas Werkmeister (WMDE))
[10:40:00] <wikibugs>	 (Merged) jenkins-bot: Clicking the search button goes to Special:Search [extensions/Wikibase] (wmf/1.45.0-wmf.6) - https://gerrit.wikimedia.org/r/1163728 (https://phabricator.wikimedia.org/T397506) (owner: Lucas Werkmeister (WMDE))
[10:40:30] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1163727|Clicking the search button goes to Special:Search (T397506)]], [[gerrit:1163728|Clicking the search button goes to Special:Search (T397506)]]
[10:40:36] <stashbot>	 T397506: ScopedTypeaheadSearch - clicking the search button redirects to the main page - https://phabricator.wikimedia.org/T397506
[10:41:31] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1150.eqiad.wmnet
[10:41:50] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1151.eqiad.wmnet
[10:42:06] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946643 (ops-monitoring-bot) Host an-worker1151.eqiad.wmnet rebooted by stevemunene@cumin1002 wi...
[10:42:40] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1163727|Clicking the search button goes to Special:Search (T397506)]], [[gerrit:1163728|Clicking the search button goes to Special:Search (T397506)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[10:43:39] <Lucas_WMDE>	 works \o/
[10:43:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync
[10:43:51] <Lucas_WMDE>	 and SpiderPig showed me the prompt correctly as well
[10:44:42] <wikibugs>	 (CR) Hnowlan: [C:+1] P::mediawiki::maintenance: rsync to deployment server [puppet] - https://gerrit.wikimedia.org/r/1163731 (https://phabricator.wikimedia.org/T397017) (owner: Clément Goubert)
[10:44:52] <wikibugs>	 (CR) Clément Goubert: [C:+2] P::mediawiki::maintenance: rsync to deployment server [puppet] - https://gerrit.wikimedia.org/r/1163731 (https://phabricator.wikimedia.org/T397017) (owner: Clément Goubert)
[10:46:18] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.7b
[10:46:21] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.7c
[10:47:25] <jelto>	 !log import kubernetes 1.31.4-6 to apt host - T387548
[10:47:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:30] <stashbot>	 T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548
[10:47:44] <wikibugs>	 (PS1) Tchanders: WIP temp accounts: Enable temp account creation on further wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738
[10:48:28] <wikibugs>	 (CR) Tchanders: [C:-2] "Date and set of wikis to be confirmed. Needs comms approval." [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738 (owner: Tchanders)
[10:48:35] <wikibugs>	 (CR) CI reject: [V:-1] WIP temp accounts: Enable temp account creation on further wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1163738 (owner: Tchanders)
[10:49:04] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1151.eqiad.wmnet
[10:51:20] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup2008.codfw.wmnet: Renew puppet certificate - root@cumin1002
[10:52:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163727|Clicking the search button goes to Special:Search (T397506)]], [[gerrit:1163728|Clicking the search button goes to Special:Search (T397506)]] (duration: 11m 52s)
[10:52:28] <stashbot>	 T397506: ScopedTypeaheadSearch - clicking the search button redirects to the main page - https://phabricator.wikimedia.org/T397506
[10:52:48] <wikibugs>	 (PS1) Klausman: hiera/k8s: Add missing :prod suffix to machinetranslation S3 credentials [labs/private] - https://gerrit.wikimedia.org/r/1163739 (https://phabricator.wikimedia.org/T335491)
[10:52:49] * Lucas_WMDE done deploying
[10:53:08] <wikibugs>	 (CR) Klausman: [V:+2 C:+2] hiera/k8s: Add missing :prod suffix to machinetranslation S3 credentials [labs/private] - https://gerrit.wikimedia.org/r/1163739 (https://phabricator.wikimedia.org/T335491) (owner: Klausman)
[10:53:44] <wikibugs>	 (PS2) Vgutierrez: hiera: Unify edge uniques settings [puppet] - https://gerrit.wikimedia.org/r/1151711 (https://phabricator.wikimedia.org/T391411)
[10:56:04] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1151711 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[11:00:05] <jouncebot>	 mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1100)
[11:00:55] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.7c
[11:00:58] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.7d
[11:05:04] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: Unify edge uniques settings [puppet] - https://gerrit.wikimedia.org/r/1151711 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[11:05:56] <wikibugs>	 (PS1) Muehlenhoff: Depend on libjs-bootstrap [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696)
[11:08:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.327s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:08:38] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply
[11:08:58] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply
[11:09:42] <wikibugs>	 (PS1) Hnowlan: mobileapps: add num_worker param, default setting to 0 [deployment-charts] - https://gerrit.wikimedia.org/r/1163743 (https://phabricator.wikimedia.org/T397750)
[11:13:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.327s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:14:34] <wikibugs>	 (CR) Clément Goubert: [C:+1] mobileapps: add num_worker param, default setting to 0 [deployment-charts] - https://gerrit.wikimedia.org/r/1163743 (https://phabricator.wikimedia.org/T397750) (owner: Hnowlan)
[11:14:40] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.7d
[11:14:42] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.7e
[11:16:17] <wikibugs>	 (CR) Hnowlan: [C:+2] mobileapps: add num_worker param, default setting to 0 [deployment-charts] - https://gerrit.wikimedia.org/r/1163743 (https://phabricator.wikimedia.org/T397750) (owner: Hnowlan)
[11:17:56] <wikibugs>	 (Merged) jenkins-bot: mobileapps: add num_worker param, default setting to 0 [deployment-charts] - https://gerrit.wikimedia.org/r/1163743 (https://phabricator.wikimedia.org/T397750) (owner: Hnowlan)
[11:20:25] <wikibugs>	 (PS1) Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696)
[11:20:44] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[11:20:47] <wikibugs>	 (PS2) Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696)
[11:21:04] <logmsgbot>	 !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-worker-eqiad
[11:21:11] <wikibugs>	 (PS1) Clément Goubert: mw-parsoid: Scale down replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1163746
[11:21:16] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[11:21:53] <logmsgbot>	 !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply
[11:22:23] <logmsgbot>	 !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[11:23:37] <claime>	 !log Manual powercycle of wikikube-worker1069.eqiad.wmnet
[11:23:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:14] <wikibugs>	 (PS1) Vgutierrez: cache::haproxy: Simplify cert configuration [puppet] - https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484)
[11:24:47] <logmsgbot>	 !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply
[11:25:14] <logmsgbot>	 !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[11:26:45] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: Vgutierrez)
[11:28:40] <wikibugs>	 ops-eqiad, DC-Ops, serviceops: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829 (Clement_Goubert) NEW
[11:29:07] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox
[11:29:33] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.7e
[11:29:36] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.7f
[11:30:20] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[11:30:30] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[11:31:45] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:35:28] <claime>	 !log homer "cr*eqiad*" commit 'wikikube-worker1069 failed' - T397829
[11:35:33] <wikibugs>	 (PS27) Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - https://gerrit.wikimedia.org/r/1163318
[11:35:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:34] <stashbot>	 T397829: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829
[11:35:56] <wikibugs>	 (PS2) Klausman: services/machinetranslation: add network policy to allow access to Thanos/Swift S3 [deployment-charts] - https://gerrit.wikimedia.org/r/1162900
[11:37:32] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.remove-downtime for 14 hosts
[11:37:38] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 14 hosts
[11:37:51] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.remove-downtime for 15 hosts
[11:37:57] <logmsgbot>	 !log root@cumin1002 DONE (ERROR) - Cookbook sre.puppet.renew-cert (exit_code=97) for backup1008.eqiad.wmnet: Renew puppet certificate - root@cumin1002
[11:37:58] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 15 hosts
[11:38:24] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1008.eqiad.wmnet with reason: Maintenance and reboot
[11:38:38] <logmsgbot>	 !log cgoubert@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wikikube-worker1069.eqiad.wmnet with reason: hw failure
[11:40:08] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1061-1068,1070-1075].eqiad.wmnet
[11:40:11] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1061-1068,1070-1075].eqiad.wmnet
[11:40:45] <wikibugs>	 (PS3) Filippo Giunchedi: tox: add python3.13 [software/spicerack] - https://gerrit.wikimedia.org/r/1163723 (https://phabricator.wikimedia.org/T395449)
[11:40:45] <wikibugs>	 (PS2) Filippo Giunchedi: icinga: add _status for type annotations [software/spicerack] - https://gerrit.wikimedia.org/r/1163732 (https://phabricator.wikimedia.org/T395449)
[11:41:52] <wikibugs>	 (CR) Klausman: [C:+2] services/machinetranslation: add network policy to allow access to Thanos/Swift S3 [deployment-charts] - https://gerrit.wikimedia.org/r/1162900 (owner: Klausman)
[11:42:02] <wikibugs>	 (CR) CI reject: [V:-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - https://gerrit.wikimedia.org/r/1163318 (owner: Cathal Mooney)
[11:43:26] <wikibugs>	 (Merged) jenkins-bot: services/machinetranslation: add network policy to allow access to Thanos/Swift S3 [deployment-charts] - https://gerrit.wikimedia.org/r/1162900 (owner: Klausman)
[11:43:42] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1076-1168,1240-1289,1291-1327].eqiad.wmnet,wikikube-worker-exp1001.eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[11:44:22] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.7f
[11:44:25] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.80
[11:44:53] <logmsgbot>	 !log klausman@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[11:44:58] <logmsgbot>	 !log klausman@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[11:45:12] <logmsgbot>	 !log klausman@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[11:45:27] <logmsgbot>	 !log klausman@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[11:45:41] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:46:20] <logmsgbot>	 !log klausman@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[11:46:31] <wikibugs>	 SRE, SRE-Access-Requests: Update katelevan's ssh key - https://phabricator.wikimedia.org/T397832 (Nahid) NEW
[11:46:34] <logmsgbot>	 !log klausman@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[11:47:43] <wikibugs>	 (CR) Jcrespo: "It looks like a really bad idea to hardcode the events for the query killer on the code." [cookbooks] - https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[11:48:08] <wikibugs>	 SRE, SRE-Access-Requests: Update katelevan's ssh key - https://phabricator.wikimedia.org/T397832#10946799 (KLevan) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDsriWHRqsnwYuPFQvTiXHa1KNwrFYvRRnq1QQpEkpdmCxBbq+EQTKL4S9oTi8XjjCyDVt1lwswPQUTe2iBgMWrmGL3Ez+b9G1RY4MWWTw1IWP0ExSsOEQDZK8hzYbKA82eNpfW7N+jY8qv3WyPuVG6q4...
[11:48:33] <wikibugs>	 (CR) Dima koushha: [C:+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - https://gerrit.wikimedia.org/r/1163719 (owner: Jakob)
[11:48:50] <wikibugs>	 SRE, SRE-Access-Requests: Update katelevan's ssh key - https://phabricator.wikimedia.org/T397832#10946802 (Nahid)
[11:48:52] <logmsgbot>	 !log klausman@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[11:48:59] <logmsgbot>	 !log klausman@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[11:49:05] <logmsgbot>	 !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[11:49:25] <logmsgbot>	 !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[11:49:29] <logmsgbot>	 !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[11:50:39] <wikibugs>	 (CR) Jakob: [C:+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - https://gerrit.wikimedia.org/r/1163719 (owner: Jakob)
[11:51:20] <wikibugs>	 (CR) Jcrespo: Add switchover cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[11:51:37] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[11:52:21] <wikibugs>	 (Merged) jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - https://gerrit.wikimedia.org/r/1163719 (owner: Jakob)
[11:52:44] <wikibugs>	 (PS28) Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - https://gerrit.wikimedia.org/r/1163318
[11:52:45] <wikibugs>	 (CR) Jcrespo: [C:-1] Add switchover cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[11:53:36] <logmsgbot>	 !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[11:53:49] <logmsgbot>	 !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[11:54:07] <logmsgbot>	 !log jakob@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[11:54:10] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:54:23] <logmsgbot>	 !log jakob@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[11:54:30] <wikibugs>	 (CR) Jcrespo: [C:-1] "As I said on IRC, much of this should go into the battle-tested db-switchover. Then the cookbook can handle the different steps separatell" [cookbooks] - https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[11:54:41] <logmsgbot>	 !log jakob@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[11:54:57] <logmsgbot>	 !log jakob@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[11:55:37] <wikibugs>	 (Abandoned) Federico Ceratto: Add switchover cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[11:56:26] <wikibugs>	 (PS2) Muehlenhoff: Depend on libjs-bootstrap [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696)
[11:56:39] <wikibugs>	 (PS3) Muehlenhoff: Depend on libjs-bootstrap4 [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696)
[11:57:32] <wikibugs>	 (PS29) Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - https://gerrit.wikimedia.org/r/1163318
[11:57:53] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.80
[11:57:56] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.81
[11:58:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:59:26] <wikibugs>	 (CR) Hnowlan: [C:+1] mw-parsoid: Scale down replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1163746 (owner: Clément Goubert)
[12:01:28] <wikibugs>	 (CR) Clément Goubert: [C:+2] mw-parsoid: Scale down replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1163746 (owner: Clément Goubert)
[12:03:09] <wikibugs>	 (Merged) jenkins-bot: mw-parsoid: Scale down replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1163746 (owner: Clément Goubert)
[12:03:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:03:46] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[12:03:51] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[12:03:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:03:59] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[12:04:03] <wikibugs>	 (CR) CI reject: [V:-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - https://gerrit.wikimedia.org/r/1163318 (owner: Cathal Mooney)
[12:04:19] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[12:04:21] <wikibugs>	 (CR) Federico Ceratto: "I'm summarizing here the discussion on irc with Jaime and Amir on wed 25 jun:" [cookbooks] - https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[12:04:26] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[12:04:35] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[12:07:46] <wikibugs>	 (Restored) Federico Ceratto: Add switchover cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[12:11:03] <wikibugs>	 SRE-tools, Spicerack: Flaky icinga unit tests - https://phabricator.wikimedia.org/T397833 (fgiunchedi) NEW
[12:11:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.696s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:11:33] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack: Flaky spicerack icinga unit tests - https://phabricator.wikimedia.org/T397833#10946842 (fgiunchedi)
[12:13:29] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[12:14:10] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1080:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:14:13] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.81
[12:14:16] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.82
[12:16:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.696s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:22:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.188s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:24:10] <jinxer-wm>	 RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1092:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:24:26] <wikibugs>	 (03PS2) 10Ladsgroup: tables-catalog: add PageAssessments [puppet] - 10https://gerrit.wikimedia.org/r/1161578 (https://phabricator.wikimedia.org/T393792) (owner: 10MusikAnimal)
[12:24:31] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1152.eqiad.wmnet
[12:24:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946860 (10ops-monitoring-bot) Host an-worker1152.eqiad.wmnet rebooted by stevemunene@cumin1002 wi...
[12:24:50] <wikibugs>	 (03CR) 10Ladsgroup: tables-catalog: add PageAssessments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1161578 (https://phabricator.wikimedia.org/T393792) (owner: 10MusikAnimal)
[12:26:31] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:27:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.188s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:28:31] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:28:45] <wikibugs>	 (03PS3) 10Ladsgroup: tables-catalog: add PageAssessments [puppet] - 10https://gerrit.wikimedia.org/r/1161578 (https://phabricator.wikimedia.org/T393792) (owner: 10MusikAnimal)
[12:28:47] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] tables-catalog: add PageAssessments [puppet] - 10https://gerrit.wikimedia.org/r/1161578 (https://phabricator.wikimedia.org/T393792) (owner: 10MusikAnimal)
[12:28:49] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: add PageAssessments [puppet] - 10https://gerrit.wikimedia.org/r/1161578 (https://phabricator.wikimedia.org/T393792) (owner: 10MusikAnimal)
[12:29:30] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.82
[12:29:33] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.83
[12:29:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:30:34] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[12:30:52] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[12:31:53] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1152.eqiad.wmnet
[12:32:15] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1153.eqiad.wmnet
[12:32:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946882 (10ops-monitoring-bot) Host an-worker1153.eqiad.wmnet rebooted by stevemunene@cumin1002 wi...
[12:34:17] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10946886 (10Volans) No way, it doesn't work yet, but I need to understand why: `lang=python >>> import xml.etree.ElementTree as ET >>> from xml.dom import minidom >>> sc...
[12:34:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:37:40] <wikibugs>	 (03PS30) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318
[12:38:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:39:06] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1153.eqiad.wmnet
[12:39:31] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1175.eqiad.wmnet
[12:39:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946892 (10ops-monitoring-bot) Host an-worker1175.eqiad.wmnet rebooted by stevemunene@cumin1002 wi...
[12:40:04] <wikibugs>	 (03PS31) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318
[12:42:25] <wikibugs>	 (03PS23) 10Arnaudb: gerrit: lock, preflight checks, hieradata lookups, verbosity [cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034)
[12:42:26] <wikibugs>	 (03PS8) 10Arnaudb: gerrit: git backup tree consistency checker [cookbooks] - 10https://gerrit.wikimedia.org/r/1144565 (https://phabricator.wikimedia.org/T393034)
[12:42:27] <wikibugs>	 (03PS6) 10Arnaudb: gerrit: grepping for misconfigurations [cookbooks] - 10https://gerrit.wikimedia.org/r/1143102 (https://phabricator.wikimedia.org/T393034)
[12:42:28] <wikibugs>	 (03PS8) 10Arnaudb: gerrit: rsync --checksum local backup safety net [cookbooks] - 10https://gerrit.wikimedia.org/r/1142793 (https://phabricator.wikimedia.org/T393034)
[12:42:40] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: lock, preflight checks, hieradata lookups, verbosity [cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb)
[12:42:50] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: git backup tree consistency checker [cookbooks] - 10https://gerrit.wikimedia.org/r/1144565 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb)
[12:42:51] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: grepping for misconfigurations [cookbooks] - 10https://gerrit.wikimedia.org/r/1143102 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb)
[12:42:52] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: rsync --checksum local backup safety net [cookbooks] - 10https://gerrit.wikimedia.org/r/1142793 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb)
[12:43:04] <wikibugs>	 (03PS10) 10Arnaudb: gerrit: probe DNS on both hosts before doing stuff [cookbooks] - 10https://gerrit.wikimedia.org/r/1141862 (https://phabricator.wikimedia.org/T393034)
[12:43:05] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: probe DNS on both hosts before doing stuff [cookbooks] - 10https://gerrit.wikimedia.org/r/1141862 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb)
[12:43:06] <wikibugs>	 (03PS6) 10Ayounsi: reimage: add MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360
[12:43:38] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.83
[12:43:41] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.84
[12:43:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:45:01] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin_ng: disable tag->sha256-digest resolution for knative on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163711 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[12:46:39] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for backup1008.eqiad.wmnet: Renew puppet certificate - root@cumin1002
[12:47:01] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney)
[12:47:05] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1175.eqiad.wmnet
[12:47:47] <volans>	 jynus: if that was you (sre.puppet.renew-cert on backup1008) please try to avoid running cookbooks with double sudo ;) (yes I will make a patch for it at some point)
[12:49:09] <wikibugs>	 (03Merged) 10jenkins-bot: gerrit: probe DNS on both hosts before doing stuff [cookbooks] - 10https://gerrit.wikimedia.org/r/1141862 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb)
[12:49:29] <jynus>	 volans: indeed, sorry
[12:49:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946917 (10Stevemunene) The hosts have rejoined the cluster and the cluster is healthy {F62459029}...
[12:50:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946918 (10Stevemunene)
[12:50:35] <wikibugs>	 (03Merged) 10jenkins-bot: gerrit: rsync --checksum local backup safety net [cookbooks] - 10https://gerrit.wikimedia.org/r/1142793 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb)
[12:50:37] <wikibugs>	 (03Merged) 10jenkins-bot: gerrit: grepping for misconfigurations [cookbooks] - 10https://gerrit.wikimedia.org/r/1143102 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb)
[12:50:38] <wikibugs>	 (03Merged) 10jenkins-bot: gerrit: git backup tree consistency checker [cookbooks] - 10https://gerrit.wikimedia.org/r/1144565 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb)
[12:50:41] <wikibugs>	 (03Merged) 10jenkins-bot: gerrit: lock, preflight checks, hieradata lookups, verbosity [cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb)
[12:51:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946929 (10Stevemunene) `an-worker1149` was not upgraded as we did not have enough disks for the n...
[12:51:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946933 (10Stevemunene)
[12:51:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#10946934 (10Jhancock.wm)
[12:52:17] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10946935 (10Stevemunene) 05Open→03Resolved
[12:53:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10946939 (10Stevemunene) an-worker1154 is back in the cluster, still working on an-worker1176 T390...
[12:53:32] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#10946943 (10Jhancock.wm) I need to change the preseed.yaml file so that sretest2005, sretest2006, sretest2009, and sretest2010 (just to cover some other servers in one go) have the same partman as sret...
[12:56:00] <wikibugs>	 (03Merged) 10jenkins-bot: gerrit: read-only plugin orchestration in failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: 10Arnaudb)
[12:58:25] <wikibugs>	 (03PS1) 10Cathal Mooney: Authdns: add profile to role to clone new repo with netbox dns RRs [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985)
[12:58:54] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.84
[12:58:57] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.85
[12:59:15] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: calico default-deny: Switch other clusters to follow wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161535
[12:59:38] <wikibugs>	 (03PS2) 10Cathal Mooney: Authdns: add profile to role to clone new repo with netbox dns RRs [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985)
[12:59:47] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] calico default-deny: Switch other clusters to follow wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161535 (owner: 10Alexandros Kosiaris)
[13:00:05] <jouncebot>	 Urbanecm and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1300).
[13:00:05] <jouncebot>	 aude: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:01:56] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[13:06:19] <aude>	 I can do the backport
[13:06:47] <wikibugs>	 (03Merged) 10jenkins-bot: calico default-deny: Switch other clusters to follow wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161535 (owner: 10Alexandros Kosiaris)
[13:08:09] <wikibugs>	 (03PS2) 10Vgutierrez: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484)
[13:08:15] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T397164
[13:08:21] <stashbot>	 T397164: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T397164
[13:08:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2165 with weight 0 T397164', diff saved to https://phabricator.wikimedia.org/P78679 and previous config saved to /var/cache/conftool/dbconfig/20250625-130835-fceratto.json
[13:10:43] <wikibugs>	 (03PS1) 10Esanders: ArticleTarget: Avoid using chained promises with different return values [extensions/VisualEditor] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163768 (https://phabricator.wikimedia.org/T397818)
[13:11:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163469 (https://phabricator.wikimedia.org/T397755) (owner: 10Aude)
[13:12:24] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[13:12:33] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.85
[13:12:36] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.86
[13:12:48] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/VisualEditor] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163768 (https://phabricator.wikimedia.org/T397818) (owner: 10Esanders)
[13:18:50] <wikibugs>	 (03PS1) 10JMeybohm: pyrra::filesystem::slos::istio: Fix PromQL to work with istio 1.24 [puppet] - 10https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984)
[13:20:02] <wikibugs>	 (03Merged) 10jenkins-bot: Fix missing title on charts and add tests [extensions/Chart] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163469 (https://phabricator.wikimedia.org/T397755) (owner: 10Aude)
[13:20:05] <wikibugs>	 (03PS1) 10Federico Ceratto: Switchover s8 master (db2161 -> db2165) [puppet] - 10https://gerrit.wikimedia.org/r/1163770 (https://phabricator.wikimedia.org/T397164)
[13:20:05] <wikibugs>	 (03CR) 10Federico Ceratto: "s8 DC master flip as discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1163770 (https://phabricator.wikimedia.org/T397164) (owner: 10Federico Ceratto)
[13:20:14] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6066/co" [puppet] - 10https://gerrit.wikimedia.org/r/1163769 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[13:20:30] <logmsgbot>	 !log aude@deploy1003 Started scap sync-world: Backport for [[gerrit:1163469|Fix missing title on charts and add tests (T397755)]]
[13:20:35] <stashbot>	 T397755: Title is missing on charts (on beta cluster) - https://phabricator.wikimedia.org/T397755
[13:22:46] <logmsgbot>	 !log aude@deploy1003 aude: Backport for [[gerrit:1163469|Fix missing title on charts and add tests (T397755)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:22:59] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1163715 (https://phabricator.wikimedia.org/T396398) (owner: 10Vgutierrez)
[13:24:29] <logmsgbot>	 !log aude@deploy1003 aude: Continuing with sync
[13:24:44] <edsanders>	 I can self-deploy my backport next
[13:24:49] <aude>	 ok
[13:24:54] <aude>	 almost done with mine
[13:25:29] <edsanders>	 👍
[13:26:21] <swfrench-wmf>	 !log disabled puppet on 'P{O:configcluster}' hosts - T352245
[13:26:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:30] <stashbot>	 T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245
[13:26:31] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:26:51] <wikibugs>	 (03PS1) 10Hnowlan: mobileapps: set num_workers to 0, triple replicas in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163772 (https://phabricator.wikimedia.org/T397750)
[13:26:53] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French)
[13:26:56] <wikibugs>	 (03CR) 10Scott French: [C:03+2] P:etcd::tlsproxy: add support for PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French)
[13:27:41] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.86
[13:27:44] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.87
[13:28:00] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mobileapps: set num_workers to 0, triple replicas in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163772 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[13:28:31] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:29:50] <wikibugs>	 (03PS1) 10Jelto: gitlab: disable second sshd on test instance [puppet] - 10https://gerrit.wikimedia.org/r/1163774 (https://phabricator.wikimedia.org/T396622)
[13:30:37] <wikibugs>	 (03PS1) 10Genoveva Galarza: wikifunctions: Upgrade orchestrator from 2025-06-18-130945 to 2025-06-24-204920 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163775 (https://phabricator.wikimedia.org/T391208)
[13:30:48] <wikibugs>	 (03PS1) 10Stevemunene: hdfs: Add new worker hosts to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/1163777 (https://phabricator.wikimedia.org/T397615)
[13:30:50] <wikibugs>	 (03PS1) 10Stevemunene: hdfs: Assign the right role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1163778 (https://phabricator.wikimedia.org/T397615)
[13:31:26] <logmsgbot>	 !log aude@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163469|Fix missing title on charts and add tests (T397755)]] (duration: 10m 56s)
[13:31:32] <stashbot>	 T397755: Title is missing on charts (on beta cluster) - https://phabricator.wikimedia.org/T397755
[13:32:02] <aude>	 edsanders I'm done
[13:32:05] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6067/console" [puppet] - 10https://gerrit.wikimedia.org/r/1163774 (https://phabricator.wikimedia.org/T396622) (owner: 10Jelto)
[13:32:38] <wikibugs>	 (03PS3) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985)
[13:32:55] <wikibugs>	 (03Abandoned) 10Federico Ceratto: Switchover s8 master (db2161 -> db2165) [puppet] - 10https://gerrit.wikimedia.org/r/1163770 (https://phabricator.wikimedia.org/T397164) (owner: 10Federico Ceratto)
[13:33:45] <wikibugs>	 (03CR) 10Scott French: [C:03+2] hieradata: pilot cfssl/pki for nginx on conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/1090583 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French)
[13:34:34] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] gitlab: disable second sshd on test instance [puppet] - 10https://gerrit.wikimedia.org/r/1163774 (https://phabricator.wikimedia.org/T396622) (owner: 10Jelto)
[13:34:55] <wikibugs>	 (03PS4) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985)
[13:34:57] <swfrench-wmf>	 edsanders: would it be possible for you to wait ~ 5 minutes or so before starting your backport?
[13:38:14] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: disable second sshd on test instance [puppet] - 10https://gerrit.wikimedia.org/r/1163774 (https://phabricator.wikimedia.org/T396622) (owner: 10Jelto)
[13:38:21] <wikibugs>	 (03PS3) 10Vgutierrez: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484)
[13:39:30] <swfrench-wmf>	 !log migrated etcd tlsproxy to cfssl on conf2006 - T352245
[13:39:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: Deployment mobileapps-canary in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-canary - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:40:11] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2165 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1160101 (https://phabricator.wikimedia.org/T397164) (owner: 10Gerrit maintenance bot)
[13:40:32] <icinga-wm>	 PROBLEM - etcd tlsproxy SSL conf2006.codfw.wmnet:4001 on conf2006 is CRITICAL: SSL CRITICAL - Certificate etcd-v3.codfw.wmnet valid until 2025-07-23 13:33:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cergen
[13:40:59] <vgutierrez>	 that's swfrench-wmf working :D
[13:41:00] <federico3>	 jelto: are you doing a puppet merge?
[13:41:17] <jelto>	 federico: yes merge is in progress, one sec
[13:41:21] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[13:41:27] <hnowlan>	 mobileapps alert is me, will be fixed when safe
[13:41:32] <jelto>	 done
[13:41:41] <federico3>	 thanks
[13:41:42] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[13:41:52] <swfrench-wmf>	 vgutierrez: heh, yeah it seems to be a race between the new cert showing up and when the icinga check was updated for the new expiry :)
[13:42:28] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.87
[13:42:31] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.88
[13:43:48] <Amir1>	 I didn't get a page though
[13:45:07] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Alert when anycast-healthchecker withdraws BGP route - https://phabricator.wikimedia.org/T374619#10947176 (10ssingh) I am going to tackle this for the DNS hosts at least and then we can revisit a generic solution.
[13:45:15] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] "(Discussed on IRC with Amir and approved)" [puppet] - 10https://gerrit.wikimedia.org/r/1160101 (https://phabricator.wikimedia.org/T397164) (owner: 10Gerrit maintenance bot)
[13:46:31] <federico3>	 !log Starting s8 codfw failover from db2161 to db2165 - T397164
[13:46:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:37] <stashbot>	 T397164: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T397164
[13:47:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2165 to s8 primary T397164', diff saved to https://phabricator.wikimedia.org/P78681 and previous config saved to /var/cache/conftool/dbconfig/20250625-134758-fceratto.json
[13:49:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163768 (https://phabricator.wikimedia.org/T397818) (owner: 10Esanders)
[13:49:55] <swfrench-wmf>	 !log restarting confd in ulsfo - T352245
[13:50:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:00] <stashbot>	 T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245
[13:50:55] <wikibugs>	 (03PS1) 10Volans: Revert "redfish: add support for iDRAC 10" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163783
[13:52:26] <wikibugs>	 (03PS1) 10Genoveva Galarza: wikifunctions: Update evaluators from 2025-06-17-205547 to 2025-06-23-151702 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163784 (https://phabricator.wikimedia.org/T391208)
[13:54:43] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:54:48] <wikibugs>	 (03PS5) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985)
[13:55:17] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[13:55:24] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T395241)', diff saved to https://phabricator.wikimedia.org/P78682 and previous config saved to /var/cache/conftool/dbconfig/20250625-135523-fceratto.json
[13:56:52] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.88
[13:56:55] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.89
[13:58:55] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[13:59:03] <wikibugs>	 (03Merged) 10jenkins-bot: ArticleTarget: Avoid using chained promises with different return values [extensions/VisualEditor] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163768 (https://phabricator.wikimedia.org/T397818) (owner: 10Esanders)
[13:59:31] <logmsgbot>	 !log esanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1163768|ArticleTarget: Avoid using chained promises with different return values (T397818)]]
[13:59:34] <icinga-wm>	 RECOVERY - etcd tlsproxy SSL conf2006.codfw.wmnet:4001 on conf2006 is OK: SSL OK - Certificate etcd-v3.codfw.wmnet valid until 2025-07-23 13:33:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/PKI
[13:59:36] <stashbot>	 T397818: "Invalid response from server" when switching to VE source mode - https://phabricator.wikimedia.org/T397818
[13:59:41] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-06-17-205547 to 2025-06-23-151702 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163786 (https://phabricator.wikimedia.org/T391208)
[13:59:46] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-06-18-130945 to 2025-06-24-204920 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163787 (https://phabricator.wikimedia.org/T391208)
[14:00:04] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1400)
[14:00:36] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] k8s.wipe-cluster: Run puppet in batches of 50 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163401 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm)
[14:01:56] <wikibugs>	 (03PS6) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985)
[14:01:57] <logmsgbot>	 !log esanders@deploy1003 esanders: Backport for [[gerrit:1163768|ArticleTarget: Avoid using chained promises with different return values (T397818)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:02:00] <wikibugs>	 (03Abandoned) 10Volans: Revert "redfish: add support for iDRAC 10" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163783 (owner: 10Volans)
[14:02:50] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] sre.wipe-cluster: Ask user to confirm target k8s version [cookbooks] - 10https://gerrit.wikimedia.org/r/1163402 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm)
[14:02:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane error for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T397829#10947289 (10Jclark-ctr) Confirmed: Service Request 211933253
[14:03:31] <wikibugs>	 (03Abandoned) 10Jforrester: wikifunctions: Update evaluators from 2025-06-17-205547 to 2025-06-23-151702 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163786 (https://phabricator.wikimedia.org/T391208) (owner: 10Jforrester)
[14:03:35] <wikibugs>	 (03Abandoned) 10Jforrester: wikifunctions: Update orchestrator from 2025-06-18-130945 to 2025-06-24-204920 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163787 (https://phabricator.wikimedia.org/T391208) (owner: 10Jforrester)
[14:04:25] <wikibugs>	 (03PS1) 10Volans: redfish: actually support iDRAC 10 for SCP [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163788 (https://phabricator.wikimedia.org/T392851)
[14:04:38] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10947300 (10Jclark-ctr)
[14:04:39] <logmsgbot>	 !log esanders@deploy1003 esanders: Continuing with sync
[14:04:47] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T395241)', diff saved to https://phabricator.wikimedia.org/P78684 and previous config saved to /var/cache/conftool/dbconfig/20250625-140446-fceratto.json
[14:05:05] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10947301 (10Volans) By trial and error with Luca we found that the Target parameter wants a list now. Sent new fix.
[14:05:22] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:05:29] <wikibugs>	 06SRE, 06collaboration-services, 10Observability-Alerting, 13Patch-For-Review, 10SRE Observability (FY2025/2026-Q1): create a new place for prometheus/alertmanager checks not tied to physical machines - https://phabricator.wikimedia.org/T397264#10947307 (10lmata)
[14:05:53] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:07:45] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:08:23] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:08:37] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:09:34] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:09:39] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10SRE Observability (FY2025/2026-Q1): librenms-syslog leaks memory - https://phabricator.wikimedia.org/T397427#10947319 (10lmata)
[14:10:04] <wikibugs>	 (03PS7) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985)
[14:10:58] <wikibugs>	 (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Update evaluators from 2025-06-17-205547 to 2025-06-23-151702 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163784 (https://phabricator.wikimedia.org/T391208) (owner: 10Genoveva Galarza)
[14:11:11] <logmsgbot>	 !log esanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163768|ArticleTarget: Avoid using chained promises with different return values (T397818)]] (duration: 11m 40s)
[14:11:17] <stashbot>	 T397818: "Invalid response from server" when switching to VE source mode - https://phabricator.wikimedia.org/T397818
[14:11:28] <icinga-wm>	 PROBLEM - Host wikikube-worker1243 is DOWN: PING CRITICAL - Packet loss = 100%
[14:12:05] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.89
[14:12:08] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.8a
[14:12:37] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-06-17-205547 to 2025-06-23-151702 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163784 (https://phabricator.wikimedia.org/T391208) (owner: 10Genoveva Galarza)
[14:13:45] <wikibugs>	 (03PS8) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985)
[14:13:56] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:14:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker1243.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1243.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:14:34] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:15:11] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1045.eqiad.wmnet
[14:15:27] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:16:08] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:16:18] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:16:26] <wikibugs>	 (03PS9) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985)
[14:17:08] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:17:30] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db2161 T397164', diff saved to https://phabricator.wikimedia.org/P78685 and previous config saved to /var/cache/conftool/dbconfig/20250625-141729-ladsgroup.json
[14:17:35] <stashbot>	 T397164: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T397164
[14:17:42] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2048.codfw.wmnet
[14:17:43] <wikibugs>	 (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-06-18-130945 to 2025-06-24-204920 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163775 (https://phabricator.wikimedia.org/T391208) (owner: 10Genoveva Galarza)
[14:19:22] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-06-18-130945 to 2025-06-24-204920 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163775 (https://phabricator.wikimedia.org/T391208) (owner: 10Genoveva Galarza)
[14:19:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P78686 and previous config saved to /var/cache/conftool/dbconfig/20250625-141953-fceratto.json
[14:20:29] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:20:48] <logmsgbot>	 !log gengh@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:21:01] <wikibugs>	 (03PS3) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696)
[14:21:09] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:21:15] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1045.eqiad.wmnet
[14:21:34] <logmsgbot>	 !log gengh@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:21:44] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:21:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[14:22:07] <logmsgbot>	 !log gengh@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:22:29] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[14:23:25] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2048.codfw.wmnet
[14:25:08] <wikibugs>	 (03CR) 10Elukey: [C:03+1] redfish: actually support iDRAC 10 for SCP [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163788 (https://phabricator.wikimedia.org/T392851) (owner: 10Volans)
[14:25:33] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.8a
[14:25:35] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.8b
[14:27:08] <wikibugs>	 (03CR) 10Volans: [C:03+2] redfish: actually support iDRAC 10 for SCP [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163788 (https://phabricator.wikimedia.org/T392851) (owner: 10Volans)
[14:27:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse)
[14:28:57] <hnowlan>	 swfrench-wmf: would you mind if I snuck in a mobileapps deploy? 
[14:29:44] <swfrench-wmf>	 hnowlan: go for it! I'm in a holding pattern for a moment and will likely revert conf2006 shortly :)
[14:29:57] <hnowlan>	 ah, okay! 
[14:30:10] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mobileapps: set num_workers to 0, triple replicas in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163772 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[14:30:22] <wikibugs>	 (03PS4) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696)
[14:30:54] <wikibugs>	 (03PS1) 10Scott French: Revert "hieradata: pilot cfssl/pki for nginx on conf2006" [puppet] - 10https://gerrit.wikimedia.org/r/1163798 (https://phabricator.wikimedia.org/T352245)
[14:31:24] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[14:31:47] <wikibugs>	 (03CR) 10Herron: [C:03+1] centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse)
[14:31:50] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: set num_workers to 0, triple replicas in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163772 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[14:32:29] <wikibugs>	 (03CR) 10Scott French: [C:03+2] Revert "hieradata: pilot cfssl/pki for nginx on conf2006" [puppet] - 10https://gerrit.wikimedia.org/r/1163798 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French)
[14:33:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1400)
[14:33:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1430)
[14:33:22] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[14:34:37] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[14:35:10] <wikibugs>	 (03PS1) 10Ayounsi: reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800
[14:36:01] <wikibugs>	 (03Merged) 10jenkins-bot: redfish: actually support iDRAC 10 for SCP [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163788 (https://phabricator.wikimedia.org/T392851) (owner: 10Volans)
[14:37:08] <wikibugs>	 (03PS1) 10JHathaway: Add vendor exclusion to DHCPConfMac [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163801
[14:38:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 23.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:39:38] <claime>	 That's probably the mobileapps redeploy cc hnowlan ^
[14:39:43] <swfrench-wmf>	 !log reverted etcd tlsproxy to cergen certs on conf2006 - T352245
[14:39:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[14:39:44] <jinxer-wm>	 Deployment mobileapps-canary in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-canary - ...
[14:39:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:48] <stashbot>	 T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245
[14:39:59] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1044.eqiad.wmnet
[14:40:01] <claime>	 We should wait a bit to see how it stabilizes, and maybe up the replica count
[14:40:03] <hnowlan>	 claime: erk, looking
[14:40:14] <wikibugs>	 (03PS1) 10Clare Ming: xLab: Deploy v0.7.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163803 (https://phabricator.wikimedia.org/T396151)
[14:40:20] <claime>	 hnowlan: rps is going back down already
[14:40:22] <hnowlan>	 already dropping but yeah, probably worth making a change once things level out
[14:40:29] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.8b
[14:40:32] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.8c
[14:41:10] <hnowlan>	 RPS is still climbing on mobileapps so we'll see 
[14:41:16] <claime>	 ack
[14:42:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi)
[14:43:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int releases routed via main at eqiad: 23.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:43:20] <wikibugs>	 (03PS4) 10Vgutierrez: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484)
[14:45:22] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v0.7.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163803 (https://phabricator.wikimedia.org/T396151) (owner: 10Clare Ming)
[14:45:50] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1044.eqiad.wmnet
[14:46:07] <wikibugs>	 (03PS2) 10Ayounsi: reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800
[14:46:37] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2049.codfw.wmnet
[14:46:49] <wikibugs>	 (03Merged) 10jenkins-bot: xLab: Deploy v0.7.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163803 (https://phabricator.wikimedia.org/T396151) (owner: 10Clare Ming)
[14:46:50] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1043.eqiad.wmnet
[14:46:51] <wikibugs>	 (03CR) 10Ssingh: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[14:47:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add vendor exclusion to DHCPConfMac [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163801 (owner: 10JHathaway)
[14:47:33] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[14:47:41] <wikibugs>	 (03PS3) 10Effie Mouzeli: site.pp: make wikikube-worker-exp2001 a k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1160238 (https://phabricator.wikimedia.org/T276994)
[14:47:53] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] site.pp: make wikikube-worker-exp2001 a k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1160238 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli)
[14:48:06] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[14:48:40] <swfrench-wmf>	 !log incrementally restarting confds in codfw, ulsfo, eqsin - T352245
[14:48:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:46] <stashbot>	 T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245
[14:50:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:52:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi)
[14:52:34] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2049.codfw.wmnet
[14:52:53] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1043.eqiad.wmnet
[14:54:14] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.8c
[14:54:16] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.8d
[14:57:36] <wikibugs>	 (03PS5) 10Vgutierrez: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484)
[14:58:14] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[15:00:49] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1042.eqiad.wmnet
[15:04:32] <wikibugs>	 (03CR) 10Ssingh: "Looking good, thanks for working on it!" [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[15:05:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:51] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1042.eqiad.wmnet
[15:06:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Add db2161', diff saved to https://phabricator.wikimedia.org/P78689 and previous config saved to /var/cache/conftool/dbconfig/20250625-150657-fceratto.json
[15:07:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:08:34] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.8d
[15:08:37] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.8e
[15:10:14] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2050.codfw.wmnet
[15:10:50] <wikibugs>	 (03CR) 10Cathal Mooney: "Ok, let me do that elsewhere and rebase and see if I can mangle it that way." [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[15:12:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2161', diff saved to https://phabricator.wikimedia.org/P78690 and previous config saved to /var/cache/conftool/dbconfig/20250625-151210-fceratto.json
[15:12:45] <wikibugs>	 (03PS1) 10Aude: Update the chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163812
[15:14:34] <wikibugs>	 (03PS1) 10Cathal Mooney: Sretest: remove temporary additions testing dns repo stuff [puppet] - 10https://gerrit.wikimedia.org/r/1163813 (https://phabricator.wikimedia.org/T362985)
[15:14:49] <logmsgbot>	 !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{wikikube-worker[1076-1168,1240-1289,1291-1327].eqiad.wmnet,wikikube-worker-exp1001.eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[15:15:20] <wikibugs>	 (03PS2) 10Jgiannelos: mobileapps: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163351
[15:15:39] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] mobileapps: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163351 (owner: 10Jgiannelos)
[15:15:40] <wikibugs>	 (03PS5) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696)
[15:16:06] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2050.codfw.wmnet
[15:16:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:16:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[15:16:51] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Sretest: remove temporary additions testing dns repo stuff [puppet] - 10https://gerrit.wikimedia.org/r/1163813 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[15:17:24] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163351 (owner: 10Jgiannelos)
[15:19:44] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1252-1289,1291-1327].eqiad.wmnet,wikikube-worker-exp1001.eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[15:20:13] <wikibugs>	 (03PS1) 10Abijeet Patro: Mobile editor: restore VE toolbar position [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163814 (https://phabricator.wikimedia.org/T397840)
[15:20:52] <wikibugs>	 (03PS2) 10Volans: kubernetes: add a new kubernetes section [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696)
[15:20:53] <wikibugs>	 (03PS2) 10Volans: kubernetes: add API to update data [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696)
[15:21:31] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Sretest: remove temporary additions testing dns repo stuff [puppet] - 10https://gerrit.wikimedia.org/r/1163813 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[15:22:06] <wikibugs>	 (03PS6) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[15:22:39] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:23:05] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:23:07] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.8e
[15:23:10] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.8f
[15:23:23] <wikibugs>	 06SRE, 10SRE-Access-Requests: Remove volunteer access from analytics-privatedata-users group - https://phabricator.wikimedia.org/T397850 (10mmartorana) 03NEW
[15:23:38] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[15:24:10] <wikibugs>	 (03PS10) 10Cathal Mooney: Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985)
[15:24:17] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1179.eqiad.wmnet with OS bullseye
[15:24:37] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[15:24:39] <wikibugs>	 (03PS6) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696)
[15:24:56] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2161 - Depooling to then set weight
[15:25:03] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2161 - Depooling to then set weight
[15:25:04] <wikibugs>	 (03CR) 10Cwhite: [C:03+1] centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse)
[15:25:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[15:25:58] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.remove-downtime for 14 hosts
[15:26:04] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 14 hosts
[15:26:56] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1166-1168,1240-1242,1244-1251].eqiad.wmnet
[15:26:59] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1166-1168,1240-1242,1244-1251].eqiad.wmnet
[15:30:05] <wikibugs>	 (03CR) 10Cathal Mooney: "Ok thanks, sorry it's a long way from production ready, submitting it a bit earlier than I would to test with test-cookbook.  Great to get" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney)
[15:30:21] <wikibugs>	 (03CR) 10Ssingh: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[15:30:47] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[15:31:44] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox
[15:32:03] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[15:33:14] <wikibugs>	 (03PS7) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[15:33:31] <wikibugs>	 (03CR) 10Ssingh: "Updates Hosts: in commit message to fail fast to debug; will revert later." [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[15:33:45] <wikibugs>	 10ops-eqiad, 06DC-Ops, 06serviceops: hw troubleshooting: Backplane failure for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T397851 (10Clement_Goubert) 03NEW
[15:34:08] <wikibugs>	 (03PS7) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696)
[15:34:16] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:34:42] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[15:34:43] <wikibugs>	 (03PS3) 10Ladsgroup: tables-catalog: Fix visibility of four tables based on maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1155336 (https://phabricator.wikimedia.org/T363581)
[15:34:44] <logmsgbot>	 !log cgoubert@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wikikube-worker1243.eqiad.wmnet with reason: hw failure
[15:34:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[15:34:59] <wikibugs>	 (03PS8) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[15:34:59] <claime>	 !log homer "cr*eqiad*" commit 'wikikube-worker1243 failed'
[15:35:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:43] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[15:36:06] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "PCC looks happy so I am a mere mortal." [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[15:36:27] <logmsgbot>	 !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc2002.codfw.wmnet with reason: Decom
[15:37:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:37:57] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.8f
[15:37:59] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.90
[15:38:31] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:38:40] <wikibugs>	 (03PS9) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[15:38:45] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:38:56] <wikibugs>	 (03PS1) 10AOkoth: doc: decom doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130)
[15:39:01] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:39:14] <wikibugs>	 (03PS8) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696)
[15:39:59] <wikibugs>	 (03PS2) 10AOkoth: doc: decom doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130)
[15:40:49] <logmsgbot>	 stevemunene@cumin1002 reimage (PID 209642) is awaiting input
[15:41:19] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Authdns: clone new netbox-generated DNS records repo [puppet] - 10https://gerrit.wikimedia.org/r/1163766 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney)
[15:42:33] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[15:42:48] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-esams and A:cp - 9.2.11 upgrade (T397456)
[15:42:54] <stashbot>	 T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456
[15:43:09] <wikibugs>	 (03PS4) 10Ladsgroup: tables-catalog: Fix visibility of four tables based on maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1155336 (https://phabricator.wikimedia.org/T363581)
[15:43:12] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[15:43:14] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Fix visibility of four tables based on maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/1155336 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup)
[15:43:31] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:43:58] <akosiaris>	 !log deploy GlobalNetworkPolicy targeting kube-dns by service on aux-k8s, dse-k8s, ml-serve, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1161535
[15:44:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:50] <wikibugs>	 10ops-eqiad, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397852 (10phaultfinder) 03NEW
[15:45:05] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[15:45:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Update contract end date for toluayo [puppet] - 10https://gerrit.wikimedia.org/r/1163817
[15:45:22] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[15:45:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T391056)', diff saved to https://phabricator.wikimedia.org/P78692 and previous config saved to /var/cache/conftool/dbconfig/20250625-154529-fceratto.json
[15:45:35] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[15:45:41] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:46:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T391056)', diff saved to https://phabricator.wikimedia.org/P78693 and previous config saved to /var/cache/conftool/dbconfig/20250625-154637-fceratto.json
[15:46:40] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2051.codfw.wmnet
[15:47:11] <topranks>	 !log run puppet on dns3003 to clone new repo with netbox-generated DNS records
[15:47:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:21] <wikibugs>	 10SRE-SLO, 10Observability-Metrics, 13Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#10947812 (10elukey)
[15:47:22] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-codfw and A:cp - 9.2.11 upgrade (T390912)
[15:47:28] <stashbot>	 T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912
[15:48:05] <wikibugs>	 (03CR) 10AOkoth: "I've silenced the alerting for this host so merging should not result in any noise." [puppet] - 10https://gerrit.wikimedia.org/r/1163816 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth)
[15:48:31] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:49:11] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:49:26] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:49:33] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:50:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:51:05] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:51:07] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.90
[15:51:09] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.91
[15:51:33] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:52:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Update contract end date for toluayo [puppet] - 10https://gerrit.wikimedia.org/r/1163817 (owner: 10Muehlenhoff)
[15:52:04] <wikibugs>	 (03CR) 10Btullis: [C:03+1] hdfs: Add new worker hosts to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/1163777 (https://phabricator.wikimedia.org/T397615) (owner: 10Stevemunene)
[15:52:26] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2051.codfw.wmnet
[15:52:27] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:52:36] <wikibugs>	 (03PS10) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[15:52:37] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin: allow dcops to use perccli and storcli via sudo [puppet] - 10https://gerrit.wikimedia.org/r/1161382 (https://phabricator.wikimedia.org/T395939) (owner: 10Elukey)
[15:52:39] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:53:12] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[15:53:39] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[15:53:53] <wikibugs>	 (03CR) 10Btullis: hdfs: Assign the right role to new hadoop workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163778 (https://phabricator.wikimedia.org/T397615) (owner: 10Stevemunene)
[15:54:31] <wikibugs>	 (03PS11) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[15:58:31] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:59:58] <wikibugs>	 10ops-eqiad, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T397852#10947851 (10phaultfinder)
[16:01:38] <wikibugs>	 (03PS9) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696)
[16:02:26] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1179.eqiad.wmnet with OS bullseye
[16:02:58] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1167 gradually with 4 steps - Pooling in
[16:03:02] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159599 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra)
[16:03:31] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:04:53] <wikibugs>	 (03CR) 10Elukey: [C:03+1] images: add a very simple API for the image detail [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163310 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[16:05:18] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Thanks and sorry, my bad!" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163368 (https://phabricator.wikimedia.org/T368744) (owner: 10Volans)
[16:06:00] <wikibugs>	 (03CR) 10Volans: [C:04-1] "Minor typo inline, LGTM otherwise" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[16:06:03] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.91
[16:06:06] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.92
[16:06:41] <wikibugs>	 (03CR) 10Volans: [C:03+2] images: add a very simple API for the image detail [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163310 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[16:06:51] <wikibugs>	 (03CR) 10Volans: [C:03+2] src_packages: add migration for OS model [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163368 (https://phabricator.wikimedia.org/T368744) (owner: 10Volans)
[16:06:53] <wikibugs>	 (03PS4) 10Muehlenhoff: Depend on libjs-bootstrap4 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696)
[16:07:14] <wikibugs>	 (03CR) 10Muehlenhoff: Depend on libjs-bootstrap4 (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[16:07:26] <wikibugs>	 (03Merged) 10jenkins-bot: images: add a very simple API for the image detail [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163310 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans)
[16:07:42] <wikibugs>	 (03Merged) 10jenkins-bot: src_packages: add migration for OS model [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163368 (https://phabricator.wikimedia.org/T368744) (owner: 10Volans)
[16:07:44] <wikibugs>	 (03PS10) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696)
[16:10:27] <wikibugs>	 (03PS1) 10Hnowlan: Revert "mobileapps: Deploy node20 upgrade to prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163823
[16:10:56] <wikibugs>	 (03PS11) 10Muehlenhoff: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696)
[16:11:48] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Revert "mobileapps: Deploy node20 upgrade to prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163823 (owner: 10Hnowlan)
[16:12:51] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] Revert "mobileapps: Deploy node20 upgrade to prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163823 (owner: 10Hnowlan)
[16:13:29] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[16:13:31] <jinxer-wm>	 FIRING: ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mobileapps:4102 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:13:45] <hnowlan>	 ^ working on this
[16:14:05] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[16:14:24] <wikibugs>	 (03PS12) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[16:14:32] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mobileapps: Deploy node20 upgrade to prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163823 (owner: 10Hnowlan)
[16:14:33] <jinxer-wm>	 RESOLVED: ProbeDown: Service mobileapps:4102 has failed probes (http_mobileapps_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mobileapps:4102 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:14:35] <wikibugs>	 (03CR) 10DLynch: [C:04-1] Deploy EditCheck's multi-check mode everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161937 (https://phabricator.wikimedia.org/T395519) (owner: 10Esanders)
[16:15:01] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[16:15:51] <jinxer-wm>	 FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:15:58] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[16:16:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:16:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-esams (185.15.59.149) - group Confed_esams - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_esams&var-bgp_neighbor=cr1-esams - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:16:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[16:17:46] <logmsgbot>	 !log cmooney@dns3003 START - running authdns-update
[16:18:45] <logmsgbot>	 !log cmooney@dns3003 END - running authdns-update
[16:19:39] <wikibugs>	 (03PS13) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[16:20:21] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.92
[16:20:24] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.93
[16:20:51] <jinxer-wm>	 RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:20:53] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6075/co" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[16:21:02] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 30305632 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:21:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[16:22:02] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 6403032 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:22:38] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#10947906 (10elukey) Deployed! @Jclark-ctr please test and report back if anything is missing :) Puppet is currently rolling out the change, so give it one hour to pro...
[16:22:59] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "Success on cp hosts in eqiad; re-adding all previous Hosts and running PCC again." [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[16:23:53] <wikibugs>	 (03PS14) 10Ssingh: cache::haproxy: Simplify cert configuration [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[16:25:33] <wikibugs>	 (03CR) 10Volans: [C:03+1] "Actually question inline" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/1163741 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[16:27:40] <wikibugs>	 (03PS1) 10Elukey: WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826
[16:27:43] <wikibugs>	 (03PS12) 10Volans: Unvendor Bootstrap [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[16:29:08] <wikibugs>	 (03CR) 10Volans: "question inline" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163745 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff)
[16:29:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP: add support for kubernetes [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/1163826 (owner: 10Elukey)
[16:29:25] <wikibugs>	 (03PS2) 10DLynch: Deploy EditCheck's multi-check mode everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161937 (https://phabricator.wikimedia.org/T395519) (owner: 10Esanders)
[16:29:36] <wikibugs>	 (03CR) 10DLynch: [C:03+1] Deploy EditCheck's multi-check mode everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161937 (https://phabricator.wikimedia.org/T395519) (owner: 10Esanders)
[16:31:13] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6076/c" [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[16:33:19] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1167 gradually with 4 steps - Pooling in
[16:33:21] <jinxer-wm>	 FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy  - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:33:47] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.93
[16:33:50] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.94
[16:43:06] <wikibugs>	 (03PS3) 10Volans: kubernetes: add API to update data [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696)
[16:43:12] <wikibugs>	 (03PS1) 10Hnowlan: mobileapps: use guaranteed QoS resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163828 (https://phabricator.wikimedia.org/T397750)
[16:44:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:48:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:48:30] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.94
[16:48:33] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.95
[16:54:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:56:08] <logmsgbot>	 !log cgoubert@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on P{wikikube-worker[1252-1289,1291-1327].eqiad.wmnet,wikikube-worker-exp1001.eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1700)
[17:00:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:01:56] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.95
[17:01:59] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.96
[17:09:28] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139470 (https://phabricator.wikimedia.org/T359815) (owner: 10Esanders)
[17:10:21] <hnowlan>	 jouncebot: nowandnext
[17:10:21] <jouncebot>	 For the next 0 hour(s) and 49 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1700)
[17:10:21] <jouncebot>	 In 0 hour(s) and 49 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1800)
[17:10:37] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161937 (https://phabricator.wikimedia.org/T395519) (owner: 10Esanders)
[17:13:08] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-codfw and A:cp - 9.2.11 upgrade (T390912)
[17:13:14] <stashbot>	 T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912
[17:13:17] <wikibugs>	 (03PS1) 10Ahmon Dancy: logspam.pl: Consolidate ThreadRevision unserialize() errors [puppet] - 10https://gerrit.wikimedia.org/r/1163833 (https://phabricator.wikimedia.org/T259111)
[17:14:34] <wikibugs>	 (03CR) 10Scott French: "Thanks for the reviews!" [dns] - 10https://gerrit.wikimedia.org/r/1163396 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French)
[17:14:45] <wikibugs>	 (03CR) 10Scott French: [C:03+2] wmnet: remove swift-r[ow] DYNA records and mock resources (1/3) [dns] - 10https://gerrit.wikimedia.org/r/1163396 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French)
[17:15:06] <logmsgbot>	 !log swfrench@dns1004 START - running authdns-update
[17:16:10] <logmsgbot>	 !log swfrench@dns1004 END - running authdns-update
[17:16:53] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.96
[17:16:56] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.97
[17:18:32] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-esams and A:cp - 9.2.11 upgrade (T397456)
[17:18:39] <stashbot>	 T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456
[17:21:35] <wikibugs>	 (03CR) 10Scott French: [C:03+2] hieradata: remove swift-r[ow] from service catalog (2/3) [puppet] - 10https://gerrit.wikimedia.org/r/1163397 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French)
[17:22:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[17:22:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[17:22:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:24:52] <icinga-wm>	 RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[17:25:02] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mobileapps: use guaranteed QoS resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163828 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[17:27:03] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mobileapps: use guaranteed QoS resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163828 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[17:27:40] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] "This is ready to go." [puppet] - 10https://gerrit.wikimedia.org/r/1155318 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy)
[17:27:41] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-eqiad and A:cp - 9.2.11 upgrade (T397456)
[17:27:48] <stashbot>	 T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456
[17:28:45] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: use guaranteed QoS resource allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163828 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan)
[17:29:37] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[17:30:31] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[17:31:23] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.97
[17:31:25] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.98
[17:32:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[17:32:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[17:32:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:35:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:41:07] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "Nice cleanup, much needed. I verified my own changes and left a few simple comments." [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[17:43:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10948154 (10VRiley-WMF) Unracked lvs1017 and installing the card now
[17:45:32] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.98
[17:45:35] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.99
[17:46:55] <stephanebisson>	 jouncebot nowandnext
[17:46:55] <jouncebot>	 For the next 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1700)
[17:46:56] <jouncebot>	 In 0 hour(s) and 13 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1800)
[17:49:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10948171 (10VRiley-WMF)
[17:50:58] <stephanebisson>	 I have a ContentTranslation UBN to deploy. It will affect a lot of users in group 1. Could I do it before the train?
[17:51:11] <dancy>	 Go for it
[17:52:24] <stephanebisson>	 dancy thanks. Waiting for CI. Will keep you updated.
[17:52:35] <wikibugs>	 (03PS1) 10Ssingh: hiera: cache/{text,upload}: use aliases for SANs [puppet] - 10https://gerrit.wikimedia.org/r/1163837
[17:53:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:54:43] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:54:50] <wikibugs>	 (03PS1) 10Sbisson: CX3 Build 1.0.0+20250625 [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163838 (https://phabricator.wikimedia.org/T397840)
[17:56:19] <wikibugs>	 (03CR) 10Scott French: [C:03+2] conftool-data: remove swift-r[ow] discovery entities (3/3) [puppet] - 10https://gerrit.wikimedia.org/r/1163398 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French)
[17:56:24] <wikibugs>	 (03CR) 10Ssingh: "I will run PCC on this after the parent CR is merged, otherwise the SNR is terrible." [puppet] - 10https://gerrit.wikimedia.org/r/1163837 (owner: 10Ssingh)
[17:56:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:56:39] <wikibugs>	 (03PS1) 10JHathaway: Add vendor exclusion to DHCPConfMac [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163839
[17:56:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[17:56:44] <jinxer-wm>	 Deployment mw-experimental.eqiad.pinkllama in mw-experimental at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-experimental&var-deployment=mw-experimental.eqiad.pinkllama - ...
[17:56:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:56:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:58:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:59:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163838 (https://phabricator.wikimedia.org/T397840) (owner: 10Sbisson)
[18:00:05] <jouncebot>	 jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T1800)
[18:01:09] <stephanebisson>	 FYI: I'm squeezing a ContentTranslation fix on wmf.7 before the train
[18:01:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:01:12] <logmsgbot>	 !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=1) Checking container DBs of wikipedia-commons-local-thumb.99
[18:01:14] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.9a
[18:01:39] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[18:01:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:05:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:07:44] <jeena>	 stephanebisson: Please give me a ping when it's ready for me
[18:09:08] <stephanebisson>	 jeena will do
[18:10:12] <wikibugs>	 (03PS1) 10Ssingh: P:cache::haproxy: properly indent profile (NOOP) [puppet] - 10https://gerrit.wikimedia.org/r/1163842
[18:10:17] <jeena>	 Thanks!
[18:10:45] <wikibugs>	 (03Merged) 10jenkins-bot: CX3 Build 1.0.0+20250625 [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163838 (https://phabricator.wikimedia.org/T397840) (owner: 10Sbisson)
[18:10:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:11:11] <logmsgbot>	 !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1163838|CX3 Build 1.0.0+20250625 (T397840)]]
[18:11:20] <stashbot>	 T397840: SX Mobile editor has no toolbar on test wikipedia - https://phabricator.wikimedia.org/T397840
[18:12:43] <wikibugs>	 (03PS2) 10BryanDavis: [BETA HACK] Changes to profile::puppetserver::volatile [puppet] - 10https://gerrit.wikimedia.org/r/1137013 (owner: 10Krinkle)
[18:13:31] <logmsgbot>	 !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1163838|CX3 Build 1.0.0+20250625 (T397840)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:14:24] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.9a
[18:14:24] <wikibugs>	 (03PS1) 10Ssingh: nagios_common and P:cache::haproxy: s/ats/haproxy for SSL checks [puppet] - 10https://gerrit.wikimedia.org/r/1163843
[18:14:26] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.9b
[18:14:30] <wikibugs>	 (03CR) 10BryanDavis: "PS2 is a manual rebase on Ic34f8304f9a4aa77e6ae1897cd2c0a3160363985. This will be reapplied on deployment-puppetserver-1 to resolve T39771" [puppet] - 10https://gerrit.wikimedia.org/r/1137013 (owner: 10Krinkle)
[18:15:17] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] cache::haproxy: Simplify cert configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163749 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez)
[18:15:41] <wikibugs>	 (03PS1) 10Ladsgroup: table-catalog: Fix private status of a couple of tables [puppet] - 10https://gerrit.wikimedia.org/r/1163844
[18:15:53] <logmsgbot>	 !log sbisson@deploy1003 sbisson: Continuing with sync
[18:15:56] <andre>	 I'd like to fix wrong stuff on https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests . Who do I need to bribe?
[18:16:48] <sukhe>	 depends on what stuff you are looking to fix but you can join the clinic duty channel and then decide.
[18:18:03] <andre>	 thanks
[18:18:31] <wikibugs>	 (03PS7) 10Ladsgroup: mariadb: Load list of private tables from the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581)
[18:18:45] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] table-catalog: Fix private status of a couple of tables [puppet] - 10https://gerrit.wikimedia.org/r/1163844 (owner: 10Ladsgroup)
[18:19:15] <wikibugs>	 (03CR) 10Scott French: [C:03+2] hieradata: remove swift-r[ow] SAN entries (cleanup) [puppet] - 10https://gerrit.wikimedia.org/r/1163407 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French)
[18:19:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:21:31] <logmsgbot>	 !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163838|CX3 Build 1.0.0+20250625 (T397840)]] (duration: 10m 20s)
[18:21:37] <stashbot>	 T397840: SX Mobile editor has no toolbar on test wikipedia - https://phabricator.wikimedia.org/T397840
[18:22:30] <stephanebisson>	 jeena your turn
[18:27:51] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.9b
[18:27:54] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.9c
[18:28:05] <wikibugs>	 (03PS1) 10Sbisson: CX instrumentation: Fix translation providers in desktop editor events [extensions/ContentTranslation] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163845 (https://phabricator.wikimedia.org/T395493)
[18:29:17] <wikibugs>	 (03PS1) 10Ladsgroup: [WIP] Use table catalog for fullViews [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581)
[18:29:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:30:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10948297 (10VRiley-WMF)
[18:30:36] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup)
[18:32:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10948306 (10VRiley-WMF) Inserted new NIC. Moved the server to the new location (E2, U39, Port 39), ran the netbox script, and everything went through smoothly. @BCornwall it should be ready for the...
[18:32:44] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163849 (https://phabricator.wikimedia.org/T392177)
[18:32:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163849 (https://phabricator.wikimedia.org/T392177) (owner: 10TrainBranchBot)
[18:32:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:33:34] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163849 (https://phabricator.wikimedia.org/T392177) (owner: 10TrainBranchBot)
[18:41:14] <logmsgbot>	 !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.7  refs T392177
[18:41:21] <stashbot>	 T392177: 1.45.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T392177
[18:41:44] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[18:42:32] <wikibugs>	 (03PS2) 10Ladsgroup: [WIP] Use table catalog for fullViews [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581)
[18:42:46] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.9c
[18:42:48] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.9d
[18:44:17] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup)
[18:47:50] <wikibugs>	 (03PS1) 10AOkoth: os_updates: manage stylesheet with puppet [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794)
[18:47:59] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163839 (owner: 10JHathaway)
[18:48:14] <wikibugs>	 (03CR) 10Ladsgroup: "I'm very confused, locally the result is now ordered but not in PCC" [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup)
[18:48:17] <wikibugs>	 (03PS1) 10Dwisehaupt: icinga: decommission frack hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163851 (https://phabricator.wikimedia.org/T397868)
[18:48:22] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163846 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup)
[18:48:56] <wikibugs>	 (03CR) 10Dwisehaupt: [C:04-1] "Marking -1 until machines are powered off and ready for decom." [puppet] - 10https://gerrit.wikimedia.org/r/1163851 (https://phabricator.wikimedia.org/T397868) (owner: 10Dwisehaupt)
[18:49:12] <wikibugs>	 (03PS1) 10Eevans: cassandra-dev200[23]: setup for (no reuse) reimaging [puppet] - 10https://gerrit.wikimedia.org/r/1163852 (https://phabricator.wikimedia.org/T391544)
[18:49:14] <wikibugs>	 (03PS1) 10Eevans: cassandra-dev2002: updated data_file_directories list [puppet] - 10https://gerrit.wikimedia.org/r/1163853 (https://phabricator.wikimedia.org/T391544)
[18:49:15] <wikibugs>	 (03PS1) 10Eevans: cassandra-dev2003: updated data_file_directories list [puppet] - 10https://gerrit.wikimedia.org/r/1163854 (https://phabricator.wikimedia.org/T391544)
[18:50:59] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-eqiad and A:cp - 9.2.11 upgrade (T397456)
[18:51:04] <stashbot>	 T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456
[18:51:44] <wikibugs>	 (03PS2) 10AOkoth: os_updates: manage stylesheet with puppet [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794)
[18:52:22] <wikibugs>	 (03CR) 10Eevans: [C:03+2] cassandra-dev200[23]: setup for (no reuse) reimaging [puppet] - 10https://gerrit.wikimedia.org/r/1163852 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)
[18:55:19] <wikibugs>	 (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1163850/6079/" [puppet] - 10https://gerrit.wikimedia.org/r/1163850 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[18:56:17] <wikibugs>	 (03PS1) 10Ssingh: P:bird and C:bird::anycast: support exporting Prom metrics [puppet] - 10https://gerrit.wikimedia.org/r/1163858 (https://phabricator.wikimedia.org/T374619)
[18:56:20] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.9d
[18:56:23] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.9e
[18:57:28] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6080/console" [puppet] - 10https://gerrit.wikimedia.org/r/1163858 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[18:57:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:58:03] <wikibugs>	 (03PS3) 10JHathaway: reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi)
[18:58:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:59:35] <wikibugs>	 (03PS1) 10Ssingh: hiera: enable exporting prom metrics from doh1001 for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619)
[19:00:04] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hiera: enable exporting prom metrics from doh1001 for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[19:00:46] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619) (owner: 10Ssingh)
[19:00:57] <wikibugs>	 (03PS2) 10Ssingh: hiera: enable exporting prom metrics from doh1001 for anycast-hc [puppet] - 10https://gerrit.wikimedia.org/r/1163859 (https://phabricator.wikimedia.org/T374619)
[19:04:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi)
[19:06:40] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host cassandra-dev2002.codfw.wmnet with OS bullseye
[19:06:53] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10948431 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host cassandra-dev2002....
[19:08:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:10:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:12:07] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.9e
[19:12:09] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.9f
[19:20:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:21:22] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bookworm
[19:21:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:23:04] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage
[19:25:46] <wikibugs>	 (03PS3) 10Scott French: hieradata: remove mw-wikifunctions discovery services [puppet] - 10https://gerrit.wikimedia.org/r/1163856 (https://phabricator.wikimedia.org/T384944)
[19:26:00] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.9f
[19:26:03] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a0
[19:26:24] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage
[19:26:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:26:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:28:48] <wikibugs>	 (03CR) 10Scott French: "I happened to notice this while working on the swift-r[ow] turndown earlier today." [puppet] - 10https://gerrit.wikimedia.org/r/1163856 (https://phabricator.wikimedia.org/T384944) (owner: 10Scott French)
[19:34:59] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1003.eqiad.wmnet with OS bookworm
[19:38:41] <wikibugs>	 (03PS1) 10Andrew Bogott: keystone policy: allow object_storage role to create/delete ec2 creds [puppet] - 10https://gerrit.wikimedia.org/r/1163864 (https://phabricator.wikimedia.org/T396594)
[19:40:26] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a0
[19:40:29] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a1
[19:41:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:44:28] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2002.codfw.wmnet with OS bullseye
[19:44:46] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10948499 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host cassandra-dev2002.codf...
[19:45:41] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[19:46:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:48:10] <wikibugs>	 (03CR) 10Eevans: [C:03+2] cassandra-dev2002: updated data_file_directories list [puppet] - 10https://gerrit.wikimedia.org/r/1163853 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)
[19:49:52] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm
[19:50:59] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2004.codfw.wmnet with OS bookworm
[19:54:14] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a1
[19:54:17] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a2
[19:57:27] <wikibugs>	 (03CR) 10Ebomani: [C:03+1] "Looks good to me! Tested and verified that for the new (non-legacy) Patchdemo related changes we get redirect links in the 'Checks' tab to" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1163289 (https://phabricator.wikimedia.org/T391866) (owner: 10Jeena Huneidi)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T2000).
[20:00:05] <jouncebot>	 arlolra and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:19] <Kemayo>	 o/
[20:00:47] <wikibugs>	 (03PS4) 10JHathaway: reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi)
[20:01:09] <cjming>	 hi - if a deployer is needed, i can deploy
[20:01:17] <arlolra>	 here
[20:01:26] <arlolra>	 I can handle my deploy
[20:01:31] <Kemayo>	 I'm fine doing mine, too.
[20:01:42] <arlolra>	 Kemayo: I'll get started?
[20:01:49] <Kemayo>	 arlolra: Go for it, you're first in the list.
[20:01:55] <arlolra>	 Ok
[20:02:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159599 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra)
[20:03:22] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy VipsScaler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159599 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra)
[20:03:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:47] <logmsgbot>	 !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1159599|Undeploy VipsScaler (T290759)]]
[20:03:53] <stashbot>	 T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759
[20:04:33] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm
[20:04:55] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2004.codfw.wmnet with OS bookworm
[20:06:21] <wikibugs>	 10SRE-swift-storage, 06serviceops, 07Datacenter-Switchover: Turn down unused swift-r[ow] discovery services - https://phabricator.wikimedia.org/T376237#10948585 (10Scott_French) 05Open→03Resolved This is done now. Thanks for the reviews, all!
[20:06:33] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm
[20:06:54] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2004.codfw.wmnet with OS bookworm
[20:07:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] reimage: add MAC address support for physical hosts - try #2 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163800 (owner: 10Ayounsi)
[20:08:55] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a2
[20:08:58] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a3
[20:09:41] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm
[20:09:43] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2004.codfw.wmnet with OS bookworm
[20:10:19] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 603.88 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:10:52] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm
[20:10:57] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2004.codfw.wmnet with OS bookworm
[20:13:29] <jinxer-wm>	 FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:13:54] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm
[20:16:11] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host cassandra-dev2003.codfw.wmnet with OS bullseye
[20:16:30] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10948605 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host cassandra-dev2003....
[20:16:59] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Disable read only backups and reenable regular rw es backups [puppet] - 10https://gerrit.wikimedia.org/r/1163694 (https://phabricator.wikimedia.org/T387892)
[20:18:58] <arlolra>	 Hmm, it seems to be "Building container images" for an inordinate amount of time
[20:19:09] <arlolra>	 cjming: Kemayo: any ideas?
[20:19:51] <Kemayo>	 arlolra: I haven't seen a stall on that particular one before, sorry.
[20:19:57] <cjming>	 me neither
[20:20:24] <wikibugs>	 (03PS3) 10BryanDavis: [BETA HACK] Changes to profile::puppetserver::volatile [puppet] - 10https://gerrit.wikimedia.org/r/1137013 (owner: 10Krinkle)
[20:20:24] <dancy>	 arlolra: Looks like localisation files were rebuilt:   `537 languages rebuilt out of 537`
[20:20:36] <Kemayo>	 Aha, the sympathetic magic of asking about it has caused it to progress.
[20:20:42] <arlolra>	 :)
[20:20:50] <dancy>	 That results in several gigabytes of data being generated which takes a long time to containerize and sync.
[20:21:26] <arlolra>	 dancy: thanks.  Was that from /var/lib/spiderpig/scap-image-build-and-push-log ?
[20:21:43] <dancy>	 I looked in the job log: https://spiderpig.wikimedia.org/jobs/253
[20:22:26] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a3
[20:22:28] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a4
[20:22:30] <arlolra>	 Ok
[20:22:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [BETA HACK] Changes to profile::puppetserver::volatile [puppet] - 10https://gerrit.wikimedia.org/r/1137013 (owner: 10Krinkle)
[20:22:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:22:57] <wikibugs>	 (03CR) 10Volans: "Alternative approach suggestion inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163732 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi)
[20:23:23] <aude>	 I would like to deploy an update to the chart-renderer service https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1163812 (maybe when backports are done, though the service deploy is unrelated)
[20:23:29] <dancy>	 arlolra: The usual cause of this is a direct change to a localisation json file.  But in this case the change is indirect due to the removal of an extension and its associated l10n files.
[20:24:46] <arlolra>	 I see
[20:25:00] <dancy>	 aude: A parallel deployment should be fine.  We're just doing a lot of waiting at the moment.
[20:25:10] <aude>	 ok thanks
[20:25:27] <wikibugs>	 (03CR) 10Aude: [C:03+2] Update the chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163812 (owner: 10Aude)
[20:27:05] <wikibugs>	 (03Merged) 10jenkins-bot: Update the chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163812 (owner: 10Aude)
[20:27:41] <logmsgbot>	 !log aude@deploy1003 helmfile [staging] START helmfile.d/services/chart-renderer: apply
[20:27:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:28:18] <logmsgbot>	 !log aude@deploy1003 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply
[20:28:58] <wikibugs>	 (03CR) 10Eevans: [C:03+2] cassandra-dev2003: updated data_file_directories list [puppet] - 10https://gerrit.wikimedia.org/r/1163854 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)
[20:30:28] <logmsgbot>	 !log aude@deploy1003 helmfile [codfw] START helmfile.d/services/chart-renderer: apply
[20:31:03] <logmsgbot>	 !log aude@deploy1003 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
[20:31:35] <logmsgbot>	 !log aude@deploy1003 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply
[20:32:07] <logmsgbot>	 !log aude@deploy1003 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply
[20:32:17] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage
[20:32:45] <wikibugs>	 (03PS1) 10Jgreen: Add payments-a-eqiad.wikimedia.org A/PTR records. [dns] - 10https://gerrit.wikimedia.org/r/1163876 (https://phabricator.wikimedia.org/T397865)
[20:33:04] <aude>	 I'm done. looks good and will be around to monitor
[20:34:41] <logmsgbot>	 !log arlolra@deploy1003 arlolra: Backport for [[gerrit:1159599|Undeploy VipsScaler (T290759)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:34:47] <stashbot>	 T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759
[20:36:14] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage
[20:36:30] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage
[20:37:05] <Kemayo>	 Huh, I hadn't noticed before that people who aren't the user who started the spiderpig run also get the option to answer the "continue with sync?" question. :D
[20:37:10] <logmsgbot>	 !log arlolra@deploy1003 arlolra: Continuing with sync
[20:37:13] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a4
[20:37:16] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a5
[20:38:39] <dancy>	 Kemayo: The commit owner is notified too
[20:40:03] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage
[20:45:35] <Kemayo>	 dancy: I'm not that either, thus my surprise as a completely-unrelated user.
[20:46:01] <wikibugs>	 07Puppet, 10Beta-Cluster-Infrastructure: /usr/local/bin/puppetserver-deploy-code emits scary looking error messages during a `git rebase` operation - https://phabricator.wikimedia.org/T397877 (10bd808) 03NEW
[20:46:20] <dancy>	 Oh interesting.  Can you point me to the notification you're talking about?
[20:47:09] <Kemayo>	 Not a notification. When you're looking at https://spiderpig.wikimedia.org/ you see the "continue with sync? [yes] [no]" prompt inside the job-history on the currently-running job.
[20:47:52] <dancy>	 ooh, gotcha. Any user can respond to an interaction. That's right.  That's a deliberate behavior.
[20:48:33] <Kemayo>	 I figured it made sense as a way to avoid everything getting stuck because someone wandered away, it just caught me by surprise for a second. :D
[20:48:41] <dancy>	 Nod.
[20:50:33] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a5
[20:50:36] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a6
[20:51:25] <logmsgbot>	 !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159599|Undeploy VipsScaler (T290759)]] (duration: 47m 37s)
[20:51:31] <stashbot>	 T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759
[20:51:43] <arlolra>	 Kemayo: sorry to have used up so much of the window
[20:51:54] <Kemayo>	 arlolra: It was more than I expected, but no worries.
[20:52:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139470 (https://phabricator.wikimedia.org/T359815) (owner: 10Esanders)
[20:52:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161937 (https://phabricator.wikimedia.org/T395519) (owner: 10Esanders)
[20:53:00] <wikibugs>	 (03Merged) 10jenkins-bot: Enable VE in Project (Wikipedia/Վիքիպեդիա) namespace at hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139470 (https://phabricator.wikimedia.org/T359815) (owner: 10Esanders)
[20:53:05] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy EditCheck's multi-check mode everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161937 (https://phabricator.wikimedia.org/T395519) (owner: 10Esanders)
[20:53:29] <logmsgbot>	 !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1139470|Enable VE in Project (Wikipedia/Վիքիպեդիա) namespace at hywiki (T359815)]], [[gerrit:1161937|Deploy EditCheck's multi-check mode everywhere (T395519)]]
[20:53:37] <stashbot>	 T359815: Enable Visual Editor on Wikipedia namespace on Armenian Wikipedia - https://phabricator.wikimedia.org/T359815
[20:53:37] <stashbot>	 T395519: [Multi-Check] Deploy Multi-Check (References) to all Wikipedias - https://phabricator.wikimedia.org/T395519
[20:56:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[20:58:07] <logmsgbot>	 !log kemayo@deploy1003 kemayo, esanders: Backport for [[gerrit:1139470|Enable VE in Project (Wikipedia/Վիքիպեդիա) namespace at hywiki (T359815)]], [[gerrit:1161937|Deploy EditCheck's multi-check mode everywhere (T395519)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:59:01] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2004.codfw.wmnet with OS bookworm
[20:59:36] <logmsgbot>	 !log kemayo@deploy1003 kemayo, esanders: Continuing with sync
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T2100)
[21:01:19] <wikibugs>	 (03CR) 10Dwisehaupt: [C:03+1] Add payments-a-eqiad.wikimedia.org A/PTR records. [dns] - 10https://gerrit.wikimedia.org/r/1163876 (https://phabricator.wikimedia.org/T397865) (owner: 10Jgreen)
[21:01:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:02:23] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2003.codfw.wmnet with OS bullseye
[21:02:38] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10948772 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host cassandra-dev2003.codf...
[21:03:46] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a6
[21:03:49] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a7
[21:06:15] <wikibugs>	 (03CR) 10Jgreen: [C:03+2] Add payments-a-eqiad.wikimedia.org A/PTR records. [dns] - 10https://gerrit.wikimedia.org/r/1163876 (https://phabricator.wikimedia.org/T397865) (owner: 10Jgreen)
[21:06:30] <logmsgbot>	 !log jgreen@dns1004 START - running authdns-update
[21:06:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:07:08] <logmsgbot>	 !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139470|Enable VE in Project (Wikipedia/Վիքիպեդիա) namespace at hywiki (T359815)]], [[gerrit:1161937|Deploy EditCheck's multi-check mode everywhere (T395519)]] (duration: 13m 38s)
[21:07:14] <stashbot>	 T359815: Enable Visual Editor on Wikipedia namespace on Armenian Wikipedia - https://phabricator.wikimedia.org/T359815
[21:07:15] <stashbot>	 T395519: [Multi-Check] Deploy Multi-Check (References) to all Wikipedias - https://phabricator.wikimedia.org/T395519
[21:07:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:07:33] <logmsgbot>	 !log jgreen@dns1004 END - running authdns-update
[21:12:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:13:20] <wikibugs>	 (03PS1) 10BryanDavis: puppetserver: check for rebase in puppetserver-deploy-code [puppet] - 10https://gerrit.wikimedia.org/r/1163883 (https://phabricator.wikimedia.org/T397877)
[21:14:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:16:54] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a7
[21:16:57] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a8
[21:21:33] <wikibugs>	 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: /usr/local/bin/puppetserver-deploy-code emits scary looking error messages during a `git rebase` operation - https://phabricator.wikimedia.org/T397877#10948815 (10bd808) `lang=shell-session bd808@deployment-puppetserver-1:~$ sudo -i puppet agent -t...
[21:24:18] <wikibugs>	 (03PS1) 10Andrew Bogott: Cloudcephosd200[456]-dev: make ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1163884 (https://phabricator.wikimedia.org/T397237)
[21:24:59] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Cloudcephosd200[456]-dev: make ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1163884 (https://phabricator.wikimedia.org/T397237) (owner: 10Andrew Bogott)
[21:26:09] <wikibugs>	 (03CR) 10BryanDavis: [V:03+1] "Cherry-picked to deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud and tested for desired behavior. See T397877#10948815 fo" [puppet] - 10https://gerrit.wikimedia.org/r/1163883 (https://phabricator.wikimedia.org/T397877) (owner: 10BryanDavis)
[21:26:59] <wikibugs>	 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: /usr/local/bin/puppetserver-deploy-code emits scary looking error messages during a `git rebase` operation - https://phabricator.wikimedia.org/T397877#10948820 (10bd808) 05Open→03In progress p:05Triage→03Medium a:03bd808
[21:27:43] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse)
[21:31:22] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a8
[21:31:25] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.a9
[21:31:26] <wikibugs>	 (03PS1) 10Andrew Bogott: Add hiera for new cloudcephosd nodes in codfw1 [puppet] - 10https://gerrit.wikimedia.org/r/1163885 (https://phabricator.wikimedia.org/T397237)
[21:32:06] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Add hiera for new cloudcephosd nodes in codfw1 [puppet] - 10https://gerrit.wikimedia.org/r/1163885 (https://phabricator.wikimedia.org/T397237) (owner: 10Andrew Bogott)
[21:33:32] <wikibugs>	 (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1163886
[21:34:05] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1163886 (owner: 10Ahmon Dancy)
[21:34:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:35:01] <wikibugs>	 (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1163886 (owner: 10Ahmon Dancy)
[21:35:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:36:10] <wikibugs>	 (03PS1) 10Ahmon Dancy: DevServices.php: Add placeholder for search-chi-dnsdisc [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1163888
[21:36:23] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+2] DevServices.php: Add placeholder for search-chi-dnsdisc [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1163888 (owner: 10Ahmon Dancy)
[21:37:29] <wikibugs>	 (03Merged) 10jenkins-bot: DevServices.php: Add placeholder for search-chi-dnsdisc [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1163888 (owner: 10Ahmon Dancy)
[21:37:47] <wikibugs>	 (03PS1) 10Andrew Bogott: Cloudcephosd200[567]-dev: puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1163889 (https://phabricator.wikimedia.org/T397237)
[21:38:27] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Cloudcephosd200[567]-dev: puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1163889 (https://phabricator.wikimedia.org/T397237) (owner: 10Andrew Bogott)
[21:41:32] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2005-dev.codfw.wmnet with OS bullseye
[21:47:19] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.a9
[21:47:22] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.aa
[21:54:43] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:55:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:56:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:57:56] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2005-dev.codfw.wmnet with reason: host reimage
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250625T2200)
[22:03:28] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bookworm
[22:03:37] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.aa
[22:03:37] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2005-dev.codfw.wmnet with reason: host reimage
[22:03:40] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ab
[22:16:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:18:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:18:46] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ab
[22:18:48] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ac
[22:20:09] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[22:21:02] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2005-dev.codfw.wmnet with OS bullseye
[22:23:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:24:04] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[22:24:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:25:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[22:27:04] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2006-dev.codfw.wmnet with OS bullseye
[22:27:06] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[22:29:47] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:29:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:32:50] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ac
[22:32:52] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ad
[22:37:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:40:55] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bookworm
[22:41:59] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mobileapps-production in mobileapps at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[22:42:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:43:44] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2006-dev.codfw.wmnet with reason: host reimage
[22:43:52] <logmsgbot>	 !log andrew@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage
[22:45:45] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:45:53] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:46:03] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:46:38] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ad
[22:46:41] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.ae
[22:47:24] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2006-dev.codfw.wmnet with reason: host reimage
[22:51:21] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage
[22:52:53] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:53:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:54:35] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:54:43] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54082 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:58:40] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10948977 (Andrew) Currently we only have one NIC connected for each of these. Ports are scarce in that rack, so the plan (in too much detail) is:...
[23:00:15] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567]-dev - https://phabricator.wikimedia.org/T393614#10948980 (Andrew) Resolved→Open
[23:02:16] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.ae
[23:02:19] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.af
[23:03:28] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2006-dev.codfw.wmnet with OS bullseye
[23:03:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:06:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:08:47] <logmsgbot>	 !log andrew@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[23:16:12] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.af
[23:16:15] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b0
[23:21:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:22:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:27:15] <wikibugs>	 (PS1) Andrea Denisse: Revert "centrallog: Add a temporary rsyslog debug config file" [puppet] - https://gerrit.wikimedia.org/r/1163899
[23:29:20] <wikibugs>	 (CR) CI reject: [V:-1] Revert "centrallog: Add a temporary rsyslog debug config file" [puppet] - https://gerrit.wikimedia.org/r/1163899 (owner: Andrea Denisse)
[23:31:41] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.b0
[23:31:44] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b1
[23:37:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:38:31] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1163900
[23:38:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1163900 (owner: TrainBranchBot)
[23:38:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:42:29] <wikibugs>	 (PS2) Andrea Denisse: Revert "centrallog: Add a temporary rsyslog debug config file" [puppet] - https://gerrit.wikimedia.org/r/1163899
[23:45:41] <jinxer-wm>	 FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[23:46:07] <wikibugs>	 (CR) Andrea Denisse: [C:+2] Revert "centrallog: Add a temporary rsyslog debug config file" [puppet] - https://gerrit.wikimedia.org/r/1163899 (owner: Andrea Denisse)
[23:46:52] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.b1
[23:46:54] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.b2
[23:48:27] <wikibugs>	 (PS1) Andrea Denisse: Revert^2 "centrallog: Add a temporary rsyslog debug config file" [puppet] - https://gerrit.wikimedia.org/r/1163901
[23:49:39] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1163900 (owner: TrainBranchBot)
[23:58:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:59:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures