[00:00:47] <wikibugs>	 (Merged) jenkins-bot: Close crwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1214056 (https://phabricator.wikimedia.org/T411501) (owner: Zabe)
[00:01:22] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1214056|Close crwiki (T411501)]]
[00:01:26] <stashbot>	 T411501: Close crwiki and klwiki - https://phabricator.wikimedia.org/T411501
[00:01:33] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2217.codfw.wmnet with reason: Maintenance
[00:01:41] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2217 (T410589)', diff saved to https://phabricator.wikimedia.org/P86338 and previous config saved to /var/cache/conftool/dbconfig/20251203-000140-ladsgroup.json
[00:01:44] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[00:04:10] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:1214056|Close crwiki (T411501)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:05:16] <logmsgbot>	 !log zabe@deploy2002 zabe: Continuing with sync
[00:05:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:21] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214056|Close crwiki (T411501)]] (duration: 07m 59s)
[00:09:24] <stashbot>	 T411501: Close crwiki and klwiki - https://phabricator.wikimedia.org/T411501
[00:13:25] <wikibugs>	 (PS4) Zabe: Close klwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1214057 (https://phabricator.wikimedia.org/T411501)
[00:15:34] <wikibugs>	 (CR) Zabe: [C:+2] Close klwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1214057 (https://phabricator.wikimedia.org/T411501) (owner: Zabe)
[00:16:22] <wikibugs>	 (Merged) jenkins-bot: Close klwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1214057 (https://phabricator.wikimedia.org/T411501) (owner: Zabe)
[00:17:16] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1214057|Close klwiki (T411501)]]
[00:17:20] <stashbot>	 T411501: Close crwiki and klwiki - https://phabricator.wikimedia.org/T411501
[00:17:53] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981#11426536 (Dwisehaupt) @Jhancock.wm Yes I did and I can get in. I just checked again and it looks like something may have been missed. I see the host in netbox only has a mana...
[00:19:40] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:1214057|Close klwiki (T411501)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:20:41] <logmsgbot>	 !log zabe@deploy2002 zabe: Continuing with sync
[00:24:45] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214057|Close klwiki (T411501)]] (duration: 07m 29s)
[00:24:48] <stashbot>	 T411501: Close crwiki and klwiki - https://phabricator.wikimedia.org/T411501
[00:26:12] <wikibugs>	 (PS1) TrainBranchBot: group0 to 1.46.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1214183 (https://phabricator.wikimedia.org/T408275)
[00:26:15] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by zabe@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1214183 (https://phabricator.wikimedia.org/T408275) (owner: TrainBranchBot)
[00:26:37] <zabe>	 ^ moving crwiki and klwiki to wmf.5
[00:27:08] <wikibugs>	 (Merged) jenkins-bot: group0 to 1.46.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1214183 (https://phabricator.wikimedia.org/T408275) (owner: TrainBranchBot)
[00:33:31] <logmsgbot>	 !log zabe@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.5  refs T408275
[00:33:35] <stashbot>	 T408275: 1.46.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T408275
[00:35:00] <jinxer-wm>	 FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[00:40:11] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:40:14] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1214190
[00:40:14] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1214190 (owner: TrainBranchBot)
[00:45:00] <jinxer-wm>	 FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[00:48:46] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Reuven!" [puppet] - https://gerrit.wikimedia.org/r/1213604 (owner: RLazarus)
[00:50:48] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1006.eqiad.wmnet with OS trixie
[00:52:53] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1214190 (owner: TrainBranchBot)
[00:58:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:01:00] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:03:48] <wikibugs>	 (CR) Ssingh: [C:+1] "There is no harm in merging them before the other stuff is done but there isn't much value either. We can decide tomorrow!" [dns] - https://gerrit.wikimedia.org/r/1214177 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[01:04:49] <wikibugs>	 (CR) Ssingh: [C:+1] "Thanks for the patch! This should go in at the very end of the changes in a way. (I will upload a related patch for actually adding this n" [cookbooks] - https://gerrit.wikimedia.org/r/1214179 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[01:08:56] <wikibugs>	 (PS1) Ssingh: conftool-data: geodns: add gerrit-addrs [puppet] - https://gerrit.wikimedia.org/r/1214192 (https://phabricator.wikimedia.org/T365259)
[01:09:44] <wikibugs>	 SRE, Data-Platform-SRE (2025.11.07 - 2025.11.28): October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568 (RKemper) NEW
[01:10:15] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1214193
[01:10:15] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1214193 (owner: TrainBranchBot)
[01:14:31] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 30s)
[01:18:24] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster
[01:20:11] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:21:25] <logmsgbot>	 ryankemper@cumin2002 reboot-workers (PID 1626431) is awaiting input
[01:21:42] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster
[01:23:02] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster
[01:23:35] <wikibugs>	 SRE, Data-Platform-SRE (2025.11.07 - 2025.11.28): October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11426647 (RKemper) an-worker* reboots ongoing now
[01:24:47] <wikibugs>	 (PS5) RLazarus: deployment_server: Write Envoy hieradata to YAML files for sophroid [puppet] - https://gerrit.wikimedia.org/r/1213604
[01:25:00] <jinxer-wm>	 FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[01:25:23] <wikibugs>	 (CR) RLazarus: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1213604 (owner: RLazarus)
[01:33:25] <wikibugs>	 (PS1) Sbisson: Update rec-api to 2025-12-02-200719-production [deployment-charts] - https://gerrit.wikimedia.org/r/1214195
[01:35:00] <jinxer-wm>	 FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[01:36:12] <wikibugs>	 (CR) RLazarus: "Summarizing what we discussed after that:" [puppet] - https://gerrit.wikimedia.org/r/1213604 (owner: RLazarus)
[01:37:02] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1214193 (owner: TrainBranchBot)
[01:39:48] <wikibugs>	 (CR) CDanis: [C:+1] conftool-data: geodns: add gerrit-addrs [puppet] - https://gerrit.wikimedia.org/r/1214192 (https://phabricator.wikimedia.org/T365259) (owner: Ssingh)
[01:43:44] <logmsgbot>	 andrew@cumin2002 reimage (PID 1612025) is awaiting input
[02:05:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:05:25] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1006.eqiad.wmnet with OS trixie
[02:13:26] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1006.eqiad.wmnet with OS trixie
[02:27:24] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1006.eqiad.wmnet with reason: host reimage
[02:27:31] <wikibugs>	 (PS4) Krinkle: varnish: Move error message from footer to body for HTTP 4xx responses [puppet] - https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T401489)
[02:27:33] <wikibugs>	 (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T401489) (owner: Krinkle)
[02:28:07] <wikibugs>	 (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T401489) (owner: Krinkle)
[02:28:37] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.073 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.074 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.130 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:39] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.065 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:39] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.066 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:41] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.077 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:43] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.155 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1011.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1009.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1016.eqiad.wmnet, ms-fe1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:28:43] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.066 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:45] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:28:45] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:28:45] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:28:47] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:28:47] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:28:47] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:28:53] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:28:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:29:05] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:05] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.095 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:05] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.113 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:07] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.275 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:14] <sukhe>	 woah
[02:29:17] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1018.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1009.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[02:29:23] <sukhe>	 !incidents
[02:29:23] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.095 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:23] <sirenbot>	 7079 (UNACKED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[02:29:24] <sirenbot>	 7078 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) network noc (cr3-eqsin.wikimedia.org)
[02:29:24] <sirenbot>	 7077 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (cr1-codfw.wikimedia.org)
[02:29:24] <sirenbot>	 7072 (RESOLVED)  [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi)
[02:29:24] <sirenbot>	 7074 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (cr2-eqiad.wikimedia.org)
[02:29:24] <sirenbot>	 7073 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) network noc (cr1-magru.wikimedia.org)
[02:29:25] <sirenbot>	 7075 (RESOLVED)  CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: cr2-eqiad:xe-1/0/1:1 {#180180823000240:1} xe-1/0/1:1 gnmi eqiad)
[02:29:27] <sukhe>	 !ack 7079
[02:29:28] <denisse>	 !incidents
[02:29:28] <sirenbot>	 7079 (ACKED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[02:29:28] <sirenbot>	 7079 (ACKED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[02:29:28] <sirenbot>	 7078 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) network noc (cr3-eqsin.wikimedia.org)
[02:29:29] <sirenbot>	 7077 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (cr1-codfw.wikimedia.org)
[02:29:29] <sirenbot>	 7072 (RESOLVED)  [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi)
[02:29:29] <sirenbot>	 7074 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (cr2-eqiad.wikimedia.org)
[02:29:29] <sirenbot>	 7073 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) network noc (cr1-magru.wikimedia.org)
[02:29:35] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:35] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:35] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:37] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.653 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:37] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:37] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.200 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:37] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.191 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.201 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.255 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:38] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:38] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:39] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:39] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:40] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 1.622 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:43] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:29:45] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:29:45] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:29:45] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:29:45] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:29:45] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:29:45] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:29:45] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:29:46] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:29:46] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:29:47] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:29:47] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:30:12] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[02:30:13] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:30:17] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:30:21] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:27] <denisse>	 !incidents
[02:30:28] <sirenbot>	 7079 (ACKED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[02:30:28] <sirenbot>	 7080 (UNACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[02:30:28] <sirenbot>	 7081 (UNACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[02:30:28] <sirenbot>	 7078 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) network noc (cr3-eqsin.wikimedia.org)
[02:30:28] <sirenbot>	 7077 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (cr1-codfw.wikimedia.org)
[02:30:28] <sukhe>	 !ack 7080
[02:30:29] <sirenbot>	 7072 (RESOLVED)  [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi)
[02:30:29] <sirenbot>	 7074 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (cr2-eqiad.wikimedia.org)
[02:30:29] <sirenbot>	 7073 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) network noc (cr1-magru.wikimedia.org)
[02:30:29] <sirenbot>	 7075 (RESOLVED)  CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: cr2-eqiad:xe-1/0/1:1 {#180180823000240:1} xe-1/0/1:1 gnmi eqiad)
[02:30:30] <sirenbot>	 7080 (ACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[02:30:35] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:37] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:37] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:37] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:37] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:37] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:37] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:37] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:38] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:38] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:39] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.211 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:39] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.234 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:40] <denisse>	 !ack 7081
[02:30:40] <sirenbot>	 7081 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[02:30:40] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:40] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.323 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:41] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.340 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:41] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.509 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:42] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.632 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:42] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.213 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:43] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:43] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.559 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:44] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.656 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:44] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.649 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:45] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:45] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 1.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:30:46] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:33:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:34:42] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1006.eqiad.wmnet with reason: host reimage
[02:34:51] <jinxer-wm>	 FIRING: [2x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90%  - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation
[02:35:12] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[02:35:13] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:35:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:36:06] <wikibugs>	 (03CR) 10Krinkle: "Using the example of:" [puppet] - 10https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T401489) (owner: 10Krinkle)
[02:36:46] <jinxer-wm>	 FIRING: Primary outbound port utilisation over 80%  #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:36:46] <jinxer-wm>	 FIRING: Primary inbound port utilisation over 80%  #page: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:40:14] <wikibugs>	 (03PS5) 10Krinkle: varnish: Move error message from footer to body for HTTP 4xx responses [puppet] - 10https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T401489)
[02:40:20] <wikibugs>	 (03PS2) 10Krinkle: robots.php: Clean up unused site, lang, and x-subdomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201740 (https://phabricator.wikimedia.org/T407122)
[02:41:21] <wikibugs>	 (03PS2) 10Krinkle: robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214150
[02:41:46] <jinxer-wm>	 FIRING: [3x] Primary outbound port utilisation over 80%  #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:44:51] <jinxer-wm>	 FIRING: [2x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90%  - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation
[02:45:51] <sukhe>	 !incidents
[02:45:51] <sirenbot>	 7082 (ACKED)  [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi)
[02:45:52] <sirenbot>	 7083 (ACKED)  Primary outbound port utilisation over 80%  (paged) network noc (cr1-eqiad.wikimedia.org)
[02:45:52] <sirenbot>	 7084 (ACKED)  Primary inbound port utilisation over 80%  (paged) network noc (cr1-eqiad.wikimedia.org)
[02:45:52] <sirenbot>	 7081 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule@main)
[02:45:52] <sirenbot>	 7080 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule@main)
[02:45:52] <sirenbot>	 7079 (RESOLVED)  ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[02:45:53] <sirenbot>	 7078 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) network noc (cr3-eqsin.wikimedia.org)
[02:45:53] <sirenbot>	 7077 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (cr1-codfw.wikimedia.org)
[02:45:53] <sirenbot>	 7072 (RESOLVED)  [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi)
[02:45:54] <sirenbot>	 7074 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (cr2-eqiad.wikimedia.org)
[02:45:54] <sirenbot>	 7073 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) network noc (cr1-magru.wikimedia.org)
[02:45:55] <sirenbot>	 7075 (RESOLVED)  CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: cr2-eqiad:xe-1/0/1:1 {#180180823000240:1} xe-1/0/1:1 gnmi eqiad)
[02:46:46] <jinxer-wm>	 FIRING: [3x] Primary outbound port utilisation over 80%  #page: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:46:46] <jinxer-wm>	 RESOLVED: [2x] Primary inbound port utilisation over 80%  #page: Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:49:51] <jinxer-wm>	 RESOLVED: [2x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90%  - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation
[02:50:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:51:34] <wikibugs>	 (03PS2) 10Krinkle: Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214148 (https://phabricator.wikimedia.org/T400023)
[02:51:46] <jinxer-wm>	 RESOLVED: Primary outbound port utilisation over 80%  #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:54:26] <wikibugs>	 (03PS3) 10Krinkle: Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214148 (https://phabricator.wikimedia.org/T400023)
[02:54:27] <wikibugs>	 (03PS2) 10Krinkle: robots.txt: Clean up inline comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214149
[02:54:27] <wikibugs>	 (03PS3) 10Krinkle: robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214150
[02:55:11] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:55:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:57:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201740 (https://phabricator.wikimedia.org/T407122) (owner: 10Krinkle)
[02:57:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214148 (https://phabricator.wikimedia.org/T400023) (owner: 10Krinkle)
[02:57:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214149 (owner: 10Krinkle)
[02:57:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214150 (owner: 10Krinkle)
[02:58:33] <wikibugs>	 (03Merged) 10jenkins-bot: robots.php: Clean up unused site, lang, and x-subdomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201740 (https://phabricator.wikimedia.org/T407122) (owner: 10Krinkle)
[02:58:35] <wikibugs>	 (03Merged) 10jenkins-bot: Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214148 (https://phabricator.wikimedia.org/T400023) (owner: 10Krinkle)
[02:58:38] <wikibugs>	 (03Merged) 10jenkins-bot: robots.txt: Clean up inline comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214149 (owner: 10Krinkle)
[02:58:39] <wikibugs>	 (03Merged) 10jenkins-bot: robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214150 (owner: 10Krinkle)
[02:59:36] <logmsgbot>	 !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1201740|robots.php: Clean up unused site, lang, and x-subdomain (T407122)]], [[gerrit:1214148|Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google (T400023)]], [[gerrit:1214149|robots.txt: Clean up inline comments]], [[gerrit:1214150|robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow]]
[02:59:41] <stashbot>	 T407122: [5.2.5 Milestone] Introduce API Gateway access controls on sitemap endpoints - https://phabricator.wikimedia.org/T407122
[02:59:41] <stashbot>	 T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023
[03:00:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:02:25] <logmsgbot>	 !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1201740|robots.php: Clean up unused site, lang, and x-subdomain (T407122)]], [[gerrit:1214148|Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google (T400023)]], [[gerrit:1214149|robots.txt: Clean up inline comments]], [[gerrit:1214150|robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[03:03:57] <logmsgbot>	 !log krinkle@deploy2002 krinkle: Continuing with sync
[03:06:28] <wikibugs>	 (03PS1) 10Krinkle: robots.php: Avoid "404 Not Found" for Sitemap rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214201
[03:08:02] <logmsgbot>	 !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1201740|robots.php: Clean up unused site, lang, and x-subdomain (T407122)]], [[gerrit:1214148|Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google (T400023)]], [[gerrit:1214149|robots.txt: Clean up inline comments]], [[gerrit:1214150|robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow]] (duration: 08m 26s)
[03:08:07] <stashbot>	 T407122: [5.2.5 Milestone] Introduce API Gateway access controls on sitemap endpoints - https://phabricator.wikimedia.org/T407122
[03:08:07] <stashbot>	 T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023
[03:08:36] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1006.eqiad.wmnet with OS trixie
[03:09:14] <wikibugs>	 (03PS2) 10Krinkle: robots.php: Avoid "404 Not Found" for Sitemap rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214201 (https://phabricator.wikimedia.org/T400023)
[03:09:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214201 (https://phabricator.wikimedia.org/T400023) (owner: 10Krinkle)
[03:11:00] <wikibugs>	 (03CR) 10Krinkle: [C:03+1] Clean up db groups config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211664 (https://phabricator.wikimedia.org/T411088) (owner: 10Ladsgroup)
[03:13:52] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214201 (https://phabricator.wikimedia.org/T400023) (owner: 10Krinkle)
[03:14:41] <wikibugs>	 (03Merged) 10jenkins-bot: robots.php: Avoid "404 Not Found" for Sitemap rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214201 (https://phabricator.wikimedia.org/T400023) (owner: 10Krinkle)
[03:15:12] <logmsgbot>	 !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1214201|robots.php: Avoid "404 Not Found" for Sitemap rule (T400023)]]
[03:15:16] <stashbot>	 T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023
[03:17:53] <logmsgbot>	 !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1214201|robots.php: Avoid "404 Not Found" for Sitemap rule (T400023)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[03:22:19] <logmsgbot>	 !log krinkle@deploy2002 krinkle: Continuing with sync
[03:26:20] <logmsgbot>	 !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214201|robots.php: Avoid "404 Not Found" for Sitemap rule (T400023)]] (duration: 11m 08s)
[03:26:24] <stashbot>	 T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023
[03:30:11] <jinxer-wm>	 FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:30:29] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1007.eqiad.wmnet with OS trixie
[03:46:11] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1007.eqiad.wmnet with reason: host reimage
[03:50:06] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1007.eqiad.wmnet with reason: host reimage
[04:26:16] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1007.eqiad.wmnet with OS trixie
[04:34:04] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1011.eqiad.wmnet with OS trixie
[04:40:11] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:43:04] <jinxer-wm>	 FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[04:48:04] <jinxer-wm>	 RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[04:53:22] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:54:40] <wikibugs>	 (03PS1) 10Marostegui: installserver: Add db1169 to preseed [puppet] - 10https://gerrit.wikimedia.org/r/1214220 (https://phabricator.wikimedia.org/T411498)
[04:55:05] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1011.eqiad.wmnet with reason: host reimage
[04:57:16] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Add db1169 to preseed [puppet] - 10https://gerrit.wikimedia.org/r/1214220 (https://phabricator.wikimedia.org/T411498) (owner: 10Marostegui)
[04:57:20] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[04:58:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:58:52] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86339 and previous config saved to /var/cache/conftool/dbconfig/20251203-045851-marostegui.json
[04:58:56] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[04:58:56] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[04:59:40] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1011.eqiad.wmnet with reason: host reimage
[05:01:24] <wikibugs>	 (03PS1) 10Marostegui: installserver: Change db1169 order [puppet] - 10https://gerrit.wikimedia.org/r/1214225
[05:02:05] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:02:05] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:02:43] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:02:53] <jinxer-wm>	 FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:03:52] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Change db1169 order [puppet] - 10https://gerrit.wikimedia.org/r/1214225 (owner: 10Marostegui)
[05:10:00] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:13:59] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P86340 and previous config saved to /var/cache/conftool/dbconfig/20251203-051359-marostegui.json
[05:15:03] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[05:18:00] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11426756 (10Marostegui)
[05:20:11] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:20:25] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:20:50] <wikibugs>	 (03PS1) 10Marostegui: installserver: Move new hosts to UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1214235 (https://phabricator.wikimedia.org/T411570)
[05:21:55] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:21:55] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:22:43] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Move new hosts to UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1214235 (https://phabricator.wikimedia.org/T411570) (owner: 10Marostegui)
[05:25:03] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[05:27:43] <marostegui>	 !log Drop sockpuppet database T411527
[05:27:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:46] <stashbot>	 T411527: Remove sockpuppet database - https://phabricator.wikimedia.org/T411527
[05:29:07] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P86341 and previous config saved to /var/cache/conftool/dbconfig/20251203-052906-marostegui.json
[05:29:50] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Remove sockpuppet database grants [puppet] - 10https://gerrit.wikimedia.org/r/1214237 (https://phabricator.wikimedia.org/T411527)
[05:30:28] <wikibugs>	 (03CR) 10Marostegui: "This is a NOOP as puppet does not control grants. The grants will have to be removed manually from the DB." [puppet] - 10https://gerrit.wikimedia.org/r/1214237 (https://phabricator.wikimedia.org/T411527) (owner: 10Marostegui)
[05:31:33] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[05:32:54] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Remove sockpuppet database grants [puppet] - 10https://gerrit.wikimedia.org/r/1214237 (https://phabricator.wikimedia.org/T411527) (owner: 10Marostegui)
[05:35:00] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:11] <jinxer-wm>	 FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[05:36:42] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1011.eqiad.wmnet with OS trixie
[05:40:17] <icinga-wm>	 PROBLEM - Host an-worker1148 is DOWN: PING CRITICAL - Packet loss = 100%
[05:41:19] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS trixie
[05:44:15] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86342 and previous config saved to /var/cache/conftool/dbconfig/20251203-054414-marostegui.json
[05:44:19] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[05:44:20] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[05:44:31] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2190.codfw.wmnet with reason: Maintenance
[05:44:39] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86343 and previous config saved to /var/cache/conftool/dbconfig/20251203-054438-marostegui.json
[05:52:26] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T410589)', diff saved to https://phabricator.wikimedia.org/P86344 and previous config saved to /var/cache/conftool/dbconfig/20251203-055226-ladsgroup.json
[05:52:29] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[05:58:39] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1169.eqiad.wmnet with reason: host reimage
[06:05:01] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: host reimage
[06:05:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:07:34] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P86345 and previous config saved to /var/cache/conftool/dbconfig/20251203-060734-ladsgroup.json
[06:13:29] <logmsgbot>	 ryankemper@cumin2002 reboot-workers (PID 1628404) is awaiting input
[06:15:00] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[06:15:10] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:15:17] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache
[06:15:33] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:16:33] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[06:20:44] <wikibugs>	 (03PS2) 10KartikMistry: Update rec-api to 2025-12-02-200719-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214195 (https://phabricator.wikimedia.org/T408845) (owner: 10Sbisson)
[06:22:42] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P86348 and previous config saved to /var/cache/conftool/dbconfig/20251203-062241-ladsgroup.json
[06:23:34] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11426852 (10andrea.denisse)
[06:24:13] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11426856 (10andrea.denisse) Hi  @VPuffetMichel , do you approve this request?
[06:26:32] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1169.eqiad.wmnet with OS trixie
[06:26:49] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11426858 (10andrea.denisse) 05Open→03In progress Hi @KFrancis, I was unable to find @medelius on the NDA spreadsheet, could you please help me to confirm their NDA status?
[06:27:27] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11426861 (10andrea.denisse)
[06:29:23] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1169 - Depooling db1169
[06:29:30] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1169 - Depooling db1169
[06:31:37] <wikibugs>	 (03PS1) 10Marostegui: db1169: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1214243 (https://phabricator.wikimedia.org/T411498)
[06:32:09] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1169: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1214243 (https://phabricator.wikimedia.org/T411498) (owner: 10Marostegui)
[06:35:24] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1169 gradually with 4 steps - Repooling db1169
[06:37:50] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T410589)', diff saved to https://phabricator.wikimedia.org/P86349 and previous config saved to /var/cache/conftool/dbconfig/20251203-063749-ladsgroup.json
[06:37:53] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[06:38:06] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2224.codfw.wmnet with reason: Maintenance
[06:38:13] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2224 (T410589)', diff saved to https://phabricator.wikimedia.org/P86350 and previous config saved to /var/cache/conftool/dbconfig/20251203-063812-ladsgroup.json
[06:38:37] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11426873 (10andrea.denisse) 05Open→03In progress >>! In T411436#11425341, @SEgt-WMF wrote: > In case it is useful: the MediaWiki page @Rmaung pointed...
[06:39:42] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1169 gradually with 4 steps - Repooling db1169
[06:40:44] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1169 gradually with 4 steps - Repooling db1169
[06:49:22] <wikibugs>	 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#11426882 (10Marostegui) >>! In T360029#11425775, @Scott_French wrote: > Thanks for the heads-up, @Marostegui.  Thank you for taking a look!  >...
[06:49:52] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting update of SSH key for zoe - https://phabricator.wikimedia.org/T411506#11426883 (10andrea.denisse) 05Open→03In progress I wrote to Zoe directly to confirm this request.
[06:55:11] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:56:46] <Amir1>	 !log ladsgroup@deploy2002:~$ mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 11 popups (T406724)
[06:56:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:50] <stashbot>	 T406724: Clean up watchlist and user properties of users if they don't log in for certain time - https://phabricator.wikimedia.org/T406724
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T0700)
[07:01:07] <Amir1>	 !log ladsgroup@deploy2002:~$ mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 11 rememberpassword (T406724)
[07:01:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:31] <moritzm>	 !log installing mako security updates
[07:05:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:01] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[07:17:43] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:17:53] <jinxer-wm>	 FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:23:14] <wikibugs>	 (03PS1) 10Kosta Harlan: ConfirmEdit: Grant skipcaptcha to bot user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214394 (https://phabricator.wikimedia.org/T411575)
[07:26:15] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1169 gradually with 4 steps - Repooling db1169
[07:27:43] <jinxer-wm>	 FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:30:11] <jinxer-wm>	 FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:30:45] <wikibugs>	 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#11426937 (10MoritzMuehlenhoff)
[07:31:47] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch conf2006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214396 (https://phabricator.wikimedia.org/T349619)
[07:32:02] <wikibugs>	 (03Abandoned) 10Kosta Harlan: ConfirmEdit: Grant skipcaptcha to bot user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214394 (https://phabricator.wikimedia.org/T411575) (owner: 10Kosta Harlan)
[07:32:43] <jinxer-wm>	 FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:32:53] <jinxer-wm>	 FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:35:51] <wikibugs>	 (03PS7) 10Daniel Kinzler: api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107
[07:35:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107 (owner: 10Daniel Kinzler)
[07:36:45] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:04-1] rest-gateway: extract Lua code for testability (5 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092 (owner: 10Daniel Kinzler)
[07:37:43] <jinxer-wm>	 FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:38:01] <wikibugs>	 (03PS1) 10Muehlenhoff: Add a Cumin alias to allow teams to track missing UEFI migrations [puppet] - 10https://gerrit.wikimedia.org/r/1214398
[07:39:39] <wikibugs>	 (03PS2) 10Muehlenhoff: Add a Cumin alias to allow teams to track missing UEFI migrations [puppet] - 10https://gerrit.wikimedia.org/r/1214398
[07:42:43] <jinxer-wm>	 FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:47:20] <moritzm>	 !log installing libtpms security updates
[07:47:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:43] <jinxer-wm>	 FIRING: [20x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:52:42] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for libtpms [puppet] - 10https://gerrit.wikimedia.org/r/1214401
[07:52:43] <jinxer-wm>	 FIRING: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:52:53] <jinxer-wm>	 FIRING: [16x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:56:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add library hint for libtpms [puppet] - 10https://gerrit.wikimedia.org/r/1214401 (owner: 10Muehlenhoff)
[07:57:43] <jinxer-wm>	 RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:57:53] <jinxer-wm>	 FIRING: [13x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:02:43] <jinxer-wm>	 FIRING: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[08:03:43] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Restore strict error handling [dumps] - 10https://gerrit.wikimedia.org/r/1207111 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon)
[08:04:00] <wikibugs>	 (03CR) 10Brouberol: [V:03+2] Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[08:04:11] <wikibugs>	 (03Merged) 10jenkins-bot: Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[08:04:24] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Add output-dir option to specify target directory for rdf dumps [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[08:04:33] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Report integrity metric from Wikidata dump scripts [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze)
[08:04:53] <wikibugs>	 (03Merged) 10jenkins-bot: Add output-dir option to specify target directory for rdf dumps [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon)
[08:04:55] <wikibugs>	 (03Merged) 10jenkins-bot: Report integrity metric from Wikidata dump scripts [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze)
[08:05:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:48] <logmsgbot>	 ryankemper@cumin2002 reboot-workers (PID 1628404) is awaiting input
[08:07:43] <jinxer-wm>	 RESOLVED: [7x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[08:10:43] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "I double checked the IPs discussed in T365259 and this looks good to me, thanks for preparing the patch" [dns] - 10https://gerrit.wikimedia.org/r/1214177 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn)
[08:13:11] <wikibugs>	 (03PS8) 10Daniel Kinzler: api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107
[08:13:19] <moritzm>	 !log installing python-zipp security updates
[08:13:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:49] <wikibugs>	 (03PS9) 10Daniel Kinzler: api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107
[08:16:56] <wikibugs>	 (03PS2) 10Daniel Kinzler: rest-gateway: extract Lua code for testability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092
[08:22:55] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] Add Hugh as approver for mw-log-readers and logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/1214078 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff)
[08:28:49] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1214192 (https://phabricator.wikimedia.org/T365259) (owner: 10Ssingh)
[08:32:21] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster
[08:34:13] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Add a Cumin alias to allow teams to track missing UEFI migrations [puppet] - 10https://gerrit.wikimedia.org/r/1214398 (owner: 10Muehlenhoff)
[08:37:48] <moritzm>	 !log upgrade Envoy on schema* T405808
[08:37:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:51] <stashbot>	 T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808
[08:40:11] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:44:23] <wikibugs>	 (03PS1) 10Awight: Regenerate awight yubikey [puppet] - 10https://gerrit.wikimedia.org/r/1214448
[08:58:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:00:04] <jouncebot>	 hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T0900)
[09:00:54] <logmsgbot>	 !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ganeti-test2001.codfw.wmnet with reason: test CR1207804
[09:11:41] <hashar>	 i am running the train
[09:12:07] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214452 (https://phabricator.wikimedia.org/T408275)
[09:12:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214452 (https://phabricator.wikimedia.org/T408275) (owner: 10TrainBranchBot)
[09:12:59] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214452 (https://phabricator.wikimedia.org/T408275) (owner: 10TrainBranchBot)
[09:14:11] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.provision for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[09:14:59] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[09:15:59] <wikibugs>	 (03PS1) 10Jelto: service::catalog: add gerrit-https and gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259)
[09:15:59] <wikibugs>	 (03PS1) 10Jelto: conftool-data: add tcp-proxy gerrit service [puppet] - 10https://gerrit.wikimedia.org/r/1214454 (https://phabricator.wikimedia.org/T365259)
[09:19:20] <logmsgbot>	 !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.5  refs T408275
[09:19:24] <stashbot>	 T408275: 1.46.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T408275
[09:20:11] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:35:11] <jinxer-wm>	 FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[09:44:22] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: modules: add the conftool module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214460
[09:44:22] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: modules: copy over app.generic:1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214461
[09:44:22] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: app.generic: Add conftool volumes and volume mounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214462
[09:44:23] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: charts/python-webapp: add conftool support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214463
[09:45:13] <wikibugs>	 (03CR) 10JMeybohm: "LGTM, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214125 (owner: 10Clément Goubert)
[09:54:35] <wikibugs>	 (03CR) 10MonAx the Developer: "Limit thanks for new users at uk.wikipedia to 3 per hour" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214452 (https://phabricator.wikimedia.org/T408275) (owner: 10TrainBranchBot)
[09:59:36] <wikibugs>	 (03PS1) 10Fabfur: external_clouds_vendors: added ahrefsbot [puppet] - 10https://gerrit.wikimedia.org/r/1214465
[10:03:37] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] mariadb: Remove sockpuppet database grants [puppet] - 10https://gerrit.wikimedia.org/r/1214237 (https://phabricator.wikimedia.org/T411527) (owner: 10Marostegui)
[10:06:05] <wikibugs>	 (03PS17) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928
[10:07:52] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Broadcom Nic not supporting uefi with older firmware - https://phabricator.wikimedia.org/T411374#11427333 (10ayounsi) I was looking into that for the LLDP issue, here are some Redfish path that could be useful in that context :  ` >>> spicerack.redfish('ganeti-test2001').mo...
[10:08:52] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Broadcom Nic not supporting uefi with older firmware - https://phabricator.wikimedia.org/T411374#11427337 (10Peachey88)
[10:10:07] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] external_clouds_vendors: added ahrefsbot (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214465 (owner: 10Fabfur)
[10:10:07] <wikibugs>	 (03CR) 10Blake: [C:03+2] alerting: Update severity of KafkaRollingRestartRequired to Task. [alerts] - 10https://gerrit.wikimedia.org/r/1212599 (https://phabricator.wikimedia.org/T410552) (owner: 10Blake)
[10:11:32] <wikibugs>	 (03PS4) 10Dpogorzelski: ml-build: define new machine name/type [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778)
[10:11:49] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:04-2] "Please let's start using https://requestctl.wikimedia.org/ipblock_source for this stuff." [puppet] - 10https://gerrit.wikimedia.org/r/1214465 (owner: 10Fabfur)
[10:12:17] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptchaEditAttempt logging: Normalize line endings [extensions/WikimediaEvents] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214469 (https://phabricator.wikimedia.org/T411578)
[10:12:25] <wikibugs>	 (03PS5) 10Dpogorzelski: ml-build: define new machine name/type [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778)
[10:12:30] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptchaEditAttempt logging: Normalize line endings [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214470 (https://phabricator.wikimedia.org/T411578)
[10:13:14] <kostajh>	 hashar: can I sync a patch? seems like the train deploy finished a while ago
[10:14:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ml-build: define new machine name/type [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) (owner: 10Dpogorzelski)
[10:17:45] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 (owner: 10Elukey)
[10:17:57] <wikibugs>	 (03PS3) 10Elukey: sre.hosts.provision: make UEFI default [cookbooks] - 10https://gerrit.wikimedia.org/r/1214055
[10:20:18] <wikibugs>	 (03Abandoned) 10Fabfur: external_clouds_vendors: added ahrefsbot [puppet] - 10https://gerrit.wikimedia.org/r/1214465 (owner: 10Fabfur)
[10:24:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214469 (https://phabricator.wikimedia.org/T411578) (owner: 10Kosta Harlan)
[10:24:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214470 (https://phabricator.wikimedia.org/T411578) (owner: 10Kosta Harlan)
[10:25:57] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptchaEditAttempt logging: Normalize line endings [extensions/WikimediaEvents] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214469 (https://phabricator.wikimedia.org/T411578) (owner: 10Kosta Harlan)
[10:26:17] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptchaEditAttempt logging: Normalize line endings [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214470 (https://phabricator.wikimedia.org/T411578) (owner: 10Kosta Harlan)
[10:27:03] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1214469|hCaptchaEditAttempt logging: Normalize line endings (T411578)]], [[gerrit:1214470|hCaptchaEditAttempt logging: Normalize line endings (T411578)]]
[10:27:06] <stashbot>	 T411578: hCaptcha edit attempt logs: Normalize line endings - https://phabricator.wikimedia.org/T411578
[10:29:08] <wikibugs>	 (03CR) 10Elukey: [C:03+2] sre.hosts.provision: make UEFI default [cookbooks] - 10https://gerrit.wikimedia.org/r/1214055 (owner: 10Elukey)
[10:29:22] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1214469|hCaptchaEditAttempt logging: Normalize line endings (T411578)]], [[gerrit:1214470|hCaptchaEditAttempt logging: Normalize line endings (T411578)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[10:30:55] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[10:33:23] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86357 and previous config saved to /var/cache/conftool/dbconfig/20251203-103323-marostegui.json
[10:33:29] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[10:33:29] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[10:34:59] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214469|hCaptchaEditAttempt logging: Normalize line endings (T411578)]], [[gerrit:1214470|hCaptchaEditAttempt logging: Normalize line endings (T411578)]] (duration: 07m 56s)
[10:35:02] <stashbot>	 T411578: hCaptcha edit attempt logs: Normalize line endings - https://phabricator.wikimedia.org/T411578
[10:36:30] <wikibugs>	 (03PS1) 10Filippo Giunchedi: service::catalog: add 'team' attribute [puppet] - 10https://gerrit.wikimedia.org/r/1214473 (https://phabricator.wikimedia.org/T399807)
[10:38:33] <wikibugs>	 (03PS2) 10Majavah: network: Remove unused cloud_nova_hosts_ranges variable [puppet] - 10https://gerrit.wikimedia.org/r/1214099
[10:38:33] <wikibugs>	 (03PS1) 10Majavah: O:wmcs::cloudvps_meta: Basic role + web server skeleton [puppet] - 10https://gerrit.wikimedia.org/r/1214474 (https://phabricator.wikimedia.org/T411590)
[10:38:35] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::cloudvps_meta: Publish Cloud VPS JSON files [puppet] - 10https://gerrit.wikimedia.org/r/1214475 (https://phabricator.wikimedia.org/T411590)
[10:39:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:wmcs::cloudvps_meta: Publish Cloud VPS JSON files [puppet] - 10https://gerrit.wikimedia.org/r/1214475 (https://phabricator.wikimedia.org/T411590) (owner: 10Majavah)
[10:40:44] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: eqiad: cleanup Interface enabled but not connected alert - https://phabricator.wikimedia.org/T411194#11427434 (10ayounsi) 05Resolved→03Open Thanks ! Those are still alerting in eqiad :  ge-0/0/0 /dcim/interfaces/37836/ Interface enabled but not connected on fa...
[10:41:46] <wikibugs>	 (03PS2) 10Majavah: P:wmcs::cloudvps_meta: Publish Cloud VPS JSON files [puppet] - 10https://gerrit.wikimedia.org/r/1214475 (https://phabricator.wikimedia.org/T411590)
[10:48:08] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:48:31] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P86358 and previous config saved to /var/cache/conftool/dbconfig/20251203-104830-marostegui.json
[10:49:38] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: multi-team ProbeDown [alerts] - 10https://gerrit.wikimedia.org/r/1214478 (https://phabricator.wikimedia.org/T399807)
[10:53:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] P:wmcs::cloudvps_meta: Publish Cloud VPS JSON files [puppet] - 10https://gerrit.wikimedia.org/r/1214475 (https://phabricator.wikimedia.org/T411590) (owner: 10Majavah)
[10:53:13] <wikibugs>	 (03PS2) 10Esanders: Set Flow to read-only everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213501 (https://phabricator.wikimedia.org/T402552)
[10:53:36] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.powercycle for host sretest2001
[10:53:38] <wikibugs>	 (03PS3) 10Esanders: Set Flow to read-only everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213501 (https://phabricator.wikimedia.org/T402552)
[10:54:11] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213501 (https://phabricator.wikimedia.org/T402552) (owner: 10Esanders)
[10:55:11] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:58:37] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host sretest2001
[10:59:31] <wikibugs>	 (03PS2) 10Majavah: O:wmcs::cloudvps_meta: Basic role + web server skeleton [puppet] - 10https://gerrit.wikimedia.org/r/1214474 (https://phabricator.wikimedia.org/T411590)
[10:59:31] <wikibugs>	 (03PS3) 10Majavah: P:wmcs::cloudvps_meta: Publish Cloud VPS JSON files [puppet] - 10https://gerrit.wikimedia.org/r/1214475 (https://phabricator.wikimedia.org/T411590)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1100)
[11:03:39] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P86359 and previous config saved to /var/cache/conftool/dbconfig/20251203-110338-marostegui.json
[11:06:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] O:wmcs::cloudvps_meta: Basic role + web server skeleton [puppet] - 10https://gerrit.wikimedia.org/r/1214474 (https://phabricator.wikimedia.org/T411590) (owner: 10Majavah)
[11:06:53] <wikibugs>	 (03CR) 10Majavah: [C:03+2] O:wmcs::cloudvps_meta: Basic role + web server skeleton [puppet] - 10https://gerrit.wikimedia.org/r/1214474 (https://phabricator.wikimedia.org/T411590) (owner: 10Majavah)
[11:07:06] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:wmcs::cloudvps_meta: Publish Cloud VPS JSON files [puppet] - 10https://gerrit.wikimedia.org/r/1214475 (https://phabricator.wikimedia.org/T411590) (owner: 10Majavah)
[11:07:45] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.powercycle for host ml-serve1013
[11:08:09] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06serviceops-radar, 10Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088#11427590 (10JMonton-WMF) Another option: I worked in the past with https://github.com/devshawn/kafka-gitops to manage all topic set...
[11:10:20] <icinga-wm>	 PROBLEM - Host ml-serve1013 is DOWN: PING CRITICAL - Packet loss = 100%
[11:10:58] <icinga-wm>	 RECOVERY - Host ml-serve1013 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms
[11:12:05] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::cloudvps_meta: Set creationTime [puppet] - 10https://gerrit.wikimedia.org/r/1214481 (https://phabricator.wikimedia.org/T411590)
[11:12:48] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host ml-serve1013
[11:13:56] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:wmcs::cloudvps_meta: Set creationTime [puppet] - 10https://gerrit.wikimedia.org/r/1214481 (https://phabricator.wikimedia.org/T411590) (owner: 10Majavah)
[11:15:08] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[11:15:20] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[11:18:46] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86360 and previous config saved to /var/cache/conftool/dbconfig/20251203-111846-marostegui.json
[11:18:51] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[11:18:51] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[11:19:03] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2194.codfw.wmnet with reason: Maintenance
[11:19:11] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86361 and previous config saved to /var/cache/conftool/dbconfig/20251203-111910-marostegui.json
[11:23:46] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86362 and previous config saved to /var/cache/conftool/dbconfig/20251203-112345-marostegui.json
[11:30:11] <jinxer-wm>	 FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:31:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[11:32:14] <wikibugs>	 (03Abandoned) 10Awight: Monitoring for WMDE dumps scraper [puppet] - 10https://gerrit.wikimedia.org/r/1207174 (https://phabricator.wikimedia.org/T402613) (owner: 10Awight)
[11:32:18] <wikibugs>	 (03CR) 10Awight: "Thanks for the tip, that's exactly what we'll do!  Pushgateway is also helpful for caching results after the run is complete." [puppet] - 10https://gerrit.wikimedia.org/r/1207174 (https://phabricator.wikimedia.org/T402613) (owner: 10Awight)
[11:34:17] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:35:41] <wikibugs>	 (03PS1) 10Daniel Kinzler: rest gateway: simplify Lua code for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214486
[11:38:02] <jinxer-wm>	 FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:38:54] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P86363 and previous config saved to /var/cache/conftool/dbconfig/20251203-113853-marostegui.json
[11:39:17] <jinxer-wm>	 FIRING: [19x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:40:42] <wikibugs>	 (03PS10) 10Daniel Kinzler: api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107
[11:40:59] <wikibugs>	 (03PS3) 10Daniel Kinzler: rest-gateway: extract Lua code for testability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092
[11:41:11] <wikibugs>	 (03PS4) 10Daniel Kinzler: rest-gateway: extract Lua code for testability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092
[11:41:22] <wikibugs>	 (03PS2) 10Daniel Kinzler: rest gateway: simplify Lua code for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214486
[11:41:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[11:42:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[11:43:02] <jinxer-wm>	 FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:44:17] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:45:53] <wikibugs>	 (03PS1) 10Elukey: sre.hosts.reimage: remove puppet 5 support and default to 7 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488
[11:46:58] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[11:48:08] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:48:42] <wikibugs>	 (03PS1) 10Tchanders: WIP Enable temporary accounts on enwikinews and ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214489
[11:50:31] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107 (owner: 10Daniel Kinzler)
[11:50:58] <wikibugs>	 (03PS2) 10Elukey: sre.hosts.reimage: remove puppet 5 support and default to 7 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219)
[11:51:09] <wikibugs>	 (03PS1) 10Majavah: openstack: puppet: Remove support for X-Enc-Edit-Git [puppet] - 10https://gerrit.wikimedia.org/r/1214490
[11:51:09] <wikibugs>	 (03PS1) 10Majavah: openstack: puppet: Do not commit empty role fiels [puppet] - 10https://gerrit.wikimedia.org/r/1214491
[11:54:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P86364 and previous config saved to /var/cache/conftool/dbconfig/20251203-115401-marostegui.json
[11:55:00] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:55:44] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] rest-gateway: extract Lua code for testability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092 (owner: 10Daniel Kinzler)
[11:58:02] <jinxer-wm>	 FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[11:59:12] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] rest gateway: simplify Lua code for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214486 (owner: 10Daniel Kinzler)
[11:59:17] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:00:05] <jouncebot>	 mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1200).
[12:03:08] <wikibugs>	 (03PS1) 10Btullis: Add kerberos related configuration to the spark-defaults.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214492 (https://phabricator.wikimedia.org/T406833)
[12:04:17] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:05:26] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add kerberos related configuration to the spark-defaults.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214492 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis)
[12:05:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:05:51] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE) - https://phabricator.wikimedia.org/T411612 (10thiemowmde) 03NEW
[12:06:50] <wikibugs>	 (03Merged) 10jenkins-bot: Add kerberos related configuration to the spark-defaults.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214492 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis)
[12:08:02] <jinxer-wm>	 FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[12:08:05] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE) - https://phabricator.wikimedia.org/T411612#11427904 (10thiemowmde)
[12:08:23] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE) - https://phabricator.wikimedia.org/T411612#11427905 (10Tobi_WMDE_SW) As the Engineering Manager of the team Thiemo works on, I support this request.
[12:09:00] <wikibugs>	 (03PS1) 10Majavah: P:toolforge: static: Publish worker IPs as a JSON file [puppet] - 10https://gerrit.wikimedia.org/r/1214493 (https://phabricator.wikimedia.org/T411610)
[12:09:09] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86365 and previous config saved to /var/cache/conftool/dbconfig/20251203-120909-marostegui.json
[12:09:14] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[12:09:14] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[12:09:17] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:09:25] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2209.codfw.wmnet with reason: Maintenance
[12:09:32] <wikibugs>	 (03PS1) 10DDesouza: Increase coverage of 2025 Global Readers Survey (non-enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214494 (https://phabricator.wikimedia.org/T410918)
[12:09:33] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2209 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86366 and previous config saved to /var/cache/conftool/dbconfig/20251203-120933-marostegui.json
[12:10:05] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7777/co" [puppet] - 10https://gerrit.wikimedia.org/r/1214493 (https://phabricator.wikimedia.org/T411610) (owner: 10Majavah)
[12:10:06] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] C:openldap extend wikimediaPerson schema for Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/1197617 (https://phabricator.wikimedia.org/T406495) (owner: 10Slyngshede)
[12:10:22] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] C:openldap extend wikimediaPerson schema for Phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1197617 (https://phabricator.wikimedia.org/T406495) (owner: 10Slyngshede)
[12:10:45] <wikibugs>	 (03PS3) 10Daniel Kinzler: rest gateway: simplify Lua code for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214486
[12:11:34] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107 (owner: 10Daniel Kinzler)
[12:11:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove mediawiki-testers group [puppet] - 10https://gerrit.wikimedia.org/r/1214110 (owner: 10Muehlenhoff)
[12:11:38] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: extract Lua code for testability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092 (owner: 10Daniel Kinzler)
[12:12:03] <wikibugs>	 (03PS2) 10Majavah: P:toolforge: static: Publish worker IPs as a JSON file [puppet] - 10https://gerrit.wikimedia.org/r/1214493 (https://phabricator.wikimedia.org/T411610)
[12:12:58] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7778/co" [puppet] - 10https://gerrit.wikimedia.org/r/1214493 (https://phabricator.wikimedia.org/T411610) (owner: 10Majavah)
[12:13:07] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] rest gateway: simplify Lua code for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214486 (owner: 10Daniel Kinzler)
[12:13:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1214169 (owner: 10JHathaway)
[12:13:20] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: simplify Lua code for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214486 (owner: 10Daniel Kinzler)
[12:13:33] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107 (owner: 10Daniel Kinzler)
[12:13:34] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: extract Lua code for testability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092 (owner: 10Daniel Kinzler)
[12:14:09] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86367 and previous config saved to /var/cache/conftool/dbconfig/20251203-121409-marostegui.json
[12:14:14] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[12:14:15] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[12:14:17] <jinxer-wm>	 FIRING: [17x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:14:37] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/1214473 (https://phabricator.wikimedia.org/T399807) (owner: 10Filippo Giunchedi)
[12:15:21] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: simplify Lua code for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214486 (owner: 10Daniel Kinzler)
[12:15:37] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" [alerts] - 10https://gerrit.wikimedia.org/r/1214478 (https://phabricator.wikimedia.org/T399807) (owner: 10Filippo Giunchedi)
[12:17:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C:04-1] "This looks good, but marking as -1 until the preconditions are resolved (puppet base classes, d-i, etc)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey)
[12:17:47] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[12:18:02] <jinxer-wm>	 FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[12:18:34] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[12:19:17] <jinxer-wm>	 FIRING: [19x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:19:30] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[12:19:37] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[12:19:45] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[12:20:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add a Cumin alias to allow teams to track missing UEFI migrations [puppet] - 10https://gerrit.wikimedia.org/r/1214398 (owner: 10Muehlenhoff)
[12:20:33] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/1208365 (https://phabricator.wikimedia.org/T410745) (owner: 10Tiziano Fogli)
[12:20:38] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[12:23:02] <jinxer-wm>	 FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[12:24:17] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:26:00] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[12:26:00] <wikibugs>	 (03CR) 10FNegri: [C:03+1] P:toolforge: static: Publish worker IPs as a JSON file [puppet] - 10https://gerrit.wikimedia.org/r/1214493 (https://phabricator.wikimedia.org/T411610) (owner: 10Majavah)
[12:26:19] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: static: Publish worker IPs as a JSON file [puppet] - 10https://gerrit.wikimedia.org/r/1214493 (https://phabricator.wikimedia.org/T411610) (owner: 10Majavah)
[12:26:40] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[12:28:02] <jinxer-wm>	 FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[12:29:13] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T410589)', diff saved to https://phabricator.wikimedia.org/P86368 and previous config saved to /var/cache/conftool/dbconfig/20251203-122912-ladsgroup.json
[12:29:16] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[12:29:17] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:29:23] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P86369 and previous config saved to /var/cache/conftool/dbconfig/20251203-122923-marostegui.json
[12:30:08] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[12:30:35] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[12:32:00] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: GDNSd discovery records: balance requests from POPs across core sites - https://phabricator.wikimedia.org/T411617 (10cmooney) 03NEW p:05Triage→03Medium
[12:32:14] <claime>	 !log Restarting failed timer dump_cloud_ip_ranges on puppetservers
[12:32:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:51] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: GDNSd discovery records: balance requests from POPs across core sites - https://phabricator.wikimedia.org/T411617#11428021 (10cmooney)
[12:34:17] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:35:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:37:20] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:grafana: Default to UTC timezone [puppet] - 10https://gerrit.wikimedia.org/r/1213506 (https://phabricator.wikimedia.org/T411274) (owner: 10Majavah)
[12:38:02] <jinxer-wm>	 RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[12:38:42] <wikibugs>	 (03PS2) 10Tchanders: WIP Enable temporary accounts on enwikinews and ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214489 (https://phabricator.wikimedia.org/T411618)
[12:39:17] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:40:11] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:40:29] <wikibugs>	 (03PS3) 10Daniel Kinzler: api gateway: add CDN headers to access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212560
[12:40:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] api gateway: add CDN headers to access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212560 (owner: 10Daniel Kinzler)
[12:41:32] <wikibugs>	 (03PS4) 10Daniel Kinzler: api gateway: add CDN headers to access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212560
[12:44:17] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:44:20] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P86370 and previous config saved to /var/cache/conftool/dbconfig/20251203-124419-ladsgroup.json
[12:44:31] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P86371 and previous config saved to /var/cache/conftool/dbconfig/20251203-124430-marostegui.json
[12:45:39] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "That will add this information to the api-gateway's logs as well (which is fine). I can deploy the api-gateway, since it has quite a few u" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212560 (owner: 10Daniel Kinzler)
[12:46:28] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] api gateway: add CDN headers to access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212560 (owner: 10Daniel Kinzler)
[12:48:19] <wikibugs>	 (03Merged) 10jenkins-bot: api gateway: add CDN headers to access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212560 (owner: 10Daniel Kinzler)
[12:49:17] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:49:34] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:49:52] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[12:50:16] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[12:50:23] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[12:50:28] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[12:51:25] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[12:51:52] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[12:52:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[12:52:24] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11428100 (10MoritzMuehlenhoff)
[12:52:33] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[12:53:56] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[12:54:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[12:54:17] <jinxer-wm>	 FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:55:54] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[12:56:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[12:56:28] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[12:57:01] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[12:58:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:59:17] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:59:28] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P86372 and previous config saved to /var/cache/conftool/dbconfig/20251203-125927-ladsgroup.json
[12:59:39] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86373 and previous config saved to /var/cache/conftool/dbconfig/20251203-125938-marostegui.json
[12:59:43] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[12:59:44] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[12:59:55] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2227.codfw.wmnet with reason: Maintenance
[13:00:02] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86374 and previous config saved to /var/cache/conftool/dbconfig/20251203-130002-marostegui.json
[13:00:11] <wikibugs>	 (03PS2) 10Muehlenhoff: pcc: Drop obsolete OS conditional [puppet] - 10https://gerrit.wikimedia.org/r/1126104
[13:00:31] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[13:00:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[13:00:49] <wikibugs>	 (03PS1) 10Jelto: devtools hiera: set gitlab::runner::docker set MTU to 1450 [puppet] - 10https://gerrit.wikimedia.org/r/1214507 (https://phabricator.wikimedia.org/T405742)
[13:01:21] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[13:02:06] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:02:06] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:03:02] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[13:03:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:04:17] <jinxer-wm>	 FIRING: [14x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:04:38] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86375 and previous config saved to /var/cache/conftool/dbconfig/20251203-130437-marostegui.json
[13:06:03] <wikibugs>	 (03PS3) 10KartikMistry: Update rec-api to 2025-12-02-200719-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214195 (https://phabricator.wikimedia.org/T408845) (owner: 10Sbisson)
[13:06:57] <wikibugs>	 (03PS1) 10Majavah: interface::rule: Use wmflib::ip2cidr [puppet] - 10https://gerrit.wikimedia.org/r/1214508
[13:06:57] <wikibugs>	 (03PS1) 10Majavah: P:kubernetes: deployment_server: Use wmflib::ip2cidr [puppet] - 10https://gerrit.wikimedia.org/r/1214509
[13:08:02] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[13:08:03] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Riku Silvola - https://phabricator.wikimedia.org/T411624 (10Rsilvola) 03NEW
[13:08:24] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:09:17] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:10:00] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7780/console" [puppet] - 10https://gerrit.wikimedia.org/r/1214509 (owner: 10Majavah)
[13:10:00] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:11:21] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7779/console" [puppet] - 10https://gerrit.wikimedia.org/r/1214508 (owner: 10Majavah)
[13:11:56] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:11:56] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:13:02] <jinxer-wm>	 FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[13:14:17] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:14:36] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T410589)', diff saved to https://phabricator.wikimedia.org/P86376 and previous config saved to /var/cache/conftool/dbconfig/20251203-131435-ladsgroup.json
[13:14:39] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[13:14:40] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2229.codfw.wmnet with reason: Maintenance
[13:14:48] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2229 (T410589)', diff saved to https://phabricator.wikimedia.org/P86377 and previous config saved to /var/cache/conftool/dbconfig/20251203-131448-ladsgroup.json
[13:16:32] <wikibugs>	 (03PS4) 10Arnaudb: gerrit: rsync logic extraction from failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1214466 (https://phabricator.wikimedia.org/T387833)
[13:16:32] <wikibugs>	 (03CR) 10Arnaudb: "This change will allow running the file transfer logic between gerrit instances. It will also simplify double checking transfer between ho" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214466 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[13:16:50] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Update rec-api to 2025-12-02-200719-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214195 (https://phabricator.wikimedia.org/T408845) (owner: 10Sbisson)
[13:18:02] <jinxer-wm>	 RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[13:18:29] <wikibugs>	 (03CR) 10Majavah: [C:04-1] "You don't need this anymore, we fixed the network MTU instead" [puppet] - 10https://gerrit.wikimedia.org/r/1214507 (https://phabricator.wikimedia.org/T405742) (owner: 10Jelto)
[13:18:32] <wikibugs>	 (03Merged) 10jenkins-bot: Update rec-api to 2025-12-02-200719-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214195 (https://phabricator.wikimedia.org/T408845) (owner: 10Sbisson)
[13:19:17] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:19:45] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P86378 and previous config saved to /var/cache/conftool/dbconfig/20251203-131945-marostegui.json
[13:20:11] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:20:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[13:22:42] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:24:47] <wikibugs>	 (03CR) 10Jelto: "that's great news! thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1214507 (https://phabricator.wikimedia.org/T405742) (owner: 10Jelto)
[13:24:53] <wikibugs>	 (03Abandoned) 10Jelto: devtools hiera: set gitlab::runner::docker set MTU to 1450 [puppet] - 10https://gerrit.wikimedia.org/r/1214507 (https://phabricator.wikimedia.org/T405742) (owner: 10Jelto)
[13:25:20] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:28:52] <wikibugs>	 (03PS1) 10Jelto: gitlab-runners hiera: remove custom MTU [puppet] - 10https://gerrit.wikimedia.org/r/1214513 (https://phabricator.wikimedia.org/T405742)
[13:30:09] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:31:02] <wikibugs>	 (03CR) 10Majavah: [C:03+1] gitlab-runners hiera: remove custom MTU [puppet] - 10https://gerrit.wikimedia.org/r/1214513 (https://phabricator.wikimedia.org/T405742) (owner: 10Jelto)
[13:32:14] <kart_>	 !log Updated Recommendation API to 2025-12-02-200719-production (T408845, T408844, T384485)
[13:32:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:21] <stashbot>	 T408845: Visual indicator that an article in a list is part of a nominated collection - https://phabricator.wikimedia.org/T408845
[13:32:21] <stashbot>	 T408844: Inform that an article is part of a nominated collection on Confirmation view - https://phabricator.wikimedia.org/T408844
[13:32:21] <stashbot>	 T384485: Recommendation API: Support pagination for single page collection recommendations - https://phabricator.wikimedia.org/T384485
[13:34:53] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P86379 and previous config saved to /var/cache/conftool/dbconfig/20251203-133452-marostegui.json
[13:35:11] <jinxer-wm>	 FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[13:35:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:45:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:45:57] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gitlab-runners hiera: remove custom MTU [puppet] - 10https://gerrit.wikimedia.org/r/1214513 (https://phabricator.wikimedia.org/T405742) (owner: 10Jelto)
[13:46:49] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-11-14-022545 / 2025-11-17-175029 to 2025-12-03-005631 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214517 (https://phabricator.wikimedia.org/T410605)
[13:46:53] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-11-26-175208 to 2025-12-02-224740 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214518 (https://phabricator.wikimedia.org/T411336)
[13:50:01] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86380 and previous config saved to /var/cache/conftool/dbconfig/20251203-135000-marostegui.json
[13:50:05] <stashbot>	 T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[13:50:06] <stashbot>	 T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[13:50:18] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2239.codfw.wmnet with reason: Maintenance
[13:51:34] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:56:24] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm, one comment in line. It might make sense to create a more generic alert somewhere in https://gerrit.wikimedia.org/r/plugins/gitiles" [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth)
[13:58:20] <wikibugs>	 06SRE, 06collaboration-services, 10vrts, 10Znuny, and 2 others: No space left on device on VRTS host - https://phabricator.wikimedia.org/T411452#11428300 (10Jelto) The Inode usage grew from 2% to 10% already in the past day: https://grafana.wikimedia.org/d/000000371/vrts?orgId=1&from=now-2d&to=now&timezone...
[13:58:28] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "The Inode usage grew from 2% to 10% already in the past day: https://grafana.wikimedia.org/d/000000371/vrts?orgId=1&from=now-2d&to=now&tim" [puppet] - 10https://gerrit.wikimedia.org/r/1214129 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth)
[13:58:36] <wikibugs>	 (03PS14) 10Arnaudb: gerrit: rsync logic extraction from failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1214466 (https://phabricator.wikimedia.org/T387833)
[13:58:36] <wikibugs>	 (03CR) 10Arnaudb: "see https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1214466/comments/51d75323_218ff349" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214466 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1400).
[14:00:05] <jouncebot>	 stephanebisson and edsanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:18] <stephanebisson>	 o/
[14:00:43] <Lucas_WMDE>	 o/
[14:00:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214036 (https://phabricator.wikimedia.org/T408842) (owner: 10Sbisson)
[14:00:51] <stephanebisson>	 I can start
[14:00:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:00:58] <edsanders>	 o/
[14:01:01] <Lucas_WMDE>	 ok!
[14:02:06] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:02:06] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:05:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:09:24] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:11:56] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:11:56] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:13:57] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: kubernetes::deployment_server: add files for configuring conftool [puppet] - 10https://gerrit.wikimedia.org/r/1214524
[14:14:13] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1191 - https://phabricator.wikimedia.org/T411209#11428392 (10Jclark-ctr) @BTullis   both drives have been replaced
[14:14:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] kubernetes::deployment_server: add files for configuring conftool [puppet] - 10https://gerrit.wikimedia.org/r/1214524 (owner: 10Giuseppe Lavagetto)
[14:14:33] <wikibugs>	 (03Merged) 10jenkins-bot: CX3 Build 1.0.0+20251201 [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214036 (https://phabricator.wikimedia.org/T408842) (owner: 10Sbisson)
[14:15:05] <logmsgbot>	 !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1214036|CX3 Build 1.0.0+20251201 (T408842 T408844)]]
[14:15:10] <stashbot>	 T408842: Surface nominated collections in Search view - https://phabricator.wikimedia.org/T408842
[14:15:11] <stashbot>	 T408844: Inform that an article is part of a nominated collection on Confirmation view - https://phabricator.wikimedia.org/T408844
[14:16:37] <wikibugs>	 (03CR) 10FNegri: [C:03+1] interface::rule: Use wmflib::ip2cidr [puppet] - 10https://gerrit.wikimedia.org/r/1214508 (owner: 10Majavah)
[14:16:44] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] interface::rule: Use wmflib::ip2cidr [puppet] - 10https://gerrit.wikimedia.org/r/1214508 (owner: 10Majavah)
[14:17:16] <logmsgbot>	 !log sbisson@deploy2002 sbisson: Backport for [[gerrit:1214036|CX3 Build 1.0.0+20251201 (T408842 T408844)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:21:56] <logmsgbot>	 !log sbisson@deploy2002 sbisson: Continuing with sync
[14:25:06] <wikibugs>	 (03PS1) 10Elukey: services: add maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214526 (https://phabricator.wikimedia.org/T409528)
[14:27:06] <logmsgbot>	 !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214036|CX3 Build 1.0.0+20251201 (T408842 T408844)]] (duration: 12m 01s)
[14:27:10] <stashbot>	 T408842: Surface nominated collections in Search view - https://phabricator.wikimedia.org/T408842
[14:27:11] <stashbot>	 T408844: Inform that an article is part of a nominated collection on Confirmation view - https://phabricator.wikimedia.org/T408844
[14:27:27] <XioNoX>	 !log push pfw policies - T411566
[14:27:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213501 (https://phabricator.wikimedia.org/T402552) (owner: 10Esanders)
[14:28:36] <wikibugs>	 (03Merged) 10jenkins-bot: Set Flow to read-only everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213501 (https://phabricator.wikimedia.org/T402552) (owner: 10Esanders)
[14:29:09] <logmsgbot>	 !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1213501|Set Flow to read-only everywhere (T402552)]]
[14:29:12] <stashbot>	 T402552: ptwikibooks: Migrate Flow boards to archival subpages - https://phabricator.wikimedia.org/T402552
[14:31:25] <logmsgbot>	 !log esanders@deploy2002 esanders: Backport for [[gerrit:1213501|Set Flow to read-only everywhere (T402552)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:31:36] <wikibugs>	 (03CR) 10Ayounsi: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 (owner: 10Ayounsi)
[14:32:04] <wikibugs>	 (03CR) 10Majavah: firewall: Use virtual resources to fix ordering issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah)
[14:32:19] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Riku Silvola - https://phabricator.wikimedia.org/T411624#11428437 (10IBerker-WMF) I approve.
[14:33:41] <logmsgbot>	 !log esanders@deploy2002 esanders: Continuing with sync
[14:34:58] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164178 (owner: 10Esanders)
[14:35:22] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212161 (owner: 10Esanders)
[14:38:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.network.tls: add timeout to get_server_certificate [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 (owner: 10Ayounsi)
[14:38:53] <logmsgbot>	 !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1213501|Set Flow to read-only everywhere (T402552)]] (duration: 09m 44s)
[14:38:57] <stashbot>	 T402552: ptwikibooks: Migrate Flow boards to archival subpages - https://phabricator.wikimedia.org/T402552
[14:39:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 21.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:40:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212161 (owner: 10Esanders)
[14:40:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164178 (owner: 10Esanders)
[14:41:17] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:41:47] <wikibugs>	 (03Merged) 10jenkins-bot: DiscussionTools: cleanup unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212161 (owner: 10Esanders)
[14:41:50] <wikibugs>	 (03Merged) 10jenkins-bot: Remove wgVisualEditorEditCheckSingleCheckMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164178 (owner: 10Esanders)
[14:42:19] <logmsgbot>	 !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1212161|DiscussionTools: cleanup unused config]], [[gerrit:1164178|Remove wgVisualEditorEditCheckSingleCheckMode]]
[14:44:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 21.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:44:38] <logmsgbot>	 !log esanders@deploy2002 esanders: Backport for [[gerrit:1212161|DiscussionTools: cleanup unused config]], [[gerrit:1164178|Remove wgVisualEditorEditCheckSingleCheckMode]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:45:04] <logmsgbot>	 !log esanders@deploy2002 esanders: Continuing with sync
[14:46:17] <jinxer-wm>	 RESOLVED: ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:48:17] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:49:03] <logmsgbot>	 !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1212161|DiscussionTools: cleanup unused config]], [[gerrit:1164178|Remove wgVisualEditorEditCheckSingleCheckMode]] (duration: 06m 44s)
[14:50:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good feature-wise. We might need some fine-tuning for the dialogue in terms of clarity for the user, but we can figure that out when" [software/bitu] - 10https://gerrit.wikimedia.org/r/1196919 (https://phabricator.wikimedia.org/T406495) (owner: 10Slyngshede)
[14:51:59] <wikibugs>	 (03CR) 10Reedy: "Seems to have caused T411632" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214201 (https://phabricator.wikimedia.org/T400023) (owner: 10Krinkle)
[14:53:17] <jinxer-wm>	 RESOLVED: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:54:46] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:54:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:00] <jinxer-wm>	 FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:55:11] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:55:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:56:22] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:58:41] <wikibugs>	 (03PS1) 10Krinkle: robots.php: Fix undefined index 'enabled' on Wikinews and closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214529 (https://phabricator.wikimedia.org/T411632)
[14:58:47] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] admin: add fido backed ssh keys for jhathaway [puppet] - 10https://gerrit.wikimedia.org/r/1214169 (owner: 10JHathaway)
[15:00:00] <jinxer-wm>	 RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:00:03] <robh>	 !log alert1002 port migration now starting
[15:00:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1500)
[15:00:16] <wikibugs>	 (03PS6) 10Dpogorzelski: ml-build: define new machine name/type [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778)
[15:00:17] <wikibugs>	 (03PS1) 10Slyngshede: C:mtail backend requests ttfb [puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584)
[15:00:30] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on alert1002.wikimedia.org with reason: C/D Migration
[15:00:58] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] "Oops. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214529 (https://phabricator.wikimedia.org/T411632) (owner: 10Krinkle)
[15:02:32] <wikibugs>	 (03PS2) 10Klausman: installserver/partman: Add custom recipe for ml-build1001 [puppet] - 10https://gerrit.wikimedia.org/r/1214530 (https://phabricator.wikimedia.org/T394778)
[15:02:59] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2025-11-14-022545 / 2025-11-17-175029 to 2025-12-03-005631 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214517 (https://phabricator.wikimedia.org/T410605) (owner: 10Jforrester)
[15:03:30] <wikibugs>	 (03PS1) 10Ayounsi: Tox: remove old python support [cookbooks] - 10https://gerrit.wikimedia.org/r/1214532
[15:03:40] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: sync
[15:03:48] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+1] installserver/partman: Add custom recipe for ml-build1001 [puppet] - 10https://gerrit.wikimedia.org/r/1214530 (https://phabricator.wikimedia.org/T394778) (owner: 10Klausman)
[15:03:57] <wikibugs>	 (03CR) 10Ayounsi: sre.network.tls: add timeout to get_server_certificate [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 (owner: 10Ayounsi)
[15:04:20] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync
[15:04:20] <wikibugs>	 (03PS4) 10Ayounsi: sre.network.tls: add timeout to get_server_certificate [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337
[15:04:56] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-11-14-022545 / 2025-11-17-175029 to 2025-12-03-005631 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214517 (https://phabricator.wikimedia.org/T410605) (owner: 10Jforrester)
[15:06:03] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:06:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11428624 (10RobH) 05Open→03Resolved This migration was completed just now with no issues.  Thanks to both @Jclark-ctr and @herron for the on-site part and the icin...
[15:06:40] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2205.codfw.wmnet with reason: Maintenance
[15:06:51] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:07:03] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:08:08] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:08:17] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:09:04] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:09:40] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-11-26-175208 to 2025-12-02-224740 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214518 (https://phabricator.wikimedia.org/T411336) (owner: 10Jforrester)
[15:10:00] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:11:50] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-11-26-175208 to 2025-12-02-224740 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214518 (https://phabricator.wikimedia.org/T411336) (owner: 10Jforrester)
[15:12:22] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:12:31] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: eqiad: cleanup Interface enabled but not connected alert - https://phabricator.wikimedia.org/T411194#11428643 (10Jclark-ctr) 05Open→03Resolved Resolved the remaining.
[15:12:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: eqiad: cleanup Interface enabled but not connected alert - https://phabricator.wikimedia.org/T411194#11428647 (10Jclark-ctr)
[15:12:42] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:13:57] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:14:27] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:15:10] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:15:42] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:16:14] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: sync
[15:16:37] <Amir1>	 jouncebot: nowandnext
[15:16:37] <jouncebot>	 For the next 0 hour(s) and 43 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1500)
[15:16:37] <jouncebot>	 In 0 hour(s) and 13 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1530)
[15:16:52] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync
[15:16:54] <James_F>	 Amir1: We're not MW-facing, go for it on your end.
[15:17:36] <Amir1>	 <3
[15:17:47] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Clean up db groups config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211664 (https://phabricator.wikimedia.org/T411088) (owner: 10Ladsgroup)
[15:17:48] <wikibugs>	 (03PS3) 10Jforrester: wikifunctions: Set FUNCTION_EVALUATOR_WASI_ACQUIRE_TIMEOUT to 1.5s down from 3s default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204607 (https://phabricator.wikimedia.org/T408977)
[15:17:53] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Set FUNCTION_EVALUATOR_WASI_ACQUIRE_TIMEOUT to 1.5s down from 3s default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204607 (https://phabricator.wikimedia.org/T408977) (owner: 10Jforrester)
[15:17:54] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway)
[15:18:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211664 (https://phabricator.wikimedia.org/T411088) (owner: 10Ladsgroup)
[15:18:54] <wikibugs>	 (03Merged) 10jenkins-bot: Clean up db groups config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211664 (https://phabricator.wikimedia.org/T411088) (owner: 10Ladsgroup)
[15:19:26] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1211664|Clean up db groups config (T411088)]]
[15:19:29] <stashbot>	 T411088: Clean up groups config - https://phabricator.wikimedia.org/T411088
[15:19:30] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Set orchestrator maxSimultaneousExecutions to 1000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214536 (https://phabricator.wikimedia.org/T409111)
[15:19:56] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Set FUNCTION_EVALUATOR_WASI_ACQUIRE_TIMEOUT to 1.5s down from 3s default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204607 (https://phabricator.wikimedia.org/T408977) (owner: 10Jforrester)
[15:20:36] <wikibugs>	 (03CR) 10Elukey: [C:03+1] ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway)
[15:20:38] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:20:52] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:21:08] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:21:17] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11428720 (10RobH) Day 12 Update (in progress, will edit as day progresses):  * alert1002 migration complete * 306 of 308 hosts migrated. * lvs1019 will migrat...
[15:21:41] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1211664|Clean up db groups config (T411088)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:21:54] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:21:54] <wikibugs>	 (03CR) 10Elukey: [C:03+1] reimage: default to UUID rather than Option 82 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214152 (owner: 10JHathaway)
[15:22:08] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:23:07] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:23:07] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[15:23:14] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway)
[15:23:40] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Set orchestrator maxSimultaneousExecutions to 1000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214536 (https://phabricator.wikimedia.org/T409111) (owner: 10Jforrester)
[15:24:23] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] reimage: default to UUID rather than Option 82 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214152 (owner: 10JHathaway)
[15:24:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] reimage: default to UUID rather than Option 82 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214152 (owner: 10JHathaway)
[15:24:52] <wikibugs>	 (03PS1) 10Ayounsi: inter.link: add DDoS scrubbing community to all v4 prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/1214537 (https://phabricator.wikimedia.org/T407959)
[15:25:23] <wikibugs>	 (03PS2) 10Ayounsi: inter.link: add DDoS scrubbing community to all v4 prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/1214537 (https://phabricator.wikimedia.org/T407959)
[15:25:33] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Set orchestrator maxSimultaneousExecutions to 1000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214536 (https://phabricator.wikimedia.org/T409111) (owner: 10Jforrester)
[15:26:08] <wikibugs>	 (03PS2) 10JHathaway: reimage: default to UUID rather than Option 82 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214152
[15:26:22] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:26:41] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:26:43] <wikibugs>	 (03CR) 10JHathaway: [V:03+2 C:03+2] reimage: default to UUID rather than Option 82 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214152 (owner: 10JHathaway)
[15:26:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host conf2006.codfw.wmnet
[15:27:00] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:27:04] <wikibugs>	 (03PS3) 10Klausman: installserver/partman: Add custom recipe for ml-build1001 [puppet] - 10https://gerrit.wikimedia.org/r/1214530 (https://phabricator.wikimedia.org/T394778)
[15:27:13] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211664|Clean up db groups config (T411088)]] (duration: 07m 48s)
[15:27:16] <stashbot>	 T411088: Clean up groups config - https://phabricator.wikimedia.org/T411088
[15:27:35] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:27:43] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:27:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch conf2006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214396 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[15:28:15] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:30:06] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1500)
[15:30:06] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1530)
[15:30:11] <jinxer-wm>	 FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:32:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host conf2006.codfw.wmnet
[15:35:00] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:38:19] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214494 (https://phabricator.wikimedia.org/T410918) (owner: 10DDesouza)
[15:43:35] <wikibugs>	 (03CR) 10Elukey: [C:03+1] UEFI: dup partition on MD RAID boxes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway)
[15:47:27] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "I am fine with this, we could even think about replacing py39/310 with 312/313 and see how it goes, to be more future proof. It can be don" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214532 (owner: 10Ayounsi)
[15:50:08] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway)
[15:55:03] <wikibugs>	 (03PS4) 10Majavah: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102)
[15:55:03] <wikibugs>	 (03PS5) 10Majavah: wmflib: hosts2ips: Allow passing in IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1211650
[15:55:03] <wikibugs>	 (03PS5) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089)
[15:55:04] <wikibugs>	 (03PS5) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089)
[15:55:05] <wikibugs>	 (03PS1) 10Majavah: ferm: Only collect resources when ensure is present [puppet] - 10https://gerrit.wikimedia.org/r/1214549
[15:56:56] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch conf2005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214550 (https://phabricator.wikimedia.org/T349619)
[15:59:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: dup-uefi.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:00:10] <wikibugs>	 (03PS5) 10Majavah: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102)
[16:00:10] <wikibugs>	 (03PS6) 10Majavah: wmflib: hosts2ips: Allow passing in IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1211650
[16:00:10] <wikibugs>	 (03PS2) 10Majavah: ferm: Only collect resources when ensure is present [puppet] - 10https://gerrit.wikimedia.org/r/1214549
[16:00:11] <wikibugs>	 (03PS6) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089)
[16:00:12] <wikibugs>	 (03PS6) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089)
[16:03:16] <icinga-wm>	 PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos
[16:03:16] <icinga-wm>	 PROBLEM - Thanos swift https on thanos-fe1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos
[16:03:26] <wikibugs>	 (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah)
[16:04:27] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch conf2004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214553 (https://phabricator.wikimedia.org/T349619)
[16:05:06] <icinga-wm>	 RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Thanos
[16:05:10] <icinga-wm>	 RECOVERY - Thanos swift https on thanos-fe1006 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 4.034 second response time https://wikitech.wikimedia.org/wiki/Thanos
[16:05:40] <wikibugs>	 10SRE-SLO, 10Citoid, 10VisualEditor, 06Editing-team (Tracking): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11429011 (10Mvolz) >>! In T345627#11409512, @elukey wrote: > @Mvolz all merged, the new dashbo...
[16:06:17] <wikibugs>	 (03CR) 10Majavah: nftables::service: Improve src/dst filter handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah)
[16:07:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1214556 (https://phabricator.wikimedia.org/T311407)
[16:07:59] <wikibugs>	 (03CR) 10Majavah: firewall: Declare resources for both providers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah)
[16:09:00] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto)
[16:09:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: dup-uefi.service on cirrussearch1124:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:09:40] <dancy>	 jouncebot nowandnext
[16:09:40] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 50 minute(s)
[16:09:40] <jouncebot>	 In 1 hour(s) and 50 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800)
[16:09:40] <jouncebot>	 In 1 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800)
[16:09:59] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.229.0" for 164 host(s)
[16:10:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch conf1007 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214557 (https://phabricator.wikimedia.org/T349619)
[16:10:48] <wikibugs>	 (03CR) 10Majavah: service::catalog: add gerrit-https and gerrit-ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto)
[16:11:59] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch conf1008 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214558 (https://phabricator.wikimedia.org/T349619)
[16:12:41] <wikibugs>	 (03PS7) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549
[16:13:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch conf1009 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214561 (https://phabricator.wikimedia.org/T349619)
[16:13:54] <logmsgbot>	 !log dancy@deploy2002 Installation of scap version "4.229.0" completed for 164 hosts
[16:15:09] <topranks>	 !log disabling unused former cloudcephosd hosts on cloud switches T410989
[16:15:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:12] <stashbot>	 T410989: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989
[16:16:00] <wikibugs>	 (03PS1) 10Bking: opensearch-operator: push dummy chart update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214562 (https://phabricator.wikimedia.org/T410956)
[16:19:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: dup-uefi.service on cirrussearch1119:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:19:37] <wikibugs>	 (03PS1) 10JHathaway: UEFI: remove dup timer on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1214563 (https://phabricator.wikimedia.org/T376949)
[16:19:57] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1214563 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway)
[16:20:24] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11429115 (10cmooney) a:05cmooney→03None Ok I've disabled all the unused ports on the cloud switches now.  The one exception is for cloudcepho...
[16:21:12] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11429120 (10cmooney) DC-Ops folks we can now remove these superflous cables from the racks, and once removed delete the cable in Netbox too.  Thi...
[16:21:22] <wikibugs>	 (03CR) 10CDanis: service::catalog: add gerrit-https and gerrit-ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto)
[16:22:23] <wikibugs>	 (03PS2) 10JHathaway: UEFI: remove dup timer on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1214563 (https://phabricator.wikimedia.org/T376949)
[16:22:27] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1214563 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway)
[16:24:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Only select Puppet version based on the Debian release (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798)
[16:24:26] <wikibugs>	 (03PS4) 10Arnaudb: gerrit: unmask service & disable backup temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833)
[16:24:26] <wikibugs>	 (03CR) 10Arnaudb: "this change and the next one are designed to be merged after puppet is disabled on all Gerrit instances." [puppet] - 10https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[16:25:37] <wikibugs>	 (03CR) 10Vgutierrez: sre.loadbalancer: patch to fix reboot action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 (owner: 10CDobbins)
[16:26:15] <wikibugs>	 (03PS3) 10Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1211549 (https://phabricator.wikimedia.org/T338470)
[16:26:15] <wikibugs>	 (03CR) 10Arnaudb: "this change is designed to be merged after puppet is disabled on all Gerrit instances." [puppet] - 10https://gerrit.wikimedia.org/r/1211549 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb)
[16:26:25] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214568 (https://phabricator.wikimedia.org/T410469)
[16:27:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214568 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm)
[16:27:17] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] UEFI: remove dup timer on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1214563 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway)
[16:27:20] <wikibugs>	 (03PS3) 10Arnaudb: gerrit: re-enable backups on gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1211551 (https://phabricator.wikimedia.org/T387833)
[16:27:20] <wikibugs>	 (03CR) 10Arnaudb: "this change is designed to be merged once the switchover is done. It will enable backups again on what will then be the primary instance." [puppet] - 10https://gerrit.wikimedia.org/r/1211551 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[16:31:01] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214570 (https://phabricator.wikimedia.org/T410469)
[16:31:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214570 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm)
[16:32:55] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] Sort the list of Add Link wikis alphabetically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214571 (https://phabricator.wikimedia.org/T410469)
[16:33:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [Growth] Sort the list of Add Link wikis alphabetically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214571 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm)
[16:35:25] <wikibugs>	 (03PS2) 10Urbanecm: [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214570 (https://phabricator.wikimedia.org/T410469)
[16:37:30] <wikibugs>	 (03PS2) 10Muehlenhoff: etcd: Remove the use_pki_certs flag [puppet] - 10https://gerrit.wikimedia.org/r/978615
[16:37:40] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: codfw: cleanup Interface enabled but not connected alert - https://phabricator.wikimedia.org/T411195#11429197 (10Jhancock.wm) 05Open→03Resolved all ports verified empty and removed from netbox
[16:38:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] etcd: Remove the use_pki_certs flag [puppet] - 10https://gerrit.wikimedia.org/r/978615 (owner: 10Muehlenhoff)
[16:38:57] <bd808>	 jouncebot: nowandnext
[16:38:57] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 21 minute(s)
[16:38:57] <jouncebot>	 In 1 hour(s) and 21 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800)
[16:38:57] <jouncebot>	 In 1 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800)
[16:39:11] <wikibugs>	 (03PS3) 10Muehlenhoff: etcd: Remove the use_pki_certs flag [puppet] - 10https://gerrit.wikimedia.org/r/978615
[16:39:17] <wikibugs>	 (03CR) 10Muehlenhoff: etcd: Remove the use_pki_certs flag [puppet] - 10https://gerrit.wikimedia.org/r/978615 (owner: 10Muehlenhoff)
[16:40:11] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:40:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211191 (owner: 10BryanDavis)
[16:40:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211192 (owner: 10BryanDavis)
[16:41:21] <wikibugs>	 (03Merged) 10jenkins-bot: officewiki: Put indicators in title with vector-2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211191 (owner: 10BryanDavis)
[16:41:23] <wikibugs>	 (03Merged) 10jenkins-bot: officewiki: Enable page protection indicators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211192 (owner: 10BryanDavis)
[16:41:52] <vgutierrez>	 sukhe: ^^ any recent changes to druid-public-coordinator?
[16:41:54] <logmsgbot>	 !log bd808@deploy2002 Started scap sync-world: Backport for [[gerrit:1211191|officewiki: Put indicators in title with vector-2022]], [[gerrit:1211192|officewiki: Enable page protection indicators]]
[16:42:02] <vgutierrez>	 /cc btullis 
[16:42:12] <sukhe>	 vgutierrez: nope. the last patch attempt failed so we reverted it
[16:42:31] <sukhe>	 will check after the meeting
[16:42:36] <vgutierrez>	 did you clear the alerts after that?
[16:44:00] <sukhe>	 I am not sure if this is related to that so I will need to check
[16:44:02] <sukhe>	 will do so
[16:44:06] <sukhe>	 but no, I did not clear the alerts
[16:44:22] <logmsgbot>	 !log bd808@deploy2002 bd808: Backport for [[gerrit:1211191|officewiki: Put indicators in title with vector-2022]], [[gerrit:1211192|officewiki: Enable page protection indicators]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:45:03] <vgutierrez>	 yeah.. that alert is 16 days old
[16:45:38] <logmsgbot>	 !log bd808@deploy2002 bd808: Continuing with sync
[16:46:07] <vgutierrez>	 but it's still being triggered at the moment
[16:46:09] <vgutierrez>	 2025-12-03T16:45:21.997079+00:00 config-master2001 confd[88705]: 2025-12-03T16:45:21Z config-master2001 /usr/bin/confd[88705]: ERROR "failed linting '/usr/local/bin/pybal-eval-check /srv/config-master/pybal/eqiad/.druid-public-coordinator1210973826' with 1 (0.02312159538269043s) [invalid]: server pool cannot be empty!\n\nupdating error mtime on 
[16:46:09] <vgutierrez>	 /var/run/confd-template/_srv_config-master_pybal_eqiad_druid-public-coordinator.err\n"
[16:47:31] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to the analytics-platform-eng-admins POSIX group for Sandra Ebele Nwoko - https://phabricator.wikimedia.org/T411648 (10BTullis) 03NEW
[16:47:44] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Requesting access to the analytics-platform-eng-admins POSIX group for Sandra Ebele Nwoko - https://phabricator.wikimedia.org/T411648#11429236 (10BTullis) a:03BTullis
[16:48:44] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Requesting access to the analytics-platform-eng-admins POSIX group for Sandra Ebele Nwoko - https://phabricator.wikimedia.org/T411648#11429244 (10Ahoelzl) Approved.
[16:49:41] <logmsgbot>	 !log bd808@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211191|officewiki: Put indicators in title with vector-2022]], [[gerrit:1211192|officewiki: Enable page protection indicators]] (duration: 07m 47s)
[16:49:51] <wikibugs>	 (03PS1) 10Btullis: Add more approvers for analytics-platform-eng-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1214576 (https://phabricator.wikimedia.org/T411648)
[16:49:55] <jinxer-wm>	 RESOLVED: [5x] SystemdUnitFailed: dup-uefi.service on cirrussearch1119:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:51:16] <wikibugs>	 (03PS1) 10Btullis: Add ebysans to the analytics-platform-eng-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1214578 (https://phabricator.wikimedia.org/T411648)
[16:53:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214529 (https://phabricator.wikimedia.org/T411632) (owner: 10Krinkle)
[16:53:18] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 13Patch-For-Review: Requesting access to the analytics-platform-eng-admins POSIX group for Sandra Ebele Nwoko - https://phabricator.wikimedia.org/T411648#11429266 (10BTullis)
[16:53:33] <bd808>	 Krinkle: ^^ sending that robots.txt fix out
[16:53:55] <wikibugs>	 (03CR) 10Ahoelzl: [V:03+1] Add more approvers for analytics-platform-eng-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1214576 (https://phabricator.wikimedia.org/T411648) (owner: 10Btullis)
[16:54:08] <wikibugs>	 (03Merged) 10jenkins-bot: robots.php: Fix undefined index 'enabled' on Wikinews and closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214529 (https://phabricator.wikimedia.org/T411632) (owner: 10Krinkle)
[16:54:09] <wikibugs>	 (03CR) 10Ahoelzl: [V:03+1] Add ebysans to the analytics-platform-eng-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1214578 (https://phabricator.wikimedia.org/T411648) (owner: 10Btullis)
[16:54:40] <logmsgbot>	 !log bd808@deploy2002 Started scap sync-world: Backport for [[gerrit:1214529|robots.php: Fix undefined index 'enabled' on Wikinews and closed wikis (T411632)]]
[16:54:43] <stashbot>	 T411632: PHP Warning: Undefined array key "enabled" - https://phabricator.wikimedia.org/T411632
[16:55:11] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add more approvers for analytics-platform-eng-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1214576 (https://phabricator.wikimedia.org/T411648) (owner: 10Btullis)
[16:55:18] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add ebysans to the analytics-platform-eng-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1214578 (https://phabricator.wikimedia.org/T411648) (owner: 10Btullis)
[16:56:50] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 13Patch-For-Review: Requesting access to the analytics-platform-eng-admins POSIX group for Sandra Ebele Nwoko - https://phabricator.wikimedia.org/T411648#11429277 (10BTullis) 05Open→03Resolved
[16:57:07] <logmsgbot>	 !log bd808@deploy2002 bd808, krinkle: Backport for [[gerrit:1214529|robots.php: Fix undefined index 'enabled' on Wikinews and closed wikis (T411632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:58:17] <logmsgbot>	 !log bd808@deploy2002 bd808, krinkle: Continuing with sync
[16:58:24] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11429297 (10Jhancock.wm) the four servers in codfw have had cables physically removed and deleted in netbox.
[17:01:23] <Krinkle>	 bd808: thx, want me to test or are you?
[17:02:20] <logmsgbot>	 !log bd808@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214529|robots.php: Fix undefined index 'enabled' on Wikinews and closed wikis (T411632)]] (duration: 07m 40s)
[17:02:23] <stashbot>	 T411632: PHP Warning: Undefined array key "enabled" - https://phabricator.wikimedia.org/T411632
[17:02:24] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981#11429343 (10Jhancock.wm) @Dwisehaupt two network connections have now been provisioned. lmk if you need anything else =)
[17:02:41] <bd808>	 Krinkle: I did a quick test that robots.txt still rendered before I sent it the rest of the way, but if you can watch to make sure the error stops that would be swell.
[17:02:56] <Krinkle>	 yep, error is gone on https://en.wikinews.org/robots.txt?_1235 after the change on mwdebug
[17:03:03] <Krinkle>	 no more logspam
[17:04:07] <icinga-wm>	 PROBLEM - Host db1229 #page is DOWN: PING CRITICAL - Packet loss = 100%
[17:05:36] <jynus>	 checking
[17:06:01] <icinga-wm>	 RECOVERY - Host db1229 #page is UP: PING WARNING - Packet loss = 50%, RTA = 284.91 ms
[17:06:33] <jynus>	 did it crash?
[17:06:38] <moritzm>	 hmmh, can't reach the host even on the mgmt
[17:06:42] <jynus>	 I can
[17:06:47] <jynus>	 it rebooted
[17:06:56] <icinga-wm>	 PROBLEM - MariaDB read only s2 on db1229 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[17:06:59] <jynus>	 let's depool
[17:07:07] <wikibugs>	 (03Abandoned) 10Urbanecm: [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214568 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm)
[17:07:08] <jynus>	 prob hw issue
[17:07:12] <wikibugs>	 (03PS4) 10Urbanecm: [Growth] Sort the list of Add Link wikis alphabetically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214571 (https://phabricator.wikimedia.org/T410469)
[17:07:42] <wikibugs>	 (03CR) 10Urbanecm: "> https://integration.wikimedia.org/ci/job/operations-mw-config-php83-composer-diffConfig/76/console : FAILURE No change detected against " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214571 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm)
[17:07:46] <logmsgbot>	 !log jynus@cumin1003 dbctl commit (dc=all): 'Depooldb1229', diff saved to https://phabricator.wikimedia.org/P86383 and previous config saved to /var/cache/conftool/dbconfig/20251203-170745-jynus.json
[17:07:56] <icinga-wm>	 PROBLEM - mysqld processes on db1229 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[17:07:57] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s2 #page on db1229 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:07:58] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s2 #page on db1229 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:08:09] <marostegui>	 mmm
[17:08:11] <jynus>	 I will ack or downtime
[17:08:17] <jynus>	 looks like a hw crash
[17:08:21] <marostegui>	 yeah
[17:08:23] <jynus>	 and file a ticket
[17:08:33] <wikibugs>	 10SRE-SLO, 10Citoid, 10VisualEditor, 06Editing-team (Tracking): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11429363 (10elukey) @Mvolz ahhh ok thanks for the explanation! I rechecked the graph and it sh...
[17:08:37] <marostegui>	 I am depooling
[17:09:05] <jynus>	 I did it already
[17:09:07] <jynus>	 see backlog
[17:09:09] <marostegui>	 oh!
[17:09:10] <marostegui>	 thanks!
[17:10:38] <moritzm>	 there's a broken DIMM at B7
[17:10:49] <moritzm>	 I'll open a task for DC ops to get it swapped
[17:11:01] <marostegui>	 thank you moritzm tag DBA if you can!
[17:11:05] <jynus>	 please add the info to 411652
[17:11:10] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1229.eqiad.wmnet with reason: crashed
[17:11:13] <jynus>	 moritzm:  ^
[17:11:23] <moritzm>	 ah, thx
[17:11:51] <jynus>	 T411652
[17:11:51] <stashbot>	 T411652: db1229 crashed - https://phabricator.wikimedia.org/T411652
[17:11:52] <wikibugs>	 (03PS1) 10Sbisson: CX3 Build 1.0.0+20251126 [extensions/ContentTranslation] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214580 (https://phabricator.wikimedia.org/T384485)
[17:13:12] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11429385 (10MoritzMuehlenhoff)
[17:14:17] <stephanebisson>	 jouncebot now
[17:14:17] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 45 minute(s)
[17:14:22] <stephanebisson>	 jouncebot next
[17:14:22] <jouncebot>	 In 0 hour(s) and 45 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800)
[17:14:22] <jouncebot>	 In 0 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800)
[17:15:01] <stephanebisson>	 Hi, any chance I can do an emergency backport for Content Translation right now?
[17:15:23] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11429388 (10jcrespo)
[17:15:45] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11429389 (10ssingh) >>! In T408892#11426444, @Papaul wrote: > @ssingh yes we have to depool the site, yes 10 AM CT  Thanks, that works. Will send an invite.
[17:16:40] <wikibugs>	 10ops-eqiad, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11429390 (10jcrespo)
[17:16:54] <wikibugs>	 (03PS1) 10Marostegui: db1229: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1214581 (https://phabricator.wikimedia.org/T411652)
[17:17:01] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11429391 (10cmooney) >>! In T410989#11429297, @Jhancock.wm wrote: > the four servers in codfw have had cables physically removed and deleted in n...
[17:17:36] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1229: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1214581 (https://phabricator.wikimedia.org/T411652) (owner: 10Marostegui)
[17:17:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11429394 (10cmooney)
[17:17:48] <wikibugs>	 (03PS5) 10Mstyles: OATHAuth: Expand 2FA to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664)
[17:18:39] <wikibugs>	 (03CR) 10Mstyles: OATHAuth: Expand 2FA to all users (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles)
[17:19:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214580 (https://phabricator.wikimedia.org/T384485) (owner: 10Sbisson)
[17:19:26] <moritzm>	 given db1229 is depooled, DC ops are looped in for the eventual hardware fix and notifications are now disabled, I'd resolve the page
[17:20:09] <wikibugs>	 06SRE, 06collaboration-services, 10vrts, 10Znuny, and 2 others: No space left on device on VRTS host - https://phabricator.wikimedia.org/T411452#11429403 (10Dzahn) +1 - it seems the cleanup job is needed
[17:20:11] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[17:23:09] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks for the additional discussion, Reuven." [puppet] - 10https://gerrit.wikimedia.org/r/1213604 (owner: 10RLazarus)
[17:25:38] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:25:48] <icinga-wm>	 PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:26:40] <jinxer-wm>	 FIRING: ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#commons.wikimedia.org:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:26:42] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:26:46] <icinga-wm>	 RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:28:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (2a02:ec80:700:fe0b::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:29:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-magru and 195.200.68.152 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:29:38] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:29:43] <wikibugs>	 (03CR) 10Dzahn: "Thank you for the reviews. I like getting everything merged that is possible to merge without harm. I actually see a benefit in getting pa" [dns] - 10https://gerrit.wikimedia.org/r/1214177 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn)
[17:29:46] <icinga-wm>	 PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:30:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2011 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp2027.codfw.wmnet, cp2035.codfw.wmnet, cp2039.codfw.wmnet, cp2037.codfw.wmnet, cp2031.codfw.wmnet are marked down but pooled: textlb6_443: Servers cp2027.codfw.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[17:30:16] <icinga-wm>	 PROBLEM - SSH on lvs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:30:58] <jinxer-wm>	 FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[17:31:08] <icinga-wm>	 RECOVERY - SSH on lvs1016 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:31:09] <wikibugs>	 (03Merged) 10jenkins-bot: CX3 Build 1.0.0+20251126 [extensions/ContentTranslation] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214580 (https://phabricator.wikimedia.org/T384485) (owner: 10Sbisson)
[17:31:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2011 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:31:40] <jinxer-wm>	 RESOLVED: ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#commons.wikimedia.org:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:31:41] <logmsgbot>	 !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1214580|CX3 Build 1.0.0+20251126 (T384485)]]
[17:32:01] <stashbot>	 T384485: Recommendation API: Support pagination for single page collection recommendations - https://phabricator.wikimedia.org/T384485
[17:32:38] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:33:39] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:33:54] <icinga-wm>	 RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:34:10] <jinxer-wm>	 RESOLVED: [3x] BFDdown: BFD session down between cr2-magru and 195.200.68.152 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:34:36] <logmsgbot>	 !log sbisson@deploy2002 sbisson: Backport for [[gerrit:1214580|CX3 Build 1.0.0+20251126 (T384485)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:35:11] <jinxer-wm>	 FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[17:35:19] <AntiComposite>	 Multiple reports of enwiki being unreachable, various errors
[17:35:28] <sukhe>	 AntiComposite: thanks, known.
[17:35:58] <jinxer-wm>	 RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[17:36:44] <logmsgbot>	 !log sbisson@deploy2002 sbisson: Continuing with sync
[17:37:46] <jinxer-wm>	 FIRING: Primary inbound port utilisation over 80%  #page: Alert for device cr2-magru.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[17:39:53] <jinxer-wm>	 FIRING: [2x] DDoSDetected: FastNetMon has detected an attack on esams #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected
[17:40:48] <logmsgbot>	 !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214580|CX3 Build 1.0.0+20251126 (T384485)]] (duration: 09m 07s)
[17:40:51] <stashbot>	 T384485: Recommendation API: Support pagination for single page collection recommendations - https://phabricator.wikimedia.org/T384485
[17:40:54] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS trixie
[17:41:44] <jinxer-wm>	 FIRING: RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[17:42:46] <jinxer-wm>	 FIRING: [2x] Primary inbound port utilisation over 80%  #page: Device cr2-magru.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[17:43:16] <wikibugs>	 (03PS1) 10Cparle: Feature flag has been removed from MW code, so remove it from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214584 (https://phabricator.wikimedia.org/T410908)
[17:44:34] <wikibugs>	 (03PS1) 10Santiago Faci: LabsService: Rename mpic-next domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805)
[17:44:53] <jinxer-wm>	 FIRING: [6x] DDoSDetected: FastNetMon has detected an attack on eqiad #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected
[17:45:58] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.network.cf
[17:46:00] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[17:46:08] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.network.cf
[17:46:09] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[17:46:12] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.network.cf
[17:46:19] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.network.cf (exit_code=1)
[17:46:29] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.network.cf
[17:46:30] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[17:46:40] <wikibugs>	 (03PS2) 10Santiago Faci: LabsService: Rename mpic-next domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805)
[17:46:44] <jinxer-wm>	 RESOLVED: RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[17:46:54] <wikibugs>	 (03PS3) 10Santiago Faci: LabsService: Rename mpic-next domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805)
[17:47:17] <wikibugs>	 (03PS1) 10Majavah: Prepepdn eqiad/eqsin/drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1214586
[17:47:46] <jinxer-wm>	 RESOLVED: Primary inbound port utilisation over 80%  #page: Device cr3-eqsin.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[17:47:46] <wikibugs>	 (03PS1) 10Ssingh: sites: set prepend_as_out to true [homer/public] - 10https://gerrit.wikimedia.org/r/1214587
[17:48:05] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Prepepdn eqiad/eqsin/drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1214586 (owner: 10Majavah)
[17:48:10] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1214586 (owner: 10Majavah)
[17:48:23] <wikibugs>	 (03Abandoned) 10Ssingh: sites: set prepend_as_out to true [homer/public] - 10https://gerrit.wikimedia.org/r/1214587 (owner: 10Ssingh)
[17:48:44] <wikibugs>	 (03CR) 10Majavah: [C:03+2] Prepepdn eqiad/eqsin/drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1214586 (owner: 10Majavah)
[17:49:53] <jinxer-wm>	 RESOLVED: [6x] DDoSDetected: FastNetMon has detected an attack on eqiad #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected
[17:50:47] <wikibugs>	 (03PS1) 10Ladsgroup: findBadBlobs: Fix the --scan-to option [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214592 (https://phabricator.wikimedia.org/T351953)
[17:50:58] <wikibugs>	 (03PS1) 10Ladsgroup: findBadBlobs: Fix the --scan-to option [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214593 (https://phabricator.wikimedia.org/T351953)
[17:50:59] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices1006.eqiad.wmnet with OS trixie
[17:51:06] <Amir1>	 jouncebot: nowandnexr
[17:51:07] <Amir1>	 jouncebot: nowandnext
[17:51:08] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 8 minute(s)
[17:51:08] <jouncebot>	 In 0 hour(s) and 8 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800)
[17:51:08] <jouncebot>	 In 0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800)
[17:54:50] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:55:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudservices1006 (172.20.1.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[17:57:03] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage
[18:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800)
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800)
[18:01:10] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage
[18:01:20] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[18:01:42] <James_F>	 jouncebot: refresh
[18:01:42] <jouncebot>	 I refreshed my knowledge about deployments.
[18:01:48] <James_F>	 jouncebot: nowandnext
[18:01:48] <jouncebot>	 For the next 0 hour(s) and 58 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800)
[18:01:48] <jouncebot>	 For the next 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800)
[18:01:48] <jouncebot>	 In 2 hour(s) and 58 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T2100)
[18:01:53] <James_F>	 Whut.
[18:02:04] <Amir1>	 :D
[18:02:12] <James_F>	 That's wrong.
[18:02:20] <James_F>	 "Wikifunctions Services UTC Afternoon" was three hours ago.
[18:03:11] <James_F>	 The time stamps on the wiki look right?
[18:03:16] <James_F>	 Oh!
[18:04:11] <James_F>	 jouncebot: refresh
[18:04:12] <jouncebot>	 I refreshed my knowledge about deployments.
[18:04:16] <James_F>	 jouncebot: nowandnext
[18:04:16] <jouncebot>	 For the next 0 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800)
[18:04:16] <jouncebot>	 In 2 hour(s) and 55 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T2100)
[18:04:19] <James_F>	 Better.
[18:04:58] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating for cloudceph to codfw - jhancock@cumin1003"
[18:05:02] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating for cloudceph to codfw - jhancock@cumin1003"
[18:05:02] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:05:10] <Amir1>	 Is anything happening on the infra window?
[18:05:17] <Amir1>	 It doesn't look like it :D
[18:05:30] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] findBadBlobs: Fix the --scan-to option [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214592 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup)
[18:05:35] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] findBadBlobs: Fix the --scan-to option [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214593 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup)
[18:08:44] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1006.eqiad.wmnet with reason: host reimage
[18:10:55] <wikibugs>	 (03Merged) 10jenkins-bot: findBadBlobs: Fix the --scan-to option [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214592 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup)
[18:11:00] <wikibugs>	 (03Merged) 10jenkins-bot: findBadBlobs: Fix the --scan-to option [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214593 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup)
[18:12:09] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1006.eqiad.wmnet with reason: host reimage
[18:17:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11429625 (10Jclark-ctr) @MoritzMuehlenhoff Drive has been Replaced
[18:19:54] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1214592|findBadBlobs: Fix the --scan-to option (T351953)]], [[gerrit:1214593|findBadBlobs: Fix the --scan-to option (T351953)]]
[18:19:57] <stashbot>	 T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953
[18:20:17] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11429645 (10Jclark-ctr) ` NAME               SIZE MODEL                     SERIAL       PATH sda              894.3G Micron_5400_MTFDDAK960TGA 24144807E580 /dev/sda ├─sda1...
[18:22:14] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1214592|findBadBlobs: Fix the --scan-to option (T351953)]], [[gerrit:1214593|findBadBlobs: Fix the --scan-to option (T351953)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:22:37] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[18:24:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11429667 (10Jclark-ctr) ` [Wed Dec  3 18:16:03 2025] ata7.00: detaching (SCSI 6:0:0:0) [Wed Dec  3 18:16:03 2025] sd 6:0:0:0: [sdb] Synchronizing SCSI cache [Wed Dec  3 18:16:03 2025] sd 6:0:0:0: [sdb] Synch...
[18:24:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11429670 (10Jclark-ctr) a:03Jclark-ctr
[18:25:13] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11429672 (10KFrancis) Hi @andrea.denisse, Caro Medelius (cmedelius@wikimedia.org) is already a WMF employee.  The NDA is covered under their employment agreement with the WMF.
[18:25:23] <logmsgbot>	 !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1020.eqiad.wmnet with reason: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad
[18:26:42] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214592|findBadBlobs: Fix the --scan-to option (T351953)]], [[gerrit:1214593|findBadBlobs: Fix the --scan-to option (T351953)]] (duration: 06m 48s)
[18:26:45] <stashbot>	 T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953
[18:30:53] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "lgtm, deploying soon" [puppet] - 10https://gerrit.wikimedia.org/r/1214109 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy)
[18:30:58] <wikibugs>	 (03PS4) 10Santiago Faci: LabsService: Rename mpic-next domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805)
[18:32:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11429707 (10MoritzMuehlenhoff) Thanks, I'll rebuild the software RAID tomorrow
[18:32:50] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:33:25] <mutante>	 jouncebot: nowandnext
[18:33:26] <jouncebot>	 For the next 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800)
[18:33:26] <jouncebot>	 In 2 hour(s) and 26 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T2100)
[18:34:08] <mutante>	 scap config does not have a php_version variable anymore now, but it was removed in scap itself https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/1021
[18:34:49] <mutante>	 out of abundance of caution.. mentioning it for next scap run
[18:35:00] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] lvs1020: move row C vlans to primary and add new C/D per-rack vlans [puppet] - 10https://gerrit.wikimedia.org/r/1207877 (https://phabricator.wikimedia.org/T405609) (owner: 10Cathal Mooney)
[18:35:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudservices1006 (172.20.1.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[18:36:55] <wikibugs>	 (03CR) 10CDobbins: sre.loadbalancer: patch to fix reboot action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 (owner: 10CDobbins)
[18:37:02] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Reclaim components from decommed servers - https://phabricator.wikimedia.org/T411533#11429730 (10VRiley-WMF) 05Open→03Resolved Swapped 6 x 1.6TB with 1.9 TB SSDs  Reclaimed  8 x 32 gig pc4-2666 4 x 750w power supplies 10 x 32 gig pc4-3200  I was unable to swap some memor...
[18:37:03] <wikibugs>	 (03PS8) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549
[18:37:13] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1001.eqiad.wmnet with OS trixie
[18:45:16] <wikibugs>	 (03CR) 10Dzahn: admin/releases: deprecate shell user group releasers-mwcli (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1213587 (owner: 10Dzahn)
[18:52:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail: Emails to Google group no-reply@wikimedia.org are not being delivered - SMTP server issue? - https://phabricator.wikimedia.org/T411027#11429823 (10JKelsoteel-WMF) Hey @jhathaway - thanks for your input! I shared these points with Noah as well, and we were able to...
[18:52:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail: Emails to Google group no-reply@wikimedia.org are not being delivered - SMTP server issue? - https://phabricator.wikimedia.org/T411027#11429824 (10taavi) 05Open→03Declined
[18:54:51] <wikibugs>	 (03PS1) 10Cathal Mooney: lvs interfaces: fix error in quoting new vlan ids [puppet] - 10https://gerrit.wikimedia.org/r/1214611 (https://phabricator.wikimedia.org/T405609)
[18:55:11] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:55:27] <wikibugs>	 (PS1) Dzahn: releases: delete now pointless classes for deprecated user groups [puppet] - https://gerrit.wikimedia.org/r/1214612
[18:55:59] <wikibugs>	 (CR) Dzahn: admin/releases: deprecate shell user group releasers-mwcli (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1213587 (owner: Dzahn)
[18:56:21] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7782/console" [puppet] - https://gerrit.wikimedia.org/r/1214611 (https://phabricator.wikimedia.org/T405609) (owner: Cathal Mooney)
[18:56:36] <wikibugs>	 (CR) BCornwall: [V:+1 C:+1] lvs interfaces: fix error in quoting new vlan ids [puppet] - https://gerrit.wikimedia.org/r/1214611 (https://phabricator.wikimedia.org/T405609) (owner: Cathal Mooney)
[18:57:18] <wikibugs>	 (CR) Catrope: [C:+1] OATHAuth: Expand 2FA to all users [mediawiki-config] - https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664) (owner: Mstyles)
[18:59:02] <wikibugs>	 (PS1) Andrew Bogott: cloudservices1006: use new yaml-based pdns-recursor config [puppet] - https://gerrit.wikimedia.org/r/1214613 (https://phabricator.wikimedia.org/T375217)
[18:59:03] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664) (owner: Mstyles)
[18:59:04] <wikibugs>	 (PS1) Andrew Bogott: pdns-recursor: use yaml-based config in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1214614 (https://phabricator.wikimedia.org/T375217)
[18:59:21] <wikibugs>	 (PS1) Santiago Faci: wmgLocalServices: Renamed `mpic` to `test-kitchen` local service [mediawiki-config] - https://gerrit.wikimedia.org/r/1214615 (https://phabricator.wikimedia.org/T407805)
[18:59:38] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloudservices1006: use new yaml-based pdns-recursor config [puppet] - https://gerrit.wikimedia.org/r/1214613 (https://phabricator.wikimedia.org/T375217) (owner: Andrew Bogott)
[19:00:05] <wikibugs>	 (CR) CI reject: [V:-1] wmgLocalServices: Renamed `mpic` to `test-kitchen` local service [mediawiki-config] - https://gerrit.wikimedia.org/r/1214615 (https://phabricator.wikimedia.org/T407805) (owner: Santiago Faci)
[19:00:20] <wikibugs>	 (CR) Cathal Mooney: [C:+2] lvs interfaces: fix error in quoting new vlan ids [puppet] - https://gerrit.wikimedia.org/r/1214611 (https://phabricator.wikimedia.org/T405609) (owner: Cathal Mooney)
[19:00:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11429856 (BCornwall)
[19:04:06] <wikibugs>	 (PS2) Andrew Bogott: pdns-recursor: use yaml-based config in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1214614 (https://phabricator.wikimedia.org/T375217)
[19:04:15] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1214613 (https://phabricator.wikimedia.org/T375217) (owner: Andrew Bogott)
[19:04:40] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1214614 (https://phabricator.wikimedia.org/T375217) (owner: Andrew Bogott)
[19:06:43] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1020.eqiad.wmnet with OS bullseye
[19:06:45] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T410589)', diff saved to https://phabricator.wikimedia.org/P86387 and previous config saved to /var/cache/conftool/dbconfig/20251203-190644-ladsgroup.json
[19:06:48] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[19:06:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 3 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11429884 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@c...
[19:11:37] <wikibugs>	 (PS4) Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438)
[19:12:32] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438) (owner: Kgraessle)
[19:12:58] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11429917 (VRiley-WMF) a:VRiley-WMF
[19:14:56] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices1006.eqiad.wmnet with OS trixie
[19:15:51] <jinxer-wm>	 FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[19:19:35] <wikibugs>	 (CR) Hashar: [C:-1] Ease configuration of the motd banner (2 comments) [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1194221 (owner: Hashar)
[19:20:51] <jinxer-wm>	 RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[19:20:59] <wikibugs>	 (PS5) Hashar: Ease configuration of the motd banner [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1194221
[19:21:04] <wikibugs>	 (PS1) Cathal Mooney: Revert "Prepepdn eqiad/eqsin/drmrs" [homer/public] - https://gerrit.wikimedia.org/r/1214618
[19:21:22] <wikibugs>	 (CR) Hashar: [C:+2] Ease configuration of the motd banner [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1194221 (owner: Hashar)
[19:21:51] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1020.eqiad.wmnet with reason: host reimage
[19:21:52] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P86388 and previous config saved to /var/cache/conftool/dbconfig/20251203-192152-ladsgroup.json
[19:22:10] <topranks>	 !log disabling remote announcement of bgp prefixes 
[19:22:10] <wikibugs>	 (CR) Ssingh: [C:+1] Revert "Prepepdn eqiad/eqsin/drmrs" [homer/public] - https://gerrit.wikimedia.org/r/1214618 (owner: Cathal Mooney)
[19:22:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:15] <wikibugs>	 (Merged) jenkins-bot: Ease configuration of the motd banner [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1194221 (owner: Hashar)
[19:22:17] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.network.cf
[19:22:19] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[19:22:33] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Revert "Prepepdn eqiad/eqsin/drmrs" [homer/public] - https://gerrit.wikimedia.org/r/1214618 (owner: Cathal Mooney)
[19:22:58] <logmsgbot>	 !log hashar@deploy2002 Started deploy [gerrit/gerrit@93bde2a]: Ease configuration of the motd banner
[19:23:07] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [gerrit/gerrit@93bde2a]: Ease configuration of the motd banner (duration: 00m 09s)
[19:23:46] <wikibugs>	 (Merged) jenkins-bot: Revert "Prepepdn eqiad/eqsin/drmrs" [homer/public] - https://gerrit.wikimedia.org/r/1214618 (owner: Cathal Mooney)
[19:25:02] <icinga-wm>	 PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100%
[19:28:00] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11429987 (Jclark-ctr) Dell ticket opened Service request 219590203
[19:28:48] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1020.eqiad.wmnet with reason: host reimage
[19:29:58] <jinxer-wm>	 FIRING: [2x] NELHigh: Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[19:30:04] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 105.40 ms
[19:30:11] <jinxer-wm>	 FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:32:58] <jinxer-wm>	 FIRING: [2x] NELByCountryHigh: Elevated Network Error Logging events (tcp.address_unreachable from IT) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[19:37:00] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P86390 and previous config saved to /var/cache/conftool/dbconfig/20251203-193659-ladsgroup.json
[19:37:58] <jinxer-wm>	 FIRING: [5x] NELByCountryHigh: Elevated Network Error Logging events (tcp.address_unreachable from ES) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[19:38:46] <wikibugs>	 (CR) Ahmon Dancy: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1214109 (https://phabricator.wikimedia.org/T396166) (owner: Ahmon Dancy)
[19:38:52] <wikibugs>	 (PS1) D3r1ck01: User: Log where the data was loaded when CAS update failed [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1214620 (https://phabricator.wikimedia.org/T410652)
[19:39:07] <wikibugs>	 (PS1) D3r1ck01: User: Log where the data was loaded when CAS update failed [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1214621 (https://phabricator.wikimedia.org/T410652)
[19:39:43] <jinxer-wm>	 FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95152299 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[19:40:25] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1214620 (https://phabricator.wikimedia.org/T410652) (owner: D3r1ck01)
[19:40:38] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1214621 (https://phabricator.wikimedia.org/T410652) (owner: D3r1ck01)
[19:42:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Recommendation-API: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11430063 (Jclark-ctr) a:Jclark-ctr→None
[19:42:59] <jinxer-wm>	 RESOLVED: [3x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from ES) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh
[19:44:43] <jinxer-wm>	 RESOLVED: [6x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[19:44:58] <jinxer-wm>	 RESOLVED: [2x] NELHigh: Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[19:51:26] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1020.eqiad.wmnet with OS bullseye
[19:51:31] <wikibugs>	 (CR) Dzahn: "pinged in Slack for verification - CCing clinic duty - fyi touching admin/data.yaml" [puppet] - https://gerrit.wikimedia.org/r/1214448 (owner: Awight)
[19:51:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430081 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin...
[19:52:08] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T410589)', diff saved to https://phabricator.wikimedia.org/P86392 and previous config saved to /var/cache/conftool/dbconfig/20251203-195207-ladsgroup.json
[19:52:11] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[19:53:54] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:55:21] <wikibugs>	 (CR) Dzahn: "I think we need to coordinate a bit on the plans now given that the CDN thing has picked up speed now and the remaining time." [puppet] - https://gerrit.wikimedia.org/r/1211549 (https://phabricator.wikimedia.org/T338470) (owner: Arnaudb)
[19:55:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudservices1006 (172.20.1.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[19:55:46] <wikibugs>	 (CR) Dzahn: [C:+1] gerrit: re-enable backups on gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1211551 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[19:56:54] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:57:06] <wikibugs>	 (CR) Dzahn: [C:+1] "+1 to the idea of monitoring this (without being able to actually test it.. can we test it?  like merge and then write a high but not too " [alerts] - https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: AOkoth)
[19:57:07] <wikibugs>	 (CR) Ssingh: [C:+2] geo-resources: add gerrit-addrs resource [dns] - https://gerrit.wikimedia.org/r/1214177 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[19:57:17] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[19:58:27] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[19:58:49] <wikibugs>	 (CR) Dzahn: [C:-1] "let's please revisit this after the new gerrit-lb has been setup - which is happening very soon" [dns] - https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) (owner: Hashar)
[19:58:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:59:45] <wikibugs>	 (CR) Santiago Faci: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1214615 (https://phabricator.wikimedia.org/T407805) (owner: Santiago Faci)
[20:00:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudservices1006 (172.20.1.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[20:01:02] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430098 (cmooney)
[20:02:04] <wikibugs>	 (CR) Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211551 does the opposite - are both patches still useful now?" [puppet] - https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[20:03:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:04:04] <jinxer-wm>	 FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[20:04:46] <wikibugs>	 (CR) Dzahn: [C:+1] conftool-data: add tcp-proxy gerrit service [puppet] - https://gerrit.wikimedia.org/r/1214454 (https://phabricator.wikimedia.org/T365259) (owner: Jelto)
[20:06:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430105 (Jclark-ctr)
[20:07:40] <wikibugs>	 (CR) Dzahn: "kind of duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1202842 but also better since it adds both needed services - but" [puppet] - https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: Jelto)
[20:09:04] <jinxer-wm>	 RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[20:11:19] <wikibugs>	 (CR) Ssingh: [C:+2] conftool-data: geodns: add gerrit-addrs [puppet] - https://gerrit.wikimedia.org/r/1214192 (https://phabricator.wikimedia.org/T365259) (owner: Ssingh)
[20:14:32] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430132 (cmooney)
[20:16:42] <wikibugs>	 (CR) Ssingh: [C:+2] dns.admin: add gerrit-addrs resource [cookbooks] - https://gerrit.wikimedia.org/r/1214179 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[20:19:13] <wikibugs>	 (PS4) Dzahn: service: add gerrit service to service catalog [puppet] - https://gerrit.wikimedia.org/r/1202842 (https://phabricator.wikimedia.org/T408532)
[20:19:35] <wikibugs>	 (PS5) Dzahn: service: add gerrit-https service to service catalog [puppet] - https://gerrit.wikimedia.org/r/1202842 (https://phabricator.wikimedia.org/T408532)
[20:20:33] <wikibugs>	 (CR) Dzahn: "since https://gerrit.wikimedia.org/r/c/operations/puppet/+/1202842 already has reviews and I reacted to them.. and I have also heard comme" [puppet] - https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: Jelto)
[20:20:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:22:41] <wikibugs>	 (Merged) jenkins-bot: dns.admin: add gerrit-addrs resource [cookbooks] - https://gerrit.wikimedia.org/r/1214179 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[20:22:53] <wikibugs>	 (CR) Dzahn: "if you are ok with it I would rebase this on the other so it becomes just the gerrit-ssh part. that way we get them both, each of us did o" [puppet] - https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: Jelto)
[20:23:45] <wikibugs>	 (Abandoned) Santiago Faci: wmgLocalServices: Renamed `mpic` to `test-kitchen` local service [mediawiki-config] - https://gerrit.wikimedia.org/r/1214615 (https://phabricator.wikimedia.org/T407805) (owner: Santiago Faci)
[20:25:49] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices1005.eqiad.wmnet with OS trixie
[20:25:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:25:57] <wikibugs>	 (CR) Andrew Bogott: [C:+2] pdns-recursor: use yaml-based config in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1214614 (https://phabricator.wikimedia.org/T375217) (owner: Andrew Bogott)
[20:26:27] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11430168 (VRiley-WMF) Open→Resolved These cables at eqiad have been physically removed and deleted in netbox.
[20:26:33] <wikibugs>	 (PS5) Santiago Faci: Rename `mpic` local service to `test-kitchen` [mediawiki-config] - https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805)
[20:27:25] <wikibugs>	 (PS1) Andriy.v: Limit thanks for new users at uk.wikipedia to 3 per hour [mediawiki-config] - https://gerrit.wikimedia.org/r/1214631
[20:27:34] <wikibugs>	 (CR) CI reject: [V:-1] Rename `mpic` local service to `test-kitchen` [mediawiki-config] - https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805) (owner: Santiago Faci)
[20:28:08] <wikibugs>	 (PS6) Santiago Faci: Rename `mpic` local service to `test-kitchen` [mediawiki-config] - https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805)
[20:29:42] <wikibugs>	 (CR) Ssingh: "Hi Scott. I will follow up on Monday with a tentative plan. Sorry about the delay -- we have been busy with other stuff and this got sidet" [puppet] - https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: CDobbins)
[20:29:54] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:30:48] <wikibugs>	 (PS7) Santiago Faci: Rename `mpic` local service to `test-kitchen` because of the platform renaming [mediawiki-config] - https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805)
[20:31:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-d5-eqiad and cloudservices1005 (172.20.2.4) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[20:32:39] <wikibugs>	 (CR) Scott French: [C:+1] Remove obsolete Hiera entries [puppet] - https://gerrit.wikimedia.org/r/1214556 (https://phabricator.wikimedia.org/T311407) (owner: Muehlenhoff)
[20:32:58] <wikibugs>	 (CR) SBassett: "Ok, sounds good, thanks." [puppet] - https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: CDobbins)
[20:35:16] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Moritz!" [puppet] - https://gerrit.wikimedia.org/r/1214550 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[20:35:18] <wikibugs>	 (CR) Scott French: [C:+1] Switch conf2004 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1214553 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[20:35:19] <wikibugs>	 (CR) Scott French: [C:+1] Switch conf1007 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1214557 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[20:35:20] <wikibugs>	 (CR) Scott French: [C:+1] Switch conf1008 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1214558 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[20:35:23] <wikibugs>	 (CR) Scott French: [C:+1] Switch conf1009 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1214561 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[20:36:31] <wikibugs>	 (CR) Dzahn: [C:+1] "got confirmation on Slack" [puppet] - https://gerrit.wikimedia.org/r/1214448 (owner: Awight)
[20:36:32] <wikibugs>	 (PS1) Daniel Kinzler: rest gateway: use the new x-trusted-request header [deployment-charts] - https://gerrit.wikimedia.org/r/1214633 (https://phabricator.wikimedia.org/T410379)
[20:38:47] <wikibugs>	 (PS2) Andriy.v: Limit thanks for new users at uk.wikipedia to 3 per hour [mediawiki-config] - https://gerrit.wikimedia.org/r/1214631
[20:40:11] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:40:37] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1005.eqiad.wmnet with reason: host reimage
[20:41:11] <wikibugs>	 (PS3) Andriy.v: Limit thanks for new users at uk.wikipedia to 3 per hour [mediawiki-config] - https://gerrit.wikimedia.org/r/1214631
[20:42:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:42:58] <wikibugs>	 (PS1) Andriy.v: Limit thanks for new users at uk.wikipedia to 3 per hour [mediawiki-config] - https://gerrit.wikimedia.org/r/1214636
[20:43:02] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs1020.eqiad.wmnet
[20:43:03] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1020.eqiad.wmnet
[20:43:28] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430232 (BCornwall)
[20:44:15] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1005.eqiad.wmnet with reason: host reimage
[20:47:13] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@6dfb3b8] (hadoop-test): Deploy spur hqls TEST [analytics/refinery@6dfb3b8b]
[20:48:14] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@6dfb3b8] (hadoop-test): Deploy spur hqls TEST [analytics/refinery@6dfb3b8b] (duration: 01m 01s)
[20:48:31] <wikibugs>	 (CR) Dzahn: [C:+2] Regenerate awight yubikey [puppet] - https://gerrit.wikimedia.org/r/1214448 (owner: Awight)
[20:49:11] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@6dfb3b8]: Deploy spur hqls [analytics/refinery@6dfb3b8b]
[20:50:44] <wikibugs>	 SRE, SRE-Access-Requests: Updating RobH ssh pubkey file to add fido backing - https://phabricator.wikimedia.org/T411678 (RobH) NEW
[20:50:50] <wikibugs>	 (PS1) RobH: RobH yubikey ssh pubkey update [puppet] - https://gerrit.wikimedia.org/r/1214638 (https://phabricator.wikimedia.org/T411678)
[20:51:40] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@6dfb3b8]: Deploy spur hqls [analytics/refinery@6dfb3b8b] (duration: 02m 29s)
[20:51:58] <logmsgbot>	 !log aqu@deploy2002 Started deploy [analytics/refinery@6dfb3b8] (thin): Deploy spur hqls THIN [analytics/refinery@6dfb3b8b]
[20:52:08] <wikibugs>	 (CR) RobH: [C:+2] RobH yubikey ssh pubkey update [puppet] - https://gerrit.wikimedia.org/r/1214638 (https://phabricator.wikimedia.org/T411678) (owner: RobH)
[20:52:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:53:14] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [analytics/refinery@6dfb3b8] (thin): Deploy spur hqls THIN [analytics/refinery@6dfb3b8b] (duration: 01m 16s)
[20:54:22] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to druid pageviews_hourly for astein - https://phabricator.wikimedia.org/T411679 (AStein-WMF) NEW
[20:56:10] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Updating RobH ssh pubkey file to add fido backing - https://phabricator.wikimedia.org/T411678#11430286 (RobH) Open→Resolved
[20:56:22] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to druid pageviews_hourly for astein - https://phabricator.wikimedia.org/T411679#11430288 (greg) As @AStein-WMF 's manager, I approve.
[20:56:46] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1214142 (https://phabricator.wikimedia.org/T411517) (owner: Aaron Schulz)
[20:56:50] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1214142 (https://phabricator.wikimedia.org/T411517) (owner: Aaron Schulz)
[20:57:24] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising-Backlog: Requesting access to druid pageviews_hourly for astein - https://phabricator.wikimedia.org/T411679#11430304 (greg)
[20:58:31] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising-Backlog: Requesting access to druid pageviews_hourly for astein - https://phabricator.wikimedia.org/T411679#11430310 (AStein-WMF) context slack thread: https://wikimedia.slack.com/archives/CSV483812/p1764777672488669  2 things:    # this is fairly time sensitive...
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T2100).
[21:00:05] <jouncebot>	 aude, danisztls, maryum, xSavitar, and AaronSchulz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:23] <danisztls>	 o/
[21:00:27] <xSavitar>	 o/
[21:02:25] <AaronSchulz>	 my change is low risk and can be batched with others
[21:02:26] <aude>	 i can deploy config patches
[21:02:36] <maryum>	 hi! I can also deploy my own with spiderpig
[21:03:02] <danisztls>	 I can self-deploy as well. Mine is also low risk, just increasing surveys coverage.
[21:03:09] <aude>	 ok, then i will do just mine
[21:03:12] <aude>	 starting
[21:03:54] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:04:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by aude@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: LorenMora)
[21:04:18] <xSavitar>	 aude, go for it. I can self-service when it's time.
[21:04:24] <aude>	 ok
[21:04:24] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1214642
[21:05:01] <wikibugs>	 (Merged) jenkins-bot: [Legal Footer] Create config for adding legal footer [mediawiki-config] - https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: LorenMora)
[21:05:34] <logmsgbot>	 !log aude@deploy2002 Started scap sync-world: Backport for [[gerrit:1208380|[Legal Footer] Create config for adding legal footer (T410163)]]
[21:05:38] <stashbot>	 T410163: [Legal Footer] Create config and logic for adding legal footer - https://phabricator.wikimedia.org/T410163
[21:06:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-d5-eqiad and cloudservices1005 (172.20.2.4) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[21:08:06] <jinxer-wm>	 FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[21:08:11] <logmsgbot>	 !log aude@deploy2002 aude, lmora: Backport for [[gerrit:1208380|[Legal Footer] Create config for adding legal footer (T410163)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:08:25] <maryum>	 let me know when I can go
[21:08:51] <aude>	 checking our change
[21:08:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:09:53] <danisztls>	 maryum: I can deploy yours together with mine if you want.
[21:10:00] <maryum>	 yep please go ahead
[21:10:08] <logmsgbot>	 !log aude@deploy2002 aude, lmora: Continuing with sync
[21:13:06] <jinxer-wm>	 RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[21:13:19] <AaronSchulz>	 danisztls: you can do mine as well :)
[21:13:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:14:13] <logmsgbot>	 !log aude@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208380|[Legal Footer] Create config for adding legal footer (T410163)]] (duration: 08m 38s)
[21:14:16] <stashbot>	 T410163: [Legal Footer] Create config and logic for adding legal footer - https://phabricator.wikimedia.org/T410163
[21:14:19] <aude>	 we're done
[21:14:50] <wikibugs>	 (PS1) Jforrester: Followup I81a2c4de77: Verify stats label values are not empty [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1214647 (https://phabricator.wikimedia.org/T411585)
[21:15:14] <wikibugs>	 (CR) Jforrester: "Proposing as a cherry-pick rather than waiting two weeks to find out if this fixes the logspam (given there's no train next week)." [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1214647 (https://phabricator.wikimedia.org/T411585) (owner: Jforrester)
[21:15:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1214494 (https://phabricator.wikimedia.org/T410918) (owner: DDesouza)
[21:15:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664) (owner: Mstyles)
[21:16:17] <wikibugs>	 (Merged) jenkins-bot: Increase coverage of 2025 Global Readers Survey (non-enwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/1214494 (https://phabricator.wikimedia.org/T410918) (owner: DDesouza)
[21:16:34] <wikibugs>	 (Merged) jenkins-bot: OATHAuth: Expand 2FA to all users [mediawiki-config] - https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664) (owner: Mstyles)
[21:17:04] <logmsgbot>	 !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1214494|Increase coverage of 2025 Global Readers Survey (non-enwiki) (T410918)]], [[gerrit:1213585|OATHAuth: Expand 2FA to all users (T399664)]]
[21:17:08] <stashbot>	 T410918: Deploy 2025 Global Readers Surveys (non-English) - https://phabricator.wikimedia.org/T410918
[21:17:09] <stashbot>	 T399664: Expand 2FA Opt-In Privileges - https://phabricator.wikimedia.org/T399664
[21:19:40] <logmsgbot>	 !log dani@deploy2002 dani, mstyles: Backport for [[gerrit:1214494|Increase coverage of 2025 Global Readers Survey (non-enwiki) (T410918)]], [[gerrit:1213585|OATHAuth: Expand 2FA to all users (T399664)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:20:11] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:20:12] <danisztls>	 maryum: can you test?
[21:20:21] <maryum>	 yes I can test now?
[21:20:28] <danisztls>	 maryum: yes
[21:20:36] <danisztls>	 AaronSchulz: sorry, I saw your message too late
[21:23:10] <wikibugs>	 (Abandoned) Andriy.v: Limit thanks for new users at uk.wikipedia to 3 per hour [mediawiki-config] - https://gerrit.wikimedia.org/r/1214631 (owner: Andriy.v)
[21:24:11] <danisztls>	 maryum: should I continue with sync?
[21:24:17] <maryum>	 yes please
[21:24:22] <logmsgbot>	 !log dani@deploy2002 dani, mstyles: Continuing with sync
[21:28:21] <logmsgbot>	 !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214494|Increase coverage of 2025 Global Readers Survey (non-enwiki) (T410918)]], [[gerrit:1213585|OATHAuth: Expand 2FA to all users (T399664)]] (duration: 11m 18s)
[21:28:26] <stashbot>	 T410918: Deploy 2025 Global Readers Surveys (non-English) - https://phabricator.wikimedia.org/T410918
[21:28:26] <stashbot>	 T399664: Expand 2FA Opt-In Privileges - https://phabricator.wikimedia.org/T399664
[21:28:46] <danisztls>	 xSavitar: all yours
[21:28:57] <xSavitar>	 Thanks!
[21:29:45] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11430447 (andrea.denisse)
[21:29:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by derick@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1214620 (https://phabricator.wikimedia.org/T410652) (owner: D3r1ck01)
[21:29:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by derick@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1214621 (https://phabricator.wikimedia.org/T410652) (owner: D3r1ck01)
[21:30:48] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11430448 (andrea.denisse)
[21:30:59] <xSavitar>	 AaronSchulz, I can ping you once I'm done. Sounds good?
[21:31:07] <AaronSchulz>	 ok
[21:31:55] <wikibugs>	 ops-eqiad, DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T411684 (phaultfinder) NEW
[21:32:12] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising-Backlog: Requesting access to druid pageviews_hourly for astein - https://phabricator.wikimedia.org/T411679#11430460 (andrea.denisse) Open→In progress p:Triage→High a:andrea.denisse
[21:32:42] <maryum>	 danisztls thanks so much!
[21:34:18] <wikibugs>	 (Merged) jenkins-bot: User: Log where the data was loaded when CAS update failed [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1214620 (https://phabricator.wikimedia.org/T410652) (owner: D3r1ck01)
[21:34:24] <wikibugs>	 (Merged) jenkins-bot: User: Log where the data was loaded when CAS update failed [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1214621 (https://phabricator.wikimedia.org/T410652) (owner: D3r1ck01)
[21:34:59] <logmsgbot>	 !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1214620|User: Log where the data was loaded when CAS update failed (T410652)]], [[gerrit:1214621|User: Log where the data was loaded when CAS update failed (T410652)]]
[21:35:02] <stashbot>	 T410652: RuntimeException: CAS update failed on user_touched. The version of the user to be saved is older than the current version. - https://phabricator.wikimedia.org/T410652
[21:35:11] <jinxer-wm>	 FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage  - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[21:37:49] <logmsgbot>	 !log derick@deploy2002 derick, d3r1ck01: Backport for [[gerrit:1214620|User: Log where the data was loaded when CAS update failed (T410652)]], [[gerrit:1214621|User: Log where the data was loaded when CAS update failed (T410652)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:38:29] <xSavitar>	 Nothing to verify, will monitor logs after deployment
[21:38:33] <logmsgbot>	 !log derick@deploy2002 derick, d3r1ck01: Continuing with sync
[21:40:27] <wikibugs>	 (PS2) Bking: opensearch-operator: watch the correct namespaces in CODFW [deployment-charts] - https://gerrit.wikimedia.org/r/1214562 (https://phabricator.wikimedia.org/T410956)
[21:41:30] <wikibugs>	 (CR) Bking: [C:+2] opensearch-operator: watch the correct namespaces in CODFW [deployment-charts] - https://gerrit.wikimedia.org/r/1214562 (https://phabricator.wikimedia.org/T410956) (owner: Bking)
[21:42:32] <logmsgbot>	 !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214620|User: Log where the data was loaded when CAS update failed (T410652)]], [[gerrit:1214621|User: Log where the data was loaded when CAS update failed (T410652)]] (duration: 07m 33s)
[21:42:35] <stashbot>	 T410652: RuntimeException: CAS update failed on user_touched. The version of the user to be saved is older than the current version. - https://phabricator.wikimedia.org/T410652
[21:42:54] <xSavitar>	 AaronSchulz, over to you. I'm done!
[21:43:15] * AaronSchulz goes
[21:43:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by aaron@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1214142 (https://phabricator.wikimedia.org/T411517) (owner: Aaron Schulz)
[21:44:38] <wikibugs>	 (Merged) jenkins-bot: Update Math API title and project-specific /math/ endpoint stability policy [mediawiki-config] - https://gerrit.wikimedia.org/r/1214142 (https://phabricator.wikimedia.org/T411517) (owner: Aaron Schulz)
[21:45:10] <logmsgbot>	 !log aaron@deploy2002 Started scap sync-world: Backport for [[gerrit:1214142|Update Math API title and project-specific /math/ endpoint stability policy (T411517)]]
[21:45:13] <stashbot>	 T411517: Clean up Math API OpenAPI specs and remove data-parsoid route specs - https://phabricator.wikimedia.org/T411517
[21:47:26] <logmsgbot>	 !log aaron@deploy2002 aaron: Backport for [[gerrit:1214142|Update Math API title and project-specific /math/ endpoint stability policy (T411517)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:48:41] <wikibugs>	 (Merged) jenkins-bot: opensearch-operator: watch the correct namespaces in CODFW [deployment-charts] - https://gerrit.wikimedia.org/r/1214562 (https://phabricator.wikimedia.org/T410956) (owner: Bking)
[21:48:54] <wikibugs>	 (PS1) Majavah: dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656
[21:49:23] <wikibugs>	 (CR) CI reject: [V:-1] dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656 (owner: Majavah)
[21:49:34] <logmsgbot>	 !log aaron@deploy2002 aaron: Continuing with sync
[21:50:10] <wikibugs>	 (PS2) Majavah: dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656
[21:52:15] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1214656 (owner: Majavah)
[21:53:35] <logmsgbot>	 !log aaron@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214142|Update Math API title and project-specific /math/ endpoint stability policy (T411517)]] (duration: 08m 25s)
[21:53:38] <stashbot>	 T411517: Clean up Math API OpenAPI specs and remove data-parsoid route specs - https://phabricator.wikimedia.org/T411517
[21:54:01] <wikibugs>	 (PS3) Majavah: dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656
[21:55:45] <AaronSchulz>	 done
[21:57:34] <wikibugs>	 (PS1) Majavah: Revert "pdns-recursor: use yaml-based config in eqiad1" [puppet] - https://gerrit.wikimedia.org/r/1214657
[21:58:53] <wikibugs>	 (PS2) Majavah: Revert "pdns-recursor: use yaml-based config in eqiad1" [puppet] - https://gerrit.wikimedia.org/r/1214657
[21:59:45] <wikibugs>	 (CR) Majavah: [C:+2] Revert "pdns-recursor: use yaml-based config in eqiad1" [puppet] - https://gerrit.wikimedia.org/r/1214657 (owner: Majavah)
[22:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T2200)
[22:02:11] <wikibugs>	 (PS1) Mstyles: OATHAuth: Remove wmgOATHAuthDisableRight [mediawiki-config] - https://gerrit.wikimedia.org/r/1214659 (https://phabricator.wikimedia.org/T399664)
[22:03:20] <wikibugs>	 (PS2) Mstyles: OATHAuth: Remove wmgOATHAuthDisableRight [mediawiki-config] - https://gerrit.wikimedia.org/r/1214659 (https://phabricator.wikimedia.org/T399664)
[22:04:22] <wikibugs>	 (PS1) Majavah: Revert^2 "pdns-recursor: use yaml-based config in eqiad1" [puppet] - https://gerrit.wikimedia.org/r/1214660
[22:05:53] <wikibugs>	 (CR) Majavah: [C:+2] Revert^2 "pdns-recursor: use yaml-based config in eqiad1" [puppet] - https://gerrit.wikimedia.org/r/1214660 (owner: Majavah)
[22:07:25] <wikibugs>	 (PS4) Majavah: dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656
[22:08:07] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising-Backlog, Patch-For-Review: Requesting access to druid pageviews_hourly for astein - https://phabricator.wikimedia.org/T411679#11430945 (AStein-WMF) also tagging in @BTullis
[22:08:31] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[22:09:09] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[22:11:46] <wikibugs>	 (PS5) Majavah: dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656
[22:13:14] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
[22:13:42] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1214656 (owner: Majavah)
[22:14:01] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
[22:14:19] <wikibugs>	 (PS3) Bking: opensearch-ipoid-test: Add environment-specific values files for TLS/ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1213586 (https://phabricator.wikimedia.org/T410956)
[22:14:31] <wikibugs>	 (CR) Bking: [C:+2] opensearch-ipoid-test: Add environment-specific values files for TLS/ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1213586 (https://phabricator.wikimedia.org/T410956) (owner: Bking)
[22:14:33] <wikibugs>	 (CR) Catrope: [C:+1] OATHAuth: Remove wmgOATHAuthDisableRight [mediawiki-config] - https://gerrit.wikimedia.org/r/1214659 (https://phabricator.wikimedia.org/T399664) (owner: Mstyles)
[22:16:10] <wikibugs>	 (Merged) jenkins-bot: opensearch-ipoid-test: Add environment-specific values files for TLS/ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1213586 (https://phabricator.wikimedia.org/T410956) (owner: Bking)
[22:16:42] <wikibugs>	 (CR) Andrew Bogott: dnsrecursor: Use additional_forward_zones with new config format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1214656 (owner: Majavah)
[22:17:30] <wikibugs>	 (PS6) Majavah: dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656 (https://phabricator.wikimedia.org/T381608)
[22:18:08] <wikibugs>	 (CR) Majavah: dnsrecursor: Use additional_forward_zones with new config format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1214656 (https://phabricator.wikimedia.org/T381608) (owner: Majavah)
[22:19:28] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1214656 (https://phabricator.wikimedia.org/T381608) (owner: Majavah)
[22:21:11] <wikibugs>	 (CR) Andrew Bogott: [C:+1] dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656 (https://phabricator.wikimedia.org/T381608) (owner: Majavah)
[22:21:20] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656 (https://phabricator.wikimedia.org/T381608) (owner: Majavah)
[22:25:57] <wikibugs>	 (CR) JHathaway: firewall: Declare resources for both providers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: Majavah)
[22:31:27] <wikibugs>	 (PS1) Ryan Kemper: hadoop.reboot-workers: make host override smarter [cookbooks] - https://gerrit.wikimedia.org/r/1214664 (https://phabricator.wikimedia.org/T411568)
[22:33:15] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
[22:33:18] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
[22:36:35] <wikibugs>	 SRE-Access-Requests: Add FIDO-backed SSH key for brennen - https://phabricator.wikimedia.org/T411730 (brennen) NEW
[22:37:02] <wikibugs>	 (PS1) Brennen Bearnes: admin: add fido backed ssh key for brennen [puppet] - https://gerrit.wikimedia.org/r/1214665 (https://phabricator.wikimedia.org/T411730)
[22:47:19] <wikibugs>	 (CR) Andrew Bogott: [C:+1] openstack: puppet: Remove support for X-Enc-Edit-Git [puppet] - https://gerrit.wikimedia.org/r/1214490 (owner: Majavah)
[22:48:07] <wikibugs>	 (PS5) Andrea Denisse: Add astein to analytics-privatedata-users. [puppet] - https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679)
[22:50:54] <mutante>	 !log maintenance on https://codesearch.wmcloud.org/ - trying to fix disk space issue
[22:50:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:10] <mutante>	 !log maintenance on https://codesearch.wmcloud.org/ - trying to fix disk space issue - detaching volume to extend it
[22:51:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:43] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising-Backlog, Patch-For-Review: Requesting access to druid pageviews_hourly for astein - https://phabricator.wikimedia.org/T411679#11431064 (Ahoelzl) Approved.
[22:54:55] <wikibugs>	 (CR) RLazarus: [C:+1] Add astein to analytics-privatedata-users. [puppet] - https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: Andrea Denisse)
[22:55:04] <wikibugs>	 (CR) Dzahn: [C:-1] "I don't think this needs shell access - it sounds like it's all about access to private data on dashboards - so that is level 1" [puppet] - https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: Andrea Denisse)
[22:55:11] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:55:28] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising-Backlog, Patch-For-Review: Requesting access to analytics-privatedata-users for astein - https://phabricator.wikimedia.org/T411679#11431069 (Novem_Linguae)
[22:55:46] <wikibugs>	 ops-eqiad, DC-Ops: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731 (Jhancock.wm) NEW
[22:56:24] <wikibugs>	 (CR) Dzahn: [C:-1] "you can copy from one of the existing users with the line "    ssh_keys: []  # Added with no SSH access, for membership in analytics-priva" [puppet] - https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: Andrea Denisse)
[22:57:45] <wikibugs>	 (CR) Dzahn: [C:-1] "generally yet another case of https://phabricator.wikimedia.org/T405517" [puppet] - https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: Andrea Denisse)
[22:58:56] <wikibugs>	 (PS6) Andrea Denisse: Add astein to analytics-privatedata-users. [puppet] - https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679)
[23:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T2300)
[23:01:56] <wikibugs>	 (CR) Dzahn: [C:+1] "yea, this should fix access on dashboards" [puppet] - https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: Andrea Denisse)
[23:01:59] <wikibugs>	 SRE, SRE-Access-Requests, Fundraising-Backlog, Patch-For-Review: Requesting access to analytics-privatedata-users for astein - https://phabricator.wikimedia.org/T411679#11431105 (Dzahn)
[23:02:00] <wikibugs>	 SRE, Data-Platform-SRE: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11431106 (Dzahn)
[23:02:33] <wikibugs>	 (CR) Andrea Denisse: "Thanks Daniel, I've updated the patch." [puppet] - https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: Andrea Denisse)
[23:02:34] <wikibugs>	 (CR) Btullis: [C:+1] "As I understand it, the user requires shell level access to query druid programmatically from a stat host." [puppet] - https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: Andrea Denisse)
[23:04:19] <wikibugs>	 ops-eqiad, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11431113 (bking)
[23:05:05] <wikibugs>	 (PS7) Andrea Denisse: Add astein to analytics-privatedata-users. [puppet] - https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679)
[23:05:11] <wikibugs>	 (CR) Jasmine: [C:+1] "Looks good, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1213604 (owner: RLazarus)
[23:06:06] <wikibugs>	 (CR) Dzahn: "well, the ticket asks for " Specifically, i'm trying to programmatically access the data in this turnilo dash " and stat hosts or other th" [puppet] - https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: Andrea Denisse)
[23:08:20] <Amir1>	 !log hard rebooting codesearch9.codesearch.eqiad1.wikimedia.cloud (T411728)
[23:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:23] <stashbot>	 T411728: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728
[23:12:28] <wikibugs>	 SRE, Data-Platform-SRE: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11431123 (Dzahn) More examples:  T411679 - requestor actively says they don't know the level - request gets approved regardless - discussion on actual code review if shell a...
[23:13:26] <mutante>	 Amir1: if that service is down that is a good thing :P
[23:13:43] <Amir1>	 I'll step back, have fun :P
[23:13:46] <mutante>	 Amir1: sounds weird. lol.. what I mean is.. I wanted to unmount the volume basically when that ticket came in
[23:13:49] <Amir1>	 let me know if I can help on anything
[23:14:01] <Amir1>	 yeah, I know, don't worry
[23:14:10] <mutante>	 have to unmount the volume to resize it
[23:14:32] <mutante>	 and have to resize it because there is still not enough space to make a larger volume than the existing one
[23:14:39] <mutante>	 without asking for more quota yet another time
[23:14:49] <mutante>	 I hate the resizefs part though
[23:15:24] <mutante>	 Amir1: rebooting it was great though, needed that :)
[23:19:01] <mutante>	 ok, shutting it down again for maintenance action..
[23:19:38] <wikibugs>	 (PS1) CDanis: support.arraynetworks.net should not trigger error alerts [puppet] - https://gerrit.wikimedia.org/r/1214670 (owner: Jdlrobson)
[23:26:10] <mutante>	 Amir1: this was actually less painful and time consuming than expected. doubled the size of /srv/ and with it there are plenty of inodes now. until next time
[23:26:28] <mutante>	 (only possible after the quota request)
[23:30:11] <jinxer-wm>	 FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:31:33] <wikibugs>	 (PS3) Mstyles: OATHAuth: Remove wmgOATHAuthDisableRight [mediawiki-config] - https://gerrit.wikimedia.org/r/1214659 (https://phabricator.wikimedia.org/T399664)
[23:44:01] <wikibugs>	 SRE, envoy, serviceops: Upgrade Envoy to v1.35.6 - https://phabricator.wikimedia.org/T410975#11431189 (RLazarus) Envoy 1.35.7 is about to come out, with security fixes: https://groups.google.com/g/envoy-announce/c/zr2OzwmJFqY  None of these issues affect us urgently, but since we're early in the 1.35...
[23:44:18] <wikibugs>	 SRE, envoy, serviceops: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975#11431190 (RLazarus)
[23:54:39] <wikibugs>	 (CR) RLazarus: [C:+2] deployment_server: Write Envoy hieradata to YAML files for sophroid [puppet] - https://gerrit.wikimedia.org/r/1213604 (owner: RLazarus)
[23:54:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:55:06] <Amir1>	 mutante: Thank you!
[23:55:55] <mutante>	 :)
[23:59:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency