[00:00:47] (Merged) jenkins-bot: Close crwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1214056 (https://phabricator.wikimedia.org/T411501) (owner: Zabe)
[00:01:22] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1214056|Close crwiki (T411501)]]
[00:01:26] T411501: Close crwiki and klwiki - https://phabricator.wikimedia.org/T411501
[00:01:33] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2217.codfw.wmnet with reason: Maintenance
[00:01:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2217 (T410589)', diff saved to https://phabricator.wikimedia.org/P86338 and previous config saved to /var/cache/conftool/dbconfig/20251203-000140-ladsgroup.json
[00:01:44] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[00:04:10] !log zabe@deploy2002 zabe: Backport for [[gerrit:1214056|Close crwiki (T411501)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:05:16] !log zabe@deploy2002 zabe: Continuing with sync
[00:05:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:21] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214056|Close crwiki (T411501)]] (duration: 07m 59s)
[00:09:24] T411501: Close crwiki and klwiki - https://phabricator.wikimedia.org/T411501
[00:13:25] (PS4) Zabe: Close klwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1214057 (https://phabricator.wikimedia.org/T411501)
[00:15:34] (CR) Zabe: [C:+2] Close klwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1214057 (https://phabricator.wikimedia.org/T411501) (owner: Zabe)
[00:16:22] (Merged) jenkins-bot: Close klwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1214057 (https://phabricator.wikimedia.org/T411501) (owner: Zabe)
[00:17:16] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1214057|Close klwiki (T411501)]]
[00:17:20] T411501: Close crwiki and klwiki - https://phabricator.wikimedia.org/T411501
[00:17:53] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981#11426536 (Dwisehaupt) @Jhancock.wm Yes I did and I can get in. I just checked again and it looks like something may have been missed. I see the host in netbox only has a mana...
[00:19:40] !log zabe@deploy2002 zabe: Backport for [[gerrit:1214057|Close klwiki (T411501)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:20:41] !log zabe@deploy2002 zabe: Continuing with sync
[00:24:45] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214057|Close klwiki (T411501)]] (duration: 07m 29s)
[00:24:48] T411501: Close crwiki and klwiki - https://phabricator.wikimedia.org/T411501
[00:26:12] (PS1) TrainBranchBot: group0 to 1.46.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1214183 (https://phabricator.wikimedia.org/T408275)
[00:26:15] (CR) TrainBranchBot: [C:+2] "Initiated by zabe@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1214183 (https://phabricator.wikimedia.org/T408275) (owner: TrainBranchBot)
[00:26:37] ^ moving crwiki and klwiki to wmf.5
[00:27:08] (Merged) jenkins-bot: group0 to 1.46.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1214183 (https://phabricator.wikimedia.org/T408275) (owner: TrainBranchBot)
[00:33:31] !log zabe@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.5 refs T408275
[00:33:35] T408275: 1.46.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T408275
[00:35:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[00:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:40:14] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1214190
[00:40:14] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1214190
(owner: TrainBranchBot)
[00:45:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[00:48:46] (CR) Scott French: [C:+1] "Thanks, Reuven!" [puppet] - https://gerrit.wikimedia.org/r/1213604 (owner: RLazarus)
[00:50:48] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1006.eqiad.wmnet with OS trixie
[00:52:53] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1214190 (owner: TrainBranchBot)
[00:58:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:01:00] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:03:48] (CR) Ssingh: [C:+1] "There is no harm in merging them before the other stuff is done but there isn't much value either. We can decide tomorrow!" [dns] - https://gerrit.wikimedia.org/r/1214177 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[01:04:49] (CR) Ssingh: [C:+1] "Thanks for the patch! This should go in at the very end of the changes in a way.
(I will upload a related patch for actually adding this n" [cookbooks] - https://gerrit.wikimedia.org/r/1214179 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[01:08:56] (PS1) Ssingh: conftool-data: geodns: add gerrit-addrs [puppet] - https://gerrit.wikimedia.org/r/1214192 (https://phabricator.wikimedia.org/T365259)
[01:09:44] SRE, Data-Platform-SRE (2025.11.07 - 2025.11.28): October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568 (RKemper) NEW
[01:10:15] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1214193
[01:10:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1214193 (owner: TrainBranchBot)
[01:14:31] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 30s)
[01:18:24] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster
[01:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:21:25] ryankemper@cumin2002 reboot-workers (PID 1626431) is awaiting input
[01:21:42] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster
[01:23:02] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster
[01:23:35] SRE, Data-Platform-SRE (2025.11.07 - 2025.11.28): October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11426647 (RKemper) an-worker* reboots ongoing now
[01:24:47] (PS5) RLazarus:
deployment_server: Write Envoy hieradata to YAML files for sophroid [puppet] - https://gerrit.wikimedia.org/r/1213604
[01:25:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[01:25:23] (CR) RLazarus: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1213604 (owner: RLazarus)
[01:33:25] (PS1) Sbisson: Update rec-api to 2025-12-02-200719-production [deployment-charts] - https://gerrit.wikimedia.org/r/1214195
[01:35:00] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[01:36:12] (CR) RLazarus: "Summarizing what we discussed after that:" [puppet] - https://gerrit.wikimedia.org/r/1213604 (owner: RLazarus)
[01:37:02] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1214193 (owner: TrainBranchBot)
[01:39:48] (CR) CDanis: [C:+1] conftool-data: geodns: add gerrit-addrs [puppet] - https://gerrit.wikimedia.org/r/1214192 (https://phabricator.wikimedia.org/T365259) (owner: Ssingh)
[01:43:44] andrew@cumin2002 reimage (PID 1612025) is awaiting input
[02:05:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:05:25] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1006.eqiad.wmnet with OS trixie
[02:13:26] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host
cloudcontrol1006.eqiad.wmnet with OS trixie
[02:27:24] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1006.eqiad.wmnet with reason: host reimage
[02:27:31] (PS4) Krinkle: varnish: Move error message from footer to body for HTTP 4xx responses [puppet] - https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T401489)
[02:27:33] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T401489) (owner: Krinkle)
[02:28:07] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T401489) (owner: Krinkle)
[02:28:37] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.073 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:37] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.074 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:37] PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.130 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:39] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.065 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:39] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.066 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:41] PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.077 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:43] PROBLEM - Swift https frontend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1
503 Service Unavailable - 260 bytes in 1.155 second response time https://wikitech.wikimedia.org/wiki/Swift [02:28:43] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1011.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1009.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1016.eqiad.wmnet, ms-fe1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:28:43] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.066 second response time https://wikitech.wikimedia.org/wiki/Swift [02:28:45] PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:28:45] PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:28:45] PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:28:47] PROBLEM - Swift https backend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:28:47] PROBLEM - Swift https backend on ms-fe1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:28:47] PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:28:53] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:28:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:29:05] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:05] PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.095 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:05] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.113 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:07] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.275 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:14] woah [02:29:17] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1018.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1009.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [02:29:23] !incidents [02:29:23] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.095 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:23] 7079 (UNACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [02:29:24] 7078 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr3-eqsin.wikimedia.org) [02:29:24] 7077 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org) [02:29:24] 7072 (RESOLVED) [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi) [02:29:24] 7074 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) 
[02:29:24] 7073 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [02:29:25] 7075 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: cr2-eqiad:xe-1/0/1:1 {#180180823000240:1} xe-1/0/1:1 gnmi eqiad) [02:29:27] !ack 7079 [02:29:28] !incidents [02:29:28] 7079 (ACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [02:29:28] 7079 (ACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [02:29:28] 7078 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr3-eqsin.wikimedia.org) [02:29:29] 7077 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org) [02:29:29] 7072 (RESOLVED) [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi) [02:29:29] 7074 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [02:29:29] 7073 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [02:29:35] RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:35] RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:35] RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:37] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.653 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:37] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:37] PROBLEM - Swift 
https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.189 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:37] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.200 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:37] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.191 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:37] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.201 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:37] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.255 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:38] RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:38] RECOVERY - Swift https backend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:39] RECOVERY - Swift https backend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:39] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:40] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 1.622 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:43] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.067 second response time 
https://wikitech.wikimedia.org/wiki/Swift [02:29:45] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:29:45] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:29:45] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:29:45] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:29:45] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:29:45] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:29:45] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:29:46] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:29:46] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:29:47] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:29:47] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [02:30:12] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - 
https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [02:30:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:30:17] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:30:21] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:27] !incidents [02:30:28] 7079 (ACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [02:30:28] 7080 (UNACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [02:30:28] 7081 (UNACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [02:30:28] 7078 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr3-eqsin.wikimedia.org) [02:30:28] 7077 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org) [02:30:28] !ack 7080 [02:30:29] 7072 (RESOLVED) [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi) [02:30:29] 7074 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [02:30:29] 7073 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [02:30:29] 7075 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: cr2-eqiad:xe-1/0/1:1 {#180180823000240:1} xe-1/0/1:1 gnmi eqiad) [02:30:30] 7080 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [02:30:35] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:37] 
RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:37] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:37] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:37] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:37] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:37] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:37] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:38] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:38] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:39] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.211 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:39] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.234 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:40] !ack 7081 [02:30:40] 7081 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [02:30:40] 
RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:40] RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.323 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:41] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.340 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:41] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.509 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:42] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.632 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:42] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.213 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:43] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:43] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.559 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:44] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.656 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:44] RECOVERY - Swift https backend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.649 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:45] RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Swift [02:30:45] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 1.189 second response 
time https://wikitech.wikimedia.org/wiki/Swift
[02:30:46] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[02:33:57] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:34:42] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1006.eqiad.wmnet with reason: host reimage
[02:34:51] FIRING: [2x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation
[02:35:12] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[02:35:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:35:27] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:36:06] (CR) Krinkle: "Using the example of:" [puppet] - https://gerrit.wikimedia.org/r/1214156
(https://phabricator.wikimedia.org/T401489) (owner: Krinkle)
[02:36:46] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:36:46] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[02:40:14] (PS5) Krinkle: varnish: Move error message from footer to body for HTTP 4xx responses [puppet] - https://gerrit.wikimedia.org/r/1214156 (https://phabricator.wikimedia.org/T401489)
[02:40:20] (PS2) Krinkle: robots.php: Clean up unused site, lang, and x-subdomain [mediawiki-config] - https://gerrit.wikimedia.org/r/1201740 (https://phabricator.wikimedia.org/T407122)
[02:41:21] (PS2) Krinkle: robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow [mediawiki-config] - https://gerrit.wikimedia.org/r/1214150
[02:41:46] FIRING: [3x] Primary outbound port utilisation over 80% #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[02:44:51] FIRING: [2x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation
[02:45:51] !incidents
[02:45:51] 7082 (ACKED) [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi)
[02:45:52] 7083 (ACKED) Primary outbound port utilisation over
80% (paged) network noc (cr1-eqiad.wikimedia.org) [02:45:52] 7084 (ACKED) Primary inbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org) [02:45:52] 7081 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [02:45:52] 7080 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [02:45:52] 7079 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [02:45:53] 7078 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr3-eqsin.wikimedia.org) [02:45:53] 7077 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org) [02:45:53] 7072 (RESOLVED) [2x] TransitPeeringTransportOutboundSaturation network sre (gnmi) [02:45:54] 7074 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [02:45:54] 7073 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-magru.wikimedia.org) [02:45:55] 7075 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: cr2-eqiad:xe-1/0/1:1 {#180180823000240:1} xe-1/0/1:1 gnmi eqiad) [02:46:46] FIRING: [3x] Primary outbound port utilisation over 80% #page: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:46:46] RESOLVED: [2x] Primary inbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [02:49:51] RESOLVED: [2x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [02:50:27] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Swift [02:51:34] (03PS2) 10Krinkle: Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214148 (https://phabricator.wikimedia.org/T400023) [02:51:46] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:54:26] (03PS3) 10Krinkle: Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214148 (https://phabricator.wikimedia.org/T400023) [02:54:27] (03PS2) 10Krinkle: robots.txt: Clean up inline comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214149 [02:54:27] (03PS3) 10Krinkle: robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214150 [02:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:55:27] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Swift [02:57:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201740 (https://phabricator.wikimedia.org/T407122) (owner: 10Krinkle) [02:57:41] 
(03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214148 (https://phabricator.wikimedia.org/T400023) (owner: 10Krinkle) [02:57:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214149 (owner: 10Krinkle) [02:57:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214150 (owner: 10Krinkle) [02:58:33] (03Merged) 10jenkins-bot: robots.php: Clean up unused site, lang, and x-subdomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201740 (https://phabricator.wikimedia.org/T407122) (owner: 10Krinkle) [02:58:35] (03Merged) 10jenkins-bot: Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214148 (https://phabricator.wikimedia.org/T400023) (owner: 10Krinkle) [02:58:38] (03Merged) 10jenkins-bot: robots.txt: Clean up inline comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214149 (owner: 10Krinkle) [02:58:39] (03Merged) 10jenkins-bot: robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214150 (owner: 10Krinkle) [02:59:36] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1201740|robots.php: Clean up unused site, lang, and x-subdomain (T407122)]], [[gerrit:1214148|Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google (T400023)]], [[gerrit:1214149|robots.txt: Clean up inline comments]], [[gerrit:1214150|robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow]] [02:59:41] T407122: [5.2.5 Milestone] Introduce API Gateway access controls on sitemap endpoints - https://phabricator.wikimedia.org/T407122 [02:59:41] T400023: Deploy sitemaps API for Commons - 
https://phabricator.wikimedia.org/T400023 [03:00:27] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Swift [03:02:25] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1201740|robots.php: Clean up unused site, lang, and x-subdomain (T407122)]], [[gerrit:1214148|Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google (T400023)]], [[gerrit:1214149|robots.txt: Clean up inline comments]], [[gerrit:1214150|robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [03:03:57] !log krinkle@deploy2002 krinkle: Continuing with sync [03:06:28] (03PS1) 10Krinkle: robots.php: Avoid "404 Not Found" for Sitemap rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214201 [03:08:02] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1201740|robots.php: Clean up unused site, lang, and x-subdomain (T407122)]], [[gerrit:1214148|Submit Commons sitemap to Bing/DuckDuckGo and remaining wikis to Google (T400023)]], [[gerrit:1214149|robots.txt: Clean up inline comments]], [[gerrit:1214150|robots.txt: Remove redundant "/wiki/Fundraising_2007/comments" disallow]] (duration: 08m 26s) [03:08:07] T407122: [5.2.5 Milestone] Introduce API Gateway access controls on sitemap endpoints - https://phabricator.wikimedia.org/T407122 [03:08:07] T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023 [03:08:36] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1006.eqiad.wmnet with OS trixie [03:09:14] (03PS2) 10Krinkle: robots.php: Avoid "404 Not Found" for Sitemap rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214201 (https://phabricator.wikimedia.org/T400023) [03:09:21] (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on 
follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214201 (https://phabricator.wikimedia.org/T400023) (owner: 10Krinkle) [03:11:00] (03CR) 10Krinkle: [C:03+1] Clean up db groups config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211664 (https://phabricator.wikimedia.org/T411088) (owner: 10Ladsgroup) [03:13:52] (03CR) 10TrainBranchBot: "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214201 (https://phabricator.wikimedia.org/T400023) (owner: 10Krinkle) [03:14:41] (03Merged) 10jenkins-bot: robots.php: Avoid "404 Not Found" for Sitemap rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214201 (https://phabricator.wikimedia.org/T400023) (owner: 10Krinkle) [03:15:12] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1214201|robots.php: Avoid "404 Not Found" for Sitemap rule (T400023)]] [03:15:16] T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023 [03:17:53] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1214201|robots.php: Avoid "404 Not Found" for Sitemap rule (T400023)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[03:22:19] !log krinkle@deploy2002 krinkle: Continuing with sync [03:26:20] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214201|robots.php: Avoid "404 Not Found" for Sitemap rule (T400023)]] (duration: 11m 08s) [03:26:24] T400023: Deploy sitemaps API for Commons - https://phabricator.wikimedia.org/T400023 [03:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:30:29] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1007.eqiad.wmnet with OS trixie [03:46:11] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1007.eqiad.wmnet with reason: host reimage [03:50:06] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1007.eqiad.wmnet with reason: host reimage [04:26:16] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1007.eqiad.wmnet with OS trixie [04:34:04] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1011.eqiad.wmnet with OS trixie [04:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [04:43:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [04:48:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login 
successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [04:53:22] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:54:40] (03PS1) 10Marostegui: installserver: Add db1169 to preseed [puppet] - 10https://gerrit.wikimedia.org/r/1214220 (https://phabricator.wikimedia.org/T411498) [04:55:05] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1011.eqiad.wmnet with reason: host reimage [04:57:16] (03CR) 10Marostegui: [C:03+2] installserver: Add db1169 to preseed [puppet] - 10https://gerrit.wikimedia.org/r/1214220 (https://phabricator.wikimedia.org/T411498) (owner: 10Marostegui) [04:57:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1223.eqiad.wmnet with reason: Maintenance [04:58:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:58:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86339 and previous config saved to /var/cache/conftool/dbconfig/20251203-045851-marostegui.json [04:58:56] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [04:58:56] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:59:40] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1011.eqiad.wmnet with reason: host reimage 
[05:01:24] (03PS1) 10Marostegui: installserver: Change db1169 order [puppet] - 10https://gerrit.wikimedia.org/r/1214225 [05:02:05] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:02:05] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:02:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:53] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:03:52] (03CR) 10Marostegui: [C:03+2] installserver: Change db1169 order [puppet] - 10https://gerrit.wikimedia.org/r/1214225 (owner: 10Marostegui) [05:10:00] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:13:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P86340 and previous config saved to /var/cache/conftool/dbconfig/20251203-051359-marostegui.json [05:15:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [05:18:00] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11426756 (10Marostegui) [05:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:20:25] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:20:50] (03PS1) 10Marostegui: installserver: Move new hosts to UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1214235 (https://phabricator.wikimedia.org/T411570) [05:21:55] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:21:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:22:43] (03CR) 10Marostegui: [C:03+2] installserver: Move new hosts to UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1214235 (https://phabricator.wikimedia.org/T411570) (owner: 10Marostegui) [05:25:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [05:27:43] !log Drop sockpuppet database T411527 [05:27:45] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:46] T411527: Remove sockpuppet database - https://phabricator.wikimedia.org/T411527 [05:29:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P86341 and previous config saved to /var/cache/conftool/dbconfig/20251203-052906-marostegui.json [05:29:50] (03PS1) 10Marostegui: mariadb: Remove sockpuppet database grants [puppet] - 10https://gerrit.wikimedia.org/r/1214237 (https://phabricator.wikimedia.org/T411527) [05:30:28] (03CR) 10Marostegui: "This is a NOOP as puppet does not control grants. The grants will have to be removed manually from the DB." [puppet] - 10https://gerrit.wikimedia.org/r/1214237 (https://phabricator.wikimedia.org/T411527) (owner: 10Marostegui) [05:31:33] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [05:32:54] (03CR) 10Marostegui: [C:03+2] mariadb: Remove sockpuppet database grants [puppet] - 10https://gerrit.wikimedia.org/r/1214237 (https://phabricator.wikimedia.org/T411527) (owner: 10Marostegui) [05:35:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:11] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [05:36:42] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcontrol1011.eqiad.wmnet with OS trixie [05:40:17] PROBLEM - Host an-worker1148 is DOWN: PING CRITICAL - Packet loss = 100% [05:41:19] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS trixie [05:44:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86342 and previous config saved to /var/cache/conftool/dbconfig/20251203-054414-marostegui.json [05:44:19] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [05:44:20] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [05:44:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2190.codfw.wmnet with reason: Maintenance [05:44:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86343 and previous config saved to /var/cache/conftool/dbconfig/20251203-054438-marostegui.json [05:52:26] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T410589)', diff saved to https://phabricator.wikimedia.org/P86344 and previous config saved to /var/cache/conftool/dbconfig/20251203-055226-ladsgroup.json [05:52:29] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [05:58:39] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db1169.eqiad.wmnet with reason: host reimage [06:05:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: host reimage [06:05:40] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:07:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P86345 and previous config saved to /var/cache/conftool/dbconfig/20251203-060734-ladsgroup.json [06:13:29] ryankemper@cumin2002 reboot-workers (PID 1628404) is awaiting input [06:15:00] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [06:15:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:15:17] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [06:15:33] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:16:33] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [06:20:44] (03PS2) 10KartikMistry: Update rec-api to 2025-12-02-200719-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214195 (https://phabricator.wikimedia.org/T408845) (owner: 10Sbisson) [06:22:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P86348 and previous config saved to /var/cache/conftool/dbconfig/20251203-062241-ladsgroup.json [06:23:34] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11426852 (10andrea.denisse) [06:24:13] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11426856 (10andrea.denisse) Hi @VPuffetMichel , do you approve this request? 
[06:26:32] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1169.eqiad.wmnet with OS trixie [06:26:49] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11426858 (10andrea.denisse) 05Open→03In progress Hi @KFrancis, I was unable to find @medelius on the NDA spreadsheet, could you please help me to confirm their NDA status? [06:27:27] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11426861 (10andrea.denisse) [06:29:23] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1169 - Depooling db1169 [06:29:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1169 - Depooling db1169 [06:31:37] (03PS1) 10Marostegui: db1169: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1214243 (https://phabricator.wikimedia.org/T411498) [06:32:09] (03CR) 10Marostegui: [C:03+2] db1169: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1214243 (https://phabricator.wikimedia.org/T411498) (owner: 10Marostegui) [06:35:24] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1169 gradually with 4 steps - Repooling db1169 [06:37:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T410589)', diff saved to https://phabricator.wikimedia.org/P86349 and previous config saved to /var/cache/conftool/dbconfig/20251203-063749-ladsgroup.json [06:37:53] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [06:38:06] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2224.codfw.wmnet with reason: Maintenance [06:38:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2224 (T410589)', diff saved to https://phabricator.wikimedia.org/P86350 and previous config saved to 
/var/cache/conftool/dbconfig/20251203-063812-ladsgroup.json [06:38:37] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11426873 (10andrea.denisse) 05Open→03In progress >>! In T411436#11425341, @SEgt-WMF wrote: > In case it is useful: the MediaWiki page @Rmaung pointed... [06:39:42] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1169 gradually with 4 steps - Repooling db1169 [06:40:44] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1169 gradually with 4 steps - Repooling db1169 [06:49:22] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#11426882 (10Marostegui) >>! In T360029#11425775, @Scott_French wrote: > Thanks for the heads-up, @Marostegui. Thank you for taking a look! >... [06:49:52] 06SRE, 10SRE-Access-Requests: Requesting update of SSH key for zoe - https://phabricator.wikimedia.org/T411506#11426883 (10andrea.denisse) 05Open→03In progress I wrote to Zoe directly to confirm this request. 
[06:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:56:46] !log ladsgroup@deploy2002:~$ mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 11 popups (T406724) [06:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:50] T406724: Clean up watchlist and user properties of users if they don't log in for certain time - https://phabricator.wikimedia.org/T406724 [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T0700) [07:01:07] !log ladsgroup@deploy2002:~$ mwscript-k8s --dblist=all -- purgeUserOptions.php --login-age 11 rememberpassword (T406724) [07:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:31] !log installing mako security updates [07:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:01] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [07:17:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:17:53] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:23:14] (03PS1) 10Kosta Harlan: ConfirmEdit: Grant skipcaptcha to 
bot user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214394 (https://phabricator.wikimedia.org/T411575) [07:26:15] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1169 gradually with 4 steps - Repooling db1169 [07:27:43] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:30:45] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#11426937 (10MoritzMuehlenhoff) [07:31:47] (03PS1) 10Muehlenhoff: Switch conf2006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214396 (https://phabricator.wikimedia.org/T349619) [07:32:02] (03Abandoned) 10Kosta Harlan: ConfirmEdit: Grant skipcaptcha to bot user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214394 (https://phabricator.wikimedia.org/T411575) (owner: 10Kosta Harlan) [07:32:43] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:32:53] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:35:51] (03PS7) 10Daniel Kinzler: api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107 [07:35:59] (03CR) 10CI reject: [V:04-1] api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107 (owner: 10Daniel Kinzler) [07:36:45] (03CR) 10Daniel Kinzler: [C:04-1] rest-gateway: extract Lua code for testability (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092 (owner: 10Daniel Kinzler) [07:37:43] FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:38:01] (03PS1) 10Muehlenhoff: Add a Cumin alias to allow teams to track missing UEFI migrations [puppet] - 10https://gerrit.wikimedia.org/r/1214398 [07:39:39] (03PS2) 10Muehlenhoff: Add a Cumin alias to allow teams to track missing UEFI migrations [puppet] - 10https://gerrit.wikimedia.org/r/1214398 [07:42:43] FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:47:20] !log installing libtpms security updates [07:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:43] FIRING: [20x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:52:42] (03PS1) 10Muehlenhoff: Add library hint for libtpms [puppet] - 10https://gerrit.wikimedia.org/r/1214401 [07:52:43] FIRING: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:52:53] FIRING: [16x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:56:23] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for libtpms [puppet] - 10https://gerrit.wikimedia.org/r/1214401 (owner: 10Muehlenhoff) [07:57:43] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:57:53] FIRING: [13x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:02:43] FIRING: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:03:43] (03CR) 10Brouberol: [C:03+2] Restore strict error handling [dumps] - 10https://gerrit.wikimedia.org/r/1207111 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon) [08:04:00] (03CR) 10Brouberol: [V:03+2] Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [08:04:11] (03Merged) 10jenkins-bot: Refactor moveLinkFile and putDumpChecksums [dumps] - 10https://gerrit.wikimedia.org/r/1204594 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [08:04:24] (03CR) 10Brouberol: [C:03+2] Add output-dir option to specify target directory for rdf dumps [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [08:04:33] (03CR) 10Brouberol: [C:03+2] Report integrity metric from Wikidata dump scripts [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [08:04:53] (03Merged) 10jenkins-bot: Add output-dir option to specify target directory for rdf dumps [dumps] - 10https://gerrit.wikimedia.org/r/1204595 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [08:04:55] (03Merged) 10jenkins-bot: Report integrity metric from Wikidata dump scripts [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [08:05:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:48] ryankemper@cumin2002 reboot-workers (PID 1628404) is awaiting input [08:07:43] RESOLVED: [7x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:10:43] (03CR) 10Jelto: [C:03+1] "I double checked the IPs discussed in T365259 and this looks good to me, thanks for preparing the patch" [dns] - 10https://gerrit.wikimedia.org/r/1214177 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [08:13:11] (03PS8) 10Daniel Kinzler: api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107 [08:13:19] !log installing python-zipp security updates [08:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:49] (03PS9) 10Daniel Kinzler: api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107 [08:16:56] (03PS2) 10Daniel Kinzler: rest-gateway: extract Lua code for testability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092 [08:22:55] (03CR) 10Slyngshede: [C:03+1] Add Hugh as approver for mw-log-readers and logstash-roots [puppet] - 10https://gerrit.wikimedia.org/r/1214078 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [08:28:49] (03CR) 10Jelto: [C:03+1] "lgtm, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1214192 (https://phabricator.wikimedia.org/T365259) (owner: 10Ssingh) [08:32:21] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster [08:34:13] (03CR) 10Ayounsi: [C:03+1] Add a Cumin alias to allow teams to track missing UEFI migrations [puppet] - 10https://gerrit.wikimedia.org/r/1214398 (owner: 10Muehlenhoff) [08:37:48] !log upgrade Envoy on schema* T405808 [08:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:51] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [08:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:44:23] (03PS1) 10Awight: Regenerate awight yubikey [puppet] - 10https://gerrit.wikimedia.org/r/1214448 [08:58:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:00:04] hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T0900) [09:00:54] !log ayounsi@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ganeti-test2001.codfw.wmnet with reason: test CR1207804 [09:11:41] i am running the train [09:12:07] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214452 (https://phabricator.wikimedia.org/T408275) [09:12:09] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214452 
(https://phabricator.wikimedia.org/T408275) (owner: 10TrainBranchBot) [09:12:59] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214452 (https://phabricator.wikimedia.org/T408275) (owner: 10TrainBranchBot) [09:14:11] !log ayounsi@cumin1003 START - Cookbook sre.hosts.provision for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:14:59] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:15:59] (03PS1) 10Jelto: service::catalog: add gerrit-https and gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) [09:15:59] (03PS1) 10Jelto: conftool-data: add tcp-proxy gerrit service [puppet] - 10https://gerrit.wikimedia.org/r/1214454 (https://phabricator.wikimedia.org/T365259) [09:19:20] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.5 refs T408275 [09:19:24] T408275: 1.46.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T408275 [09:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:35:11] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [09:44:22] (03PS1) 10Giuseppe Lavagetto: modules: add the conftool module [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1214460 [09:44:22] (03PS1) 10Giuseppe Lavagetto: modules: copy over app.generic:1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214461 [09:44:22] (03PS1) 10Giuseppe Lavagetto: app.generic: Add conftool volumes and volume mounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214462 [09:44:23] (03PS1) 10Giuseppe Lavagetto: charts/python-webapp: add conftool support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214463 [09:45:13] (03CR) 10JMeybohm: "LGTM, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214125 (owner: 10Clément Goubert) [09:54:35] (03CR) 10MonAx the Developer: "Limit thanks for new users at uk.wikipedia to 3 per hour" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214452 (https://phabricator.wikimedia.org/T408275) (owner: 10TrainBranchBot) [09:59:36] (03PS1) 10Fabfur: external_clouds_vendors: added ahrefsbot [puppet] - 10https://gerrit.wikimedia.org/r/1214465 [10:03:37] (03CR) 10Jcrespo: [C:03+1] mariadb: Remove sockpuppet database grants [puppet] - 10https://gerrit.wikimedia.org/r/1214237 (https://phabricator.wikimedia.org/T411527) (owner: 10Marostegui) [10:06:05] (03PS17) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [10:07:52] 06SRE, 06Infrastructure-Foundations: Broadcom Nic not supporting uefi with older firmware - https://phabricator.wikimedia.org/T411374#11427333 (10ayounsi) I was looking into that for the LLDP issue, here are some Redfish path that could be useful in that context : ` >>> spicerack.redfish('ganeti-test2001').mo... 
[10:08:52] 06SRE, 06Infrastructure-Foundations: Broadcom Nic not supporting uefi with older firmware - https://phabricator.wikimedia.org/T411374#11427337 (10Peachey88) [10:10:07] (03CR) 10Vgutierrez: [C:03+1] external_clouds_vendors: added ahrefsbot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214465 (owner: 10Fabfur) [10:10:07] (03CR) 10Blake: [C:03+2] alerting: Update severity of KafkaRollingRestartRequired to Task. [alerts] - 10https://gerrit.wikimedia.org/r/1212599 (https://phabricator.wikimedia.org/T410552) (owner: 10Blake) [10:11:32] (03PS4) 10Dpogorzelski: ml-build: define new machine name/type [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) [10:11:49] (03CR) 10Giuseppe Lavagetto: [C:04-2] "Please let's start using https://requestctl.wikimedia.org/ipblock_source for this stuff." [puppet] - 10https://gerrit.wikimedia.org/r/1214465 (owner: 10Fabfur) [10:12:17] (03PS1) 10Kosta Harlan: hCaptchaEditAttempt logging: Normalize line endings [extensions/WikimediaEvents] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214469 (https://phabricator.wikimedia.org/T411578) [10:12:25] (03PS5) 10Dpogorzelski: ml-build: define new machine name/type [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) [10:12:30] (03PS1) 10Kosta Harlan: hCaptchaEditAttempt logging: Normalize line endings [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214470 (https://phabricator.wikimedia.org/T411578) [10:13:14] hashar: can I sync a patch? 
seems like the train deploy finished a while ago [10:14:20] (03CR) 10CI reject: [V:04-1] ml-build: define new machine name/type [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) (owner: 10Dpogorzelski) [10:17:45] (03CR) 10Elukey: [C:03+2] Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 (owner: 10Elukey) [10:17:57] (03PS3) 10Elukey: sre.hosts.provision: make UEFI default [cookbooks] - 10https://gerrit.wikimedia.org/r/1214055 [10:20:18] (03Abandoned) 10Fabfur: external_clouds_vendors: added ahrefsbot [puppet] - 10https://gerrit.wikimedia.org/r/1214465 (owner: 10Fabfur) [10:24:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214469 (https://phabricator.wikimedia.org/T411578) (owner: 10Kosta Harlan) [10:24:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214470 (https://phabricator.wikimedia.org/T411578) (owner: 10Kosta Harlan) [10:25:57] (03Merged) 10jenkins-bot: hCaptchaEditAttempt logging: Normalize line endings [extensions/WikimediaEvents] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214469 (https://phabricator.wikimedia.org/T411578) (owner: 10Kosta Harlan) [10:26:17] (03Merged) 10jenkins-bot: hCaptchaEditAttempt logging: Normalize line endings [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214470 (https://phabricator.wikimedia.org/T411578) (owner: 10Kosta Harlan) [10:27:03] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1214469|hCaptchaEditAttempt logging: Normalize line endings (T411578)]], [[gerrit:1214470|hCaptchaEditAttempt logging: Normalize line endings (T411578)]] [10:27:06] T411578: hCaptcha edit attempt logs: Normalize line endings - 
https://phabricator.wikimedia.org/T411578 [10:29:08] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: make UEFI default [cookbooks] - 10https://gerrit.wikimedia.org/r/1214055 (owner: 10Elukey) [10:29:22] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1214469|hCaptchaEditAttempt logging: Normalize line endings (T411578)]], [[gerrit:1214470|hCaptchaEditAttempt logging: Normalize line endings (T411578)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:30:55] !log kharlan@deploy2002 kharlan: Continuing with sync [10:33:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86357 and previous config saved to /var/cache/conftool/dbconfig/20251203-103323-marostegui.json [10:33:29] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [10:33:29] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [10:34:59] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214469|hCaptchaEditAttempt logging: Normalize line endings (T411578)]], [[gerrit:1214470|hCaptchaEditAttempt logging: Normalize line endings (T411578)]] (duration: 07m 56s) [10:35:02] T411578: hCaptcha edit attempt logs: Normalize line endings - https://phabricator.wikimedia.org/T411578 [10:36:30] (03PS1) 10Filippo Giunchedi: service::catalog: add 'team' attribute [puppet] - 10https://gerrit.wikimedia.org/r/1214473 (https://phabricator.wikimedia.org/T399807) [10:38:33] (03PS2) 10Majavah: network: Remove unused cloud_nova_hosts_ranges variable [puppet] - 10https://gerrit.wikimedia.org/r/1214099 [10:38:33] (03PS1) 10Majavah: O:wmcs::cloudvps_meta: Basic role + web server skeleton [puppet] - 10https://gerrit.wikimedia.org/r/1214474 (https://phabricator.wikimedia.org/T411590) [10:38:35] (03PS1) 10Majavah: P:wmcs::cloudvps_meta: 
Publish Cloud VPS JSON files [puppet] - 10https://gerrit.wikimedia.org/r/1214475 (https://phabricator.wikimedia.org/T411590) [10:39:36] (03CR) 10CI reject: [V:04-1] P:wmcs::cloudvps_meta: Publish Cloud VPS JSON files [puppet] - 10https://gerrit.wikimedia.org/r/1214475 (https://phabricator.wikimedia.org/T411590) (owner: 10Majavah) [10:40:44] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: eqiad: cleanup Interface enabled but not connected alert - https://phabricator.wikimedia.org/T411194#11427434 (10ayounsi) 05Resolved→03Open Thanks ! Those are still alerting in eqiad : ge-0/0/0 /dcim/interfaces/37836/ Interface enabled but not connected on fa... [10:41:46] (03PS2) 10Majavah: P:wmcs::cloudvps_meta: Publish Cloud VPS JSON files [puppet] - 10https://gerrit.wikimedia.org/r/1214475 (https://phabricator.wikimedia.org/T411590) [10:48:08] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:48:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P86358 and previous config saved to /var/cache/conftool/dbconfig/20251203-104830-marostegui.json [10:49:38] (03PS1) 10Filippo Giunchedi: sre: multi-team ProbeDown [alerts] - 10https://gerrit.wikimedia.org/r/1214478 (https://phabricator.wikimedia.org/T399807) [10:53:05] (03CR) 10Filippo Giunchedi: [C:03+1] P:wmcs::cloudvps_meta: Publish Cloud VPS JSON files [puppet] - 10https://gerrit.wikimedia.org/r/1214475 (https://phabricator.wikimedia.org/T411590) (owner: 10Majavah) [10:53:13] (03PS2) 10Esanders: Set Flow to read-only everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213501 (https://phabricator.wikimedia.org/T402552) [10:53:36] !log elukey@cumin1003 START - Cookbook sre.hosts.powercycle for host sretest2001 [10:53:38] (03PS3) 10Esanders: Set 
Flow to read-only everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213501 (https://phabricator.wikimedia.org/T402552) [10:54:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213501 (https://phabricator.wikimedia.org/T402552) (owner: 10Esanders) [10:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:58:37] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host sretest2001 [10:59:31] (03PS2) 10Majavah: O:wmcs::cloudvps_meta: Basic role + web server skeleton [puppet] - 10https://gerrit.wikimedia.org/r/1214474 (https://phabricator.wikimedia.org/T411590) [10:59:31] (03PS3) 10Majavah: P:wmcs::cloudvps_meta: Publish Cloud VPS JSON files [puppet] - 10https://gerrit.wikimedia.org/r/1214475 (https://phabricator.wikimedia.org/T411590) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1100) [11:03:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P86359 and previous config saved to /var/cache/conftool/dbconfig/20251203-110338-marostegui.json [11:06:40] (03CR) 10Filippo Giunchedi: [C:03+1] O:wmcs::cloudvps_meta: Basic role + web server skeleton [puppet] - 10https://gerrit.wikimedia.org/r/1214474 (https://phabricator.wikimedia.org/T411590) (owner: 10Majavah) [11:06:53] (03CR) 10Majavah: [C:03+2] O:wmcs::cloudvps_meta: Basic role + web server skeleton [puppet] - 10https://gerrit.wikimedia.org/r/1214474 (https://phabricator.wikimedia.org/T411590) (owner: 10Majavah) [11:07:06] 
(03CR) 10Majavah: [C:03+2] P:wmcs::cloudvps_meta: Publish Cloud VPS JSON files [puppet] - 10https://gerrit.wikimedia.org/r/1214475 (https://phabricator.wikimedia.org/T411590) (owner: 10Majavah) [11:07:45] !log elukey@cumin1003 START - Cookbook sre.hosts.powercycle for host ml-serve1013 [11:08:09] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06serviceops-radar, 10Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088#11427590 (10JMonton-WMF) Another option: I worked in the past with https://github.com/devshawn/kafka-gitops to manage all topic set... [11:10:20] PROBLEM - Host ml-serve1013 is DOWN: PING CRITICAL - Packet loss = 100% [11:10:58] RECOVERY - Host ml-serve1013 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [11:12:05] (03PS1) 10Majavah: P:wmcs::cloudvps_meta: Set creationTime [puppet] - 10https://gerrit.wikimedia.org/r/1214481 (https://phabricator.wikimedia.org/T411590) [11:12:48] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host ml-serve1013 [11:13:56] (03CR) 10Majavah: [C:03+2] P:wmcs::cloudvps_meta: Set creationTime [puppet] - 10https://gerrit.wikimedia.org/r/1214481 (https://phabricator.wikimedia.org/T411590) (owner: 10Majavah) [11:15:08] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:15:20] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [11:18:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86360 and previous config saved to /var/cache/conftool/dbconfig/20251203-111846-marostegui.json [11:18:51] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [11:18:51] T411164: Drop rev_sha1 from revision table in 
wmf production - https://phabricator.wikimedia.org/T411164 [11:19:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2194.codfw.wmnet with reason: Maintenance [11:19:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86361 and previous config saved to /var/cache/conftool/dbconfig/20251203-111910-marostegui.json [11:23:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86362 and previous config saved to /var/cache/conftool/dbconfig/20251203-112345-marostegui.json [11:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:31:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:32:14] (03Abandoned) 10Awight: Monitoring for WMDE dumps scraper [puppet] - 10https://gerrit.wikimedia.org/r/1207174 (https://phabricator.wikimedia.org/T402613) (owner: 10Awight) [11:32:18] (03CR) 10Awight: "Thanks for the tip, that's exactly what we'll do! Pushgateway is also helpful for caching results after the run is complete." 
[puppet] - 10https://gerrit.wikimedia.org/r/1207174 (https://phabricator.wikimedia.org/T402613) (owner: 10Awight) [11:34:17] FIRING: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:35:41] (03PS1) 10Daniel Kinzler: rest gateway: simplify Lua code for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214486 [11:38:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:38:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P86363 and previous config saved to /var/cache/conftool/dbconfig/20251203-113853-marostegui.json [11:39:17] FIRING: [19x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:40:42] (03PS10) 10Daniel Kinzler: api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107 [11:40:59] (03PS3) 10Daniel Kinzler: rest-gateway: extract Lua code for testability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092 [11:41:11] (03PS4) 10Daniel Kinzler: rest-gateway: extract Lua code for testability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092 [11:41:22] (03PS2) 10Daniel Kinzler: rest gateway: 
simplify Lua code for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214486 [11:41:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:42:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:43:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:44:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:45:53] (03PS1) 10Elukey: sre.hosts.reimage: remove puppet 5 support and default to 7 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 [11:46:58] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:48:08] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:48:42] (03PS1) 10Tchanders: WIP 
Enable temporary accounts on enwikinews and ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214489 [11:50:31] (03CR) 10Clément Goubert: [C:03+1] api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107 (owner: 10Daniel Kinzler) [11:50:58] (03PS2) 10Elukey: sre.hosts.reimage: remove puppet 5 support and default to 7 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) [11:51:09] (03PS1) 10Majavah: openstack: puppet: Remove support for X-Enc-Edit-Git [puppet] - 10https://gerrit.wikimedia.org/r/1214490 [11:51:09] (03PS1) 10Majavah: openstack: puppet: Do not commit empty role fiels [puppet] - 10https://gerrit.wikimedia.org/r/1214491 [11:54:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P86364 and previous config saved to /var/cache/conftool/dbconfig/20251203-115401-marostegui.json [11:55:00] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:55:44] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: extract Lua code for testability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092 (owner: 10Daniel Kinzler) [11:58:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:59:12] (03CR) 10Clément Goubert: [C:03+1] rest gateway: simplify Lua code for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214486 (owner: 10Daniel Kinzler) [11:59:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes 
(http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:00:05] mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1200). [12:03:08] (03PS1) 10Btullis: Add kerberos related configuration to the spark-defaults.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214492 (https://phabricator.wikimedia.org/T406833) [12:04:17] FIRING: [16x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:05:26] (03CR) 10Btullis: [C:03+2] Add kerberos related configuration to the spark-defaults.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214492 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [12:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:51] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE) - https://phabricator.wikimedia.org/T411612 (10thiemowmde) 03NEW [12:06:50] (03Merged) 10jenkins-bot: Add kerberos related configuration to the spark-defaults.conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214492 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [12:08:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:08:05] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE) - https://phabricator.wikimedia.org/T411612#11427904 (10thiemowmde) [12:08:23] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE) - https://phabricator.wikimedia.org/T411612#11427905 (10Tobi_WMDE_SW) As the Engineering Manager of the team Thiemo works on, I support this request. [12:09:00] (03PS1) 10Majavah: P:toolforge: static: Publish worker IPs as a JSON file [puppet] - 10https://gerrit.wikimedia.org/r/1214493 (https://phabricator.wikimedia.org/T411610) [12:09:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86365 and previous config saved to /var/cache/conftool/dbconfig/20251203-120909-marostegui.json [12:09:14] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [12:09:14] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [12:09:17] FIRING: [16x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:09:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2209.codfw.wmnet with reason: Maintenance [12:09:32] (03PS1) 10DDesouza: Increase coverage of 2025 Global Readers Survey (non-enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214494 (https://phabricator.wikimedia.org/T410918) [12:09:33] 
!log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2209 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86366 and previous config saved to /var/cache/conftool/dbconfig/20251203-120933-marostegui.json [12:10:05] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7777/co" [puppet] - 10https://gerrit.wikimedia.org/r/1214493 (https://phabricator.wikimedia.org/T411610) (owner: 10Majavah) [12:10:06] (03CR) 10Slyngshede: [C:03+2] C:openldap extend wikimediaPerson schema for Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/1197617 (https://phabricator.wikimedia.org/T406495) (owner: 10Slyngshede) [12:10:22] (03CR) 10Slyngshede: [C:03+2] C:openldap extend wikimediaPerson schema for Phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1197617 (https://phabricator.wikimedia.org/T406495) (owner: 10Slyngshede) [12:10:45] (03PS3) 10Daniel Kinzler: rest gateway: simplify Lua code for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214486 [12:11:34] (03CR) 10Daniel Kinzler: [C:03+2] api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107 (owner: 10Daniel Kinzler) [12:11:38] (03CR) 10Muehlenhoff: [C:03+2] Remove mediawiki-testers group [puppet] - 10https://gerrit.wikimedia.org/r/1214110 (owner: 10Muehlenhoff) [12:11:38] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: extract Lua code for testability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092 (owner: 10Daniel Kinzler) [12:12:03] (03PS2) 10Majavah: P:toolforge: static: Publish worker IPs as a JSON file [puppet] - 10https://gerrit.wikimedia.org/r/1214493 (https://phabricator.wikimedia.org/T411610) [12:12:58] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7778/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1214493 (https://phabricator.wikimedia.org/T411610) (owner: 10Majavah) [12:13:07] (03CR) 10Clément Goubert: [C:03+1] rest gateway: simplify Lua code for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214486 (owner: 10Daniel Kinzler) [12:13:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1214169 (owner: 10JHathaway) [12:13:20] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: simplify Lua code for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214486 (owner: 10Daniel Kinzler) [12:13:33] (03Merged) 10jenkins-bot: api-gateway: add lua tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107 (owner: 10Daniel Kinzler) [12:13:34] (03Merged) 10jenkins-bot: rest-gateway: extract Lua code for testability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092 (owner: 10Daniel Kinzler) [12:14:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86367 and previous config saved to /var/cache/conftool/dbconfig/20251203-121409-marostegui.json [12:14:14] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [12:14:15] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [12:14:17] FIRING: [17x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:14:37] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1214473 (https://phabricator.wikimedia.org/T399807) (owner: 10Filippo Giunchedi) [12:15:21] (03Merged) 10jenkins-bot: rest gateway: simplify Lua code for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214486 (owner: 10Daniel Kinzler) [12:15:37] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" [alerts] - 10https://gerrit.wikimedia.org/r/1214478 (https://phabricator.wikimedia.org/T399807) (owner: 10Filippo Giunchedi) [12:17:27] (03CR) 10Muehlenhoff: [C:04-1] "This looks good, but marking as -1 until the preconditions are resolved (puppet base classes, d-i, etc)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214488 (https://phabricator.wikimedia.org/T408219) (owner: 10Elukey) [12:17:47] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:18:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:18:34] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:19:17] FIRING: [19x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:19:30] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:19:37] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [12:19:45] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [12:20:17] (03CR) 10Muehlenhoff: [C:03+2] Add a 
Cumin alias to allow teams to track missing UEFI migrations [puppet] - 10https://gerrit.wikimedia.org/r/1214398 (owner: 10Muehlenhoff) [12:20:33] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/1208365 (https://phabricator.wikimedia.org/T410745) (owner: 10Tiziano Fogli) [12:20:38] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:23:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:24:17] FIRING: [16x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:26:00] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:26:00] (03CR) 10FNegri: [C:03+1] P:toolforge: static: Publish worker IPs as a JSON file [puppet] - 10https://gerrit.wikimedia.org/r/1214493 (https://phabricator.wikimedia.org/T411610) (owner: 10Majavah) [12:26:19] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: static: Publish worker IPs as a JSON file [puppet] - 10https://gerrit.wikimedia.org/r/1214493 (https://phabricator.wikimedia.org/T411610) (owner: 10Majavah) [12:26:40] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:28:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:29:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T410589)', diff saved to https://phabricator.wikimedia.org/P86368 and previous config saved to /var/cache/conftool/dbconfig/20251203-122912-ladsgroup.json [12:29:16] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [12:29:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:29:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P86369 and previous config saved to /var/cache/conftool/dbconfig/20251203-122923-marostegui.json [12:30:08] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:30:35] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:32:00] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: GDNSd discovery records: balance requests from POPs across core sites - https://phabricator.wikimedia.org/T411617 (10cmooney) 03NEW p:05Triage→03Medium [12:32:14] !log Restarting failed timer dump_cloud_ip_ranges on puppetservers [12:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:51] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: GDNSd discovery records: balance requests from POPs across core sites - https://phabricator.wikimedia.org/T411617#11428021 (10cmooney) [12:34:17] FIRING: [12x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:35:25] RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:37:20] (03CR) 10Majavah: [C:03+2] P:grafana: Default to UTC timezone [puppet] - 10https://gerrit.wikimedia.org/r/1213506 (https://phabricator.wikimedia.org/T411274) (owner: 10Majavah) [12:38:02] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:38:42] (03PS2) 10Tchanders: WIP Enable temporary accounts on enwikinews and ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214489 (https://phabricator.wikimedia.org/T411618) [12:39:17] FIRING: [12x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:40:29] (03PS3) 10Daniel Kinzler: api gateway: add CDN headers to access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212560 [12:40:38] (03CR) 10CI reject: [V:04-1] api gateway: add CDN headers to access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212560 (owner: 
10Daniel Kinzler) [12:41:32] (03PS4) 10Daniel Kinzler: api gateway: add CDN headers to access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212560 [12:44:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:44:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P86370 and previous config saved to /var/cache/conftool/dbconfig/20251203-124419-ladsgroup.json [12:44:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P86371 and previous config saved to /var/cache/conftool/dbconfig/20251203-124430-marostegui.json [12:45:39] (03CR) 10Clément Goubert: [C:03+1] "That will add this information to the api-gateway's logs as well (which is fine). 
I can deploy the api-gateway, since it has quite a few u" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212560 (owner: 10Daniel Kinzler) [12:46:28] (03CR) 10Daniel Kinzler: [C:03+2] api gateway: add CDN headers to access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212560 (owner: 10Daniel Kinzler) [12:48:19] (03Merged) 10jenkins-bot: api gateway: add CDN headers to access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212560 (owner: 10Daniel Kinzler) [12:49:17] FIRING: [16x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:49:34] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:49:52] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:50:16] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:50:23] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:50:28] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:51:25] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:51:52] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [12:52:12] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [12:52:24] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11428100 (10MoritzMuehlenhoff) [12:52:33] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:53:56] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: 
apply [12:54:14] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [12:54:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:55:54] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [12:56:14] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [12:56:28] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:57:01] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:58:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:59:17] FIRING: [16x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:59:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P86372 and previous config saved to /var/cache/conftool/dbconfig/20251203-125927-ladsgroup.json [12:59:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86373 and previous config saved to /var/cache/conftool/dbconfig/20251203-125938-marostegui.json [12:59:43] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [12:59:44] T411164: Drop rev_sha1 
from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [12:59:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2227.codfw.wmnet with reason: Maintenance [13:00:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86374 and previous config saved to /var/cache/conftool/dbconfig/20251203-130002-marostegui.json [13:00:11] (03PS2) 10Muehlenhoff: pcc: Drop obsolete OS conditional [puppet] - 10https://gerrit.wikimedia.org/r/1126104 [13:00:31] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [13:00:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:00:49] (03PS1) 10Jelto: devtools hiera: set gitlab::runner::docker set MTU to 1450 [puppet] - 10https://gerrit.wikimedia.org/r/1214507 (https://phabricator.wikimedia.org/T405742) [13:01:21] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [13:02:06] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:02:06] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:03:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:03:25] RESOLVED: SystemdUnitFailed: 
send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:17] FIRING: [14x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:04:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86375 and previous config saved to /var/cache/conftool/dbconfig/20251203-130437-marostegui.json [13:06:03] (03PS3) 10KartikMistry: Update rec-api to 2025-12-02-200719-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214195 (https://phabricator.wikimedia.org/T408845) (owner: 10Sbisson) [13:06:57] (03PS1) 10Majavah: interface::rule: Use wmflib::ip2cidr [puppet] - 10https://gerrit.wikimedia.org/r/1214508 [13:06:57] (03PS1) 10Majavah: P:kubernetes: deployment_server: Use wmflib::ip2cidr [puppet] - 10https://gerrit.wikimedia.org/r/1214509 [13:08:02] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:08:03] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Riku Silvola - https://phabricator.wikimedia.org/T411624 (10Rsilvola) 03NEW [13:08:24] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:09:17] FIRING: [13x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:10:00] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7780/console" [puppet] - 10https://gerrit.wikimedia.org/r/1214509 (owner: 10Majavah) [13:10:00] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:11:21] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7779/console" [puppet] - 10https://gerrit.wikimedia.org/r/1214508 (owner: 10Majavah) [13:11:56] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:11:56] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:13:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:14:17] FIRING: [12x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:14:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T410589)', diff saved to https://phabricator.wikimedia.org/P86376 and previous config saved to /var/cache/conftool/dbconfig/20251203-131435-ladsgroup.json [13:14:39] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [13:14:40] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2229.codfw.wmnet with reason: Maintenance [13:14:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2229 (T410589)', diff saved to https://phabricator.wikimedia.org/P86377 and previous config saved to /var/cache/conftool/dbconfig/20251203-131448-ladsgroup.json [13:16:32] (03PS4) 10Arnaudb: gerrit: rsync logic extraction from failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1214466 (https://phabricator.wikimedia.org/T387833) [13:16:32] (03CR) 10Arnaudb: "This change will allow running the file transfer logic between gerrit instances. 
It will also simplify double checking transfer between ho" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214466 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [13:16:50] (03CR) 10KartikMistry: [C:03+2] Update rec-api to 2025-12-02-200719-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214195 (https://phabricator.wikimedia.org/T408845) (owner: 10Sbisson) [13:18:02] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:18:29] (03CR) 10Majavah: [C:04-1] "You don't need this anymore, we fixed the network MTU instead" [puppet] - 10https://gerrit.wikimedia.org/r/1214507 (https://phabricator.wikimedia.org/T405742) (owner: 10Jelto) [13:18:32] (03Merged) 10jenkins-bot: Update rec-api to 2025-12-02-200719-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214195 (https://phabricator.wikimedia.org/T408845) (owner: 10Sbisson) [13:19:17] RESOLVED: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:19:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P86378 and previous config saved to /var/cache/conftool/dbconfig/20251203-131945-marostegui.json [13:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:20:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:22:42] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:24:47] (03CR) 10Jelto: "that's great news! thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1214507 (https://phabricator.wikimedia.org/T405742) (owner: 10Jelto) [13:24:53] (03Abandoned) 10Jelto: devtools hiera: set gitlab::runner::docker set MTU to 1450 [puppet] - 10https://gerrit.wikimedia.org/r/1214507 (https://phabricator.wikimedia.org/T405742) (owner: 10Jelto) [13:25:20] !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:28:52] (03PS1) 10Jelto: gitlab-runners hiera: remove custom MTU [puppet] - 10https://gerrit.wikimedia.org/r/1214513 (https://phabricator.wikimedia.org/T405742) [13:30:09] !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[13:31:02] (03CR) 10Majavah: [C:03+1] gitlab-runners hiera: remove custom MTU [puppet] - 10https://gerrit.wikimedia.org/r/1214513 (https://phabricator.wikimedia.org/T405742) (owner: 10Jelto) [13:32:14] !log Updated Recommendation API to 2025-12-02-200719-production (T408845, T408844, T384485) [13:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:21] T408845: Visual indicator that an article in a list is part of a nominated collection - https://phabricator.wikimedia.org/T408845 [13:32:21] T408844: Inform that an article is part of a nominated collection on Confirmation view - https://phabricator.wikimedia.org/T408844 [13:32:21] T384485: Recommendation API: Support pagination for single page collection recommendations - https://phabricator.wikimedia.org/T384485 [13:34:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P86379 and previous config saved to /var/cache/conftool/dbconfig/20251203-133452-marostegui.json [13:35:11] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [13:35:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:45:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:45:57] (03CR) 10Jelto: [C:03+2] gitlab-runners hiera: remove custom MTU [puppet] - 10https://gerrit.wikimedia.org/r/1214513 (https://phabricator.wikimedia.org/T405742) (owner: 10Jelto) [13:46:49] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-11-14-022545 / 2025-11-17-175029 to 2025-12-03-005631 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214517 (https://phabricator.wikimedia.org/T410605) [13:46:53] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-11-26-175208 to 2025-12-02-224740 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214518 (https://phabricator.wikimedia.org/T411336) [13:50:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86380 and previous config saved to /var/cache/conftool/dbconfig/20251203-135000-marostegui.json [13:50:05] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [13:50:06] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [13:50:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2239.codfw.wmnet with reason: Maintenance [13:51:34] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:56:24] (03CR) 10Jelto: [C:03+1] "lgtm, one comment in line. 
It might make sense to create a more generic alert somewhere in https://gerrit.wikimedia.org/r/plugins/gitiles" [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [13:58:20] 06SRE, 06collaboration-services, 10vrts, 10Znuny, and 2 others: No space left on device on VRTS host - https://phabricator.wikimedia.org/T411452#11428300 (10Jelto) The Inode usage grew from 2% to 10% already in the past day: https://grafana.wikimedia.org/d/000000371/vrts?orgId=1&from=now-2d&to=now&timezone... [13:58:28] (03CR) 10Jelto: [C:03+1] "The Inode usage grew from 2% to 10% already in the past day: https://grafana.wikimedia.org/d/000000371/vrts?orgId=1&from=now-2d&to=now&tim" [puppet] - 10https://gerrit.wikimedia.org/r/1214129 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [13:58:36] (03PS14) 10Arnaudb: gerrit: rsync logic extraction from failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1214466 (https://phabricator.wikimedia.org/T387833) [13:58:36] (03CR) 10Arnaudb: "see https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1214466/comments/51d75323_218ff349" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214466 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1400). [14:00:05] stephanebisson and edsanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:18] o/ [14:00:43] o/ [14:00:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214036 (https://phabricator.wikimedia.org/T408842) (owner: 10Sbisson) [14:00:51] I can start [14:00:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:00:58] o/ [14:01:01] ok! [14:02:06] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:02:06] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:05:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:09:24] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:11:56] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:11:56] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:13:57] (03PS1) 10Giuseppe Lavagetto: kubernetes::deployment_server: add files for configuring conftool [puppet] - 10https://gerrit.wikimedia.org/r/1214524 [14:14:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1191 - https://phabricator.wikimedia.org/T411209#11428392 (10Jclark-ctr) @BTullis both drives have been replaced [14:14:26] (03CR) 10CI reject: [V:04-1] kubernetes::deployment_server: add files for configuring conftool [puppet] - 10https://gerrit.wikimedia.org/r/1214524 (owner: 10Giuseppe Lavagetto) [14:14:33] (03Merged) 10jenkins-bot: CX3 Build 1.0.0+20251201 [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214036 (https://phabricator.wikimedia.org/T408842) (owner: 10Sbisson) [14:15:05] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1214036|CX3 Build 1.0.0+20251201 (T408842 T408844)]] [14:15:10] T408842: Surface nominated collections in Search view - https://phabricator.wikimedia.org/T408842 [14:15:11] T408844: Inform that an article is part of a nominated collection on Confirmation view - https://phabricator.wikimedia.org/T408844 [14:16:37] (03CR) 10FNegri: [C:03+1] interface::rule: Use wmflib::ip2cidr [puppet] - 10https://gerrit.wikimedia.org/r/1214508 (owner: 10Majavah) [14:16:44] (03CR) 10Majavah: [V:03+1 C:03+2] interface::rule: Use wmflib::ip2cidr [puppet] - 10https://gerrit.wikimedia.org/r/1214508 (owner: 10Majavah) [14:17:16] !log sbisson@deploy2002 sbisson: Backport for [[gerrit:1214036|CX3 Build 1.0.0+20251201 
(T408842 T408844)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:21:56] !log sbisson@deploy2002 sbisson: Continuing with sync [14:25:06] (03PS1) 10Elukey: services: add maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214526 (https://phabricator.wikimedia.org/T409528) [14:27:06] !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214036|CX3 Build 1.0.0+20251201 (T408842 T408844)]] (duration: 12m 01s) [14:27:10] T408842: Surface nominated collections in Search view - https://phabricator.wikimedia.org/T408842 [14:27:11] T408844: Inform that an article is part of a nominated collection on Confirmation view - https://phabricator.wikimedia.org/T408844 [14:27:27] !log push pfw policies - T411566 [14:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213501 (https://phabricator.wikimedia.org/T402552) (owner: 10Esanders) [14:28:36] (03Merged) 10jenkins-bot: Set Flow to read-only everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213501 (https://phabricator.wikimedia.org/T402552) (owner: 10Esanders) [14:29:09] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1213501|Set Flow to read-only everywhere (T402552)]] [14:29:12] T402552: ptwikibooks: Migrate Flow boards to archival subpages - https://phabricator.wikimedia.org/T402552 [14:31:25] !log esanders@deploy2002 esanders: Backport for [[gerrit:1213501|Set Flow to read-only everywhere (T402552)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:31:36] (03CR) 10Ayounsi: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 (owner: 10Ayounsi) [14:32:04] (03CR) 10Majavah: firewall: Use virtual resources to fix ordering issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [14:32:19] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Riku Silvola - https://phabricator.wikimedia.org/T411624#11428437 (10IBerker-WMF) I approve. [14:33:41] !log esanders@deploy2002 esanders: Continuing with sync [14:34:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164178 (owner: 10Esanders) [14:35:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212161 (owner: 10Esanders) [14:38:38] (03CR) 10CI reject: [V:04-1] sre.network.tls: add timeout to get_server_certificate [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 (owner: 10Ayounsi) [14:38:53] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1213501|Set Flow to read-only everywhere (T402552)]] (duration: 09m 44s) [14:38:57] T402552: ptwikibooks: Migrate Flow boards to archival subpages - https://phabricator.wikimedia.org/T402552 [14:39:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 21.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:40:52] (03CR) 
10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212161 (owner: 10Esanders) [14:40:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164178 (owner: 10Esanders) [14:41:17] FIRING: ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:41:47] (03Merged) 10jenkins-bot: DiscussionTools: cleanup unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212161 (owner: 10Esanders) [14:41:50] (03Merged) 10jenkins-bot: Remove wgVisualEditorEditCheckSingleCheckMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164178 (owner: 10Esanders) [14:42:19] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1212161|DiscussionTools: cleanup unused config]], [[gerrit:1164178|Remove wgVisualEditorEditCheckSingleCheckMode]] [14:44:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 21.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:44:38] !log esanders@deploy2002 esanders: Backport for [[gerrit:1212161|DiscussionTools: cleanup unused config]], [[gerrit:1164178|Remove wgVisualEditorEditCheckSingleCheckMode]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:45:04] !log esanders@deploy2002 esanders: Continuing with sync [14:46:17] RESOLVED: ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:48:17] FIRING: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:03] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1212161|DiscussionTools: cleanup unused config]], [[gerrit:1164178|Remove wgVisualEditorEditCheckSingleCheckMode]] (duration: 06m 44s) [14:50:52] (03CR) 10Muehlenhoff: [C:03+1] "Looks good feature-wise. 
We might need some fine-tuning for the dialogue in terms of clarity for the user, but we can figure that out when" [software/bitu] - 10https://gerrit.wikimedia.org/r/1196919 (https://phabricator.wikimedia.org/T406495) (owner: 10Slyngshede) [14:51:59] (03CR) 10Reedy: "Seems to have caused T411632" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214201 (https://phabricator.wikimedia.org/T400023) (owner: 10Krinkle) [14:53:17] RESOLVED: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:54:46] !log UTC afternoon backport+config window done [14:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:00] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:55:22] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:56:22] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:58:41] (03PS1) 10Krinkle: robots.php: Fix undefined index 'enabled' on Wikinews and closed wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1214529 (https://phabricator.wikimedia.org/T411632) [14:58:47] (03CR) 10JHathaway: [C:03+2] admin: add fido backed ssh keys for jhathaway [puppet] - 10https://gerrit.wikimedia.org/r/1214169 (owner: 10JHathaway) [15:00:00] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:00:03] !log alert1002 port migration now starting [15:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1500) [15:00:16] (03PS6) 10Dpogorzelski: ml-build: define new machine name/type [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) [15:00:17] (03PS1) 10Slyngshede: C:mtail backend requests ttfb [puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) [15:00:30] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on alert1002.wikimedia.org with reason: C/D Migration [15:00:58] (03CR) 10Jforrester: [C:03+1] "Oops. 
:-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214529 (https://phabricator.wikimedia.org/T411632) (owner: 10Krinkle) [15:02:32] (03PS2) 10Klausman: installserver/partman: Add custom recipe for ml-build1001 [puppet] - 10https://gerrit.wikimedia.org/r/1214530 (https://phabricator.wikimedia.org/T394778) [15:02:59] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2025-11-14-022545 / 2025-11-17-175029 to 2025-12-03-005631 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214517 (https://phabricator.wikimedia.org/T410605) (owner: 10Jforrester) [15:03:30] (03PS1) 10Ayounsi: Tox: remove old python support [cookbooks] - 10https://gerrit.wikimedia.org/r/1214532 [15:03:40] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: sync [15:03:48] (03CR) 10Dpogorzelski: [C:03+1] installserver/partman: Add custom recipe for ml-build1001 [puppet] - 10https://gerrit.wikimedia.org/r/1214530 (https://phabricator.wikimedia.org/T394778) (owner: 10Klausman) [15:03:57] (03CR) 10Ayounsi: sre.network.tls: add timeout to get_server_certificate [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 (owner: 10Ayounsi) [15:04:20] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync [15:04:20] (03PS4) 10Ayounsi: sre.network.tls: add timeout to get_server_certificate [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 [15:04:56] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-11-14-022545 / 2025-11-17-175029 to 2025-12-03-005631 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214517 (https://phabricator.wikimedia.org/T410605) (owner: 10Jforrester) [15:06:03] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:06:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11428624 (10RobH) 05Open→03Resolved This migration was completed just 
now with no issues. Thanks to both @Jclark-ctr and @herron for the on-site part and the icin... [15:06:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2205.codfw.wmnet with reason: Maintenance [15:06:51] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:07:03] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:08:08] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:08:17] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:09:04] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:09:40] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-11-26-175208 to 2025-12-02-224740 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214518 (https://phabricator.wikimedia.org/T411336) (owner: 10Jforrester) [15:10:00] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:50] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-11-26-175208 to 2025-12-02-224740 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214518 (https://phabricator.wikimedia.org/T411336) (owner: 10Jforrester) [15:12:22] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:12:31] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: eqiad: cleanup Interface enabled but not connected alert - https://phabricator.wikimedia.org/T411194#11428643 (10Jclark-ctr) 05Open→03Resolved Resolved the remaining. 
[15:12:41] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad: cleanup Interface enabled but not connected alert - https://phabricator.wikimedia.org/T411194#11428647 (10Jclark-ctr) [15:12:42] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:13:57] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:14:27] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:15:10] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:15:42] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:16:14] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: sync [15:16:37] jouncebot: nowandnext [15:16:37] For the next 0 hour(s) and 43 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1500) [15:16:37] In 0 hour(s) and 13 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1530) [15:16:52] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync [15:16:54] Amir1: We're not MW-facing, go for it on your end. 
[15:17:36] <3 [15:17:47] (03CR) 10Ladsgroup: [C:03+2] Clean up db groups config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211664 (https://phabricator.wikimedia.org/T411088) (owner: 10Ladsgroup) [15:17:48] (03PS3) 10Jforrester: wikifunctions: Set FUNCTION_EVALUATOR_WASI_ACQUIRE_TIMEOUT to 1.5s down from 3s default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204607 (https://phabricator.wikimedia.org/T408977) [15:17:53] (03CR) 10Jforrester: [C:03+2] wikifunctions: Set FUNCTION_EVALUATOR_WASI_ACQUIRE_TIMEOUT to 1.5s down from 3s default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204607 (https://phabricator.wikimedia.org/T408977) (owner: 10Jforrester) [15:17:54] (03CR) 10Cathal Mooney: [C:03+1] ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [15:18:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211664 (https://phabricator.wikimedia.org/T411088) (owner: 10Ladsgroup) [15:18:54] (03Merged) 10jenkins-bot: Clean up db groups config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211664 (https://phabricator.wikimedia.org/T411088) (owner: 10Ladsgroup) [15:19:26] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1211664|Clean up db groups config (T411088)]] [15:19:29] T411088: Clean up groups config - https://phabricator.wikimedia.org/T411088 [15:19:30] (03PS1) 10Jforrester: wikifunctions: Set orchestrator maxSimultaneousExecutions to 1000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214536 (https://phabricator.wikimedia.org/T409111) [15:19:56] (03Merged) 10jenkins-bot: wikifunctions: Set FUNCTION_EVALUATOR_WASI_ACQUIRE_TIMEOUT to 1.5s down from 3s default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204607 (https://phabricator.wikimedia.org/T408977) (owner: 10Jforrester) [15:20:36] (03CR) 10Elukey: 
[C:03+1] ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [15:20:38] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:20:52] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:21:08] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:21:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11428720 (10RobH) Day 12 Update (in progress, will edit as day progresses): * alert1002 migration complete * 306 of 308 hosts migrated. * lvs1019 will migrat... [15:21:41] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1211664|Clean up db groups config (T411088)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:21:54] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:21:54] (03CR) 10Elukey: [C:03+1] reimage: default to UUID rather than Option 82 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214152 (owner: 10JHathaway) [15:22:08] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:23:07] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:23:07] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [15:23:14] (03CR) 10JHathaway: [C:03+2] ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [15:23:40] (03CR) 10Jforrester: [C:03+2] wikifunctions: Set orchestrator maxSimultaneousExecutions to 1000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214536 (https://phabricator.wikimedia.org/T409111) (owner: 10Jforrester) [15:24:23] (03CR) 10JHathaway: [C:03+2] reimage: default to UUID rather than Option 82 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214152 (owner: 10JHathaway) [15:24:29] (03CR) 10CI reject: [V:04-1] reimage: default to UUID rather than Option 82 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214152 (owner: 10JHathaway) [15:24:52] (03PS1) 10Ayounsi: inter.link: add DDoS scrubbing community to all v4 prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/1214537 (https://phabricator.wikimedia.org/T407959) [15:25:23] (03PS2) 10Ayounsi: inter.link: add DDoS scrubbing community to all v4 prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/1214537 (https://phabricator.wikimedia.org/T407959) [15:25:33] (03Merged) 10jenkins-bot: wikifunctions: Set orchestrator maxSimultaneousExecutions to 1000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214536 (https://phabricator.wikimedia.org/T409111) (owner: 10Jforrester) [15:26:08] (03PS2) 10JHathaway: reimage: default to UUID rather than Option 82 [cookbooks] - 
10https://gerrit.wikimedia.org/r/1214152 [15:26:22] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:26:41] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:26:43] (03CR) 10JHathaway: [V:03+2 C:03+2] reimage: default to UUID rather than Option 82 [cookbooks] - 10https://gerrit.wikimedia.org/r/1214152 (owner: 10JHathaway) [15:26:55] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host conf2006.codfw.wmnet [15:27:00] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:27:04] (03PS3) 10Klausman: installserver/partman: Add custom recipe for ml-build1001 [puppet] - 10https://gerrit.wikimedia.org/r/1214530 (https://phabricator.wikimedia.org/T394778) [15:27:13] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211664|Clean up db groups config (T411088)]] (duration: 07m 48s) [15:27:16] T411088: Clean up groups config - https://phabricator.wikimedia.org/T411088 [15:27:35] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:27:43] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:27:50] (03CR) 10Muehlenhoff: [C:03+2] Switch conf2006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214396 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:28:15] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:30:06] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1500) [15:30:06] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1530) [15:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:32:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host conf2006.codfw.wmnet [15:35:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214494 (https://phabricator.wikimedia.org/T410918) (owner: 10DDesouza) [15:43:35] (03CR) 10Elukey: [C:03+1] UEFI: dup partition on MD RAID boxes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [15:47:27] (03CR) 10Elukey: [C:03+1] "I am fine with this, we could even think about replacing py39/310 with 312/313 and see how it goes, to be more future proof. 
It can be don" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214532 (owner: 10Ayounsi) [15:50:08] (03CR) 10JHathaway: [C:03+2] UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [15:55:03] (03PS4) 10Majavah: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) [15:55:03] (03PS5) 10Majavah: wmflib: hosts2ips: Allow passing in IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1211650 [15:55:03] (03PS5) 10Majavah: firewall: Declare resources for both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [15:55:04] (03PS5) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [15:55:05] (03PS1) 10Majavah: ferm: Only collect resources when ensure is present [puppet] - 10https://gerrit.wikimedia.org/r/1214549 [15:56:56] (03PS1) 10Muehlenhoff: Switch conf2005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214550 (https://phabricator.wikimedia.org/T349619) [15:59:25] FIRING: SystemdUnitFailed: dup-uefi.service on cloudelastic1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:00:10] (03PS5) 10Majavah: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) [16:00:10] (03PS6) 10Majavah: wmflib: hosts2ips: Allow passing in IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/1211650 [16:00:10] (03PS2) 10Majavah: ferm: Only collect resources when ensure is present [puppet] - 10https://gerrit.wikimedia.org/r/1214549 [16:00:11] (03PS6) 10Majavah: firewall: Declare resources for 
both providers [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) [16:00:12] (03PS6) 10Majavah: P:wmcs::instance: Convert to firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1211652 (https://phabricator.wikimedia.org/T411089) [16:03:16] PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [16:03:16] PROBLEM - Thanos swift https on thanos-fe1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [16:03:26] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [16:04:27] (03PS1) 10Muehlenhoff: Switch conf2004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214553 (https://phabricator.wikimedia.org/T349619) [16:05:06] RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Thanos [16:05:10] RECOVERY - Thanos swift https on thanos-fe1006 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 4.034 second response time https://wikitech.wikimedia.org/wiki/Thanos [16:05:40] 10SRE-SLO, 10Citoid, 10VisualEditor, 06Editing-team (Tracking): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11429011 (10Mvolz) >>! In T345627#11409512, @elukey wrote: > @Mvolz all merged, the new dashbo... 
[16:06:17] (03CR) 10Majavah: nftables::service: Improve src/dst filter handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [16:07:34] (03PS1) 10Muehlenhoff: Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1214556 (https://phabricator.wikimedia.org/T311407) [16:07:59] (03CR) 10Majavah: firewall: Declare resources for both providers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: 10Majavah) [16:09:00] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [16:09:25] FIRING: [2x] SystemdUnitFailed: dup-uefi.service on cirrussearch1124:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:09:40] jouncebot nowandnext [16:09:40] No deployments scheduled for the next 1 hour(s) and 50 minute(s) [16:09:40] In 1 hour(s) and 50 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800) [16:09:40] In 1 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800) [16:09:59] !log dancy@deploy2002 Installing scap version "4.229.0" for 164 host(s) [16:10:33] (03PS1) 10Muehlenhoff: Switch conf1007 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214557 (https://phabricator.wikimedia.org/T349619) [16:10:48] (03CR) 10Majavah: service::catalog: add gerrit-https and gerrit-ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [16:11:59] (03PS1) 10Muehlenhoff: Switch conf1008 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214558 
(https://phabricator.wikimedia.org/T349619) [16:12:41] (03PS7) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 [16:13:06] (03PS1) 10Muehlenhoff: Switch conf1009 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214561 (https://phabricator.wikimedia.org/T349619) [16:13:54] !log dancy@deploy2002 Installation of scap version "4.229.0" completed for 164 hosts [16:15:09] !log disabling unused former cloudcephosd hosts on cloud switches T410989 [16:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:12] T410989: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989 [16:16:00] (03PS1) 10Bking: opensearch-operator: push dummy chart update [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214562 (https://phabricator.wikimedia.org/T410956) [16:19:25] FIRING: [3x] SystemdUnitFailed: dup-uefi.service on cirrussearch1119:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:19:37] (03PS1) 10JHathaway: UEFI: remove dup timer on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1214563 (https://phabricator.wikimedia.org/T376949) [16:19:57] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1214563 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [16:20:24] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11429115 (10cmooney) a:05cmooney→03None Ok I've disabled all the unused ports on the cloud switches now. The one exception is for cloudcepho... 
[16:21:12] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11429120 (10cmooney) DC-Ops folks we can now remove these superflous cables from the racks, and once removed delete the cable in Netbox too. Thi... [16:21:22] (03CR) 10CDanis: service::catalog: add gerrit-https and gerrit-ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [16:22:23] (03PS2) 10JHathaway: UEFI: remove dup timer on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1214563 (https://phabricator.wikimedia.org/T376949) [16:22:27] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1214563 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [16:24:20] (03PS1) 10Muehlenhoff: Only select Puppet version based on the Debian release (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) [16:24:26] (03PS4) 10Arnaudb: gerrit: unmask service & disable backup temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833) [16:24:26] (03CR) 10Arnaudb: "this change and the next one are designed to be merged after puppet is disabled on all Gerrit instances." [puppet] - 10https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [16:25:37] (03CR) 10Vgutierrez: sre.loadbalancer: patch to fix reboot action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 (owner: 10CDobbins) [16:26:15] (03PS3) 10Arnaudb: gerrit: Switchover gerrit1003 → gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1211549 (https://phabricator.wikimedia.org/T338470) [16:26:15] (03CR) 10Arnaudb: "this change is designed to be merged after puppet is disabled on all Gerrit instances." 
[puppet] - 10https://gerrit.wikimedia.org/r/1211549 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [16:26:25] (03PS1) 10Urbanecm: [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214568 (https://phabricator.wikimedia.org/T410469) [16:27:14] (03CR) 10CI reject: [V:04-1] [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214568 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm) [16:27:17] (03CR) 10JHathaway: [C:03+2] UEFI: remove dup timer on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1214563 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [16:27:20] (03PS3) 10Arnaudb: gerrit: re-enable backups on gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1211551 (https://phabricator.wikimedia.org/T387833) [16:27:20] (03CR) 10Arnaudb: "this change is designed to be merged once the switchover is done. It will enable backups again on what will then be the primary instance." 
[puppet] - 10https://gerrit.wikimedia.org/r/1211551 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [16:31:01] (03PS1) 10Urbanecm: [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214570 (https://phabricator.wikimedia.org/T410469) [16:31:56] (03CR) 10CI reject: [V:04-1] [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214570 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm) [16:32:55] (03PS1) 10Urbanecm: [Growth] Sort the list of Add Link wikis alphabetically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214571 (https://phabricator.wikimedia.org/T410469) [16:33:43] (03CR) 10CI reject: [V:04-1] [Growth] Sort the list of Add Link wikis alphabetically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214571 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm) [16:35:25] (03PS2) 10Urbanecm: [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214570 (https://phabricator.wikimedia.org/T410469) [16:37:30] (03PS2) 10Muehlenhoff: etcd: Remove the use_pki_certs flag [puppet] - 10https://gerrit.wikimedia.org/r/978615 [16:37:40] 10ops-codfw, 06SRE, 06DC-Ops: codfw: cleanup Interface enabled but not connected alert - https://phabricator.wikimedia.org/T411195#11429197 (10Jhancock.wm) 05Open→03Resolved all ports verified empty and removed from netbox [16:38:17] (03CR) 10CI reject: [V:04-1] etcd: Remove the use_pki_certs flag [puppet] - 10https://gerrit.wikimedia.org/r/978615 (owner: 10Muehlenhoff) [16:38:57] jouncebot: nowandnext [16:38:57] No deployments scheduled for the next 1 hour(s) and 21 minute(s) [16:38:57] In 1 hour(s) and 21 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800) [16:38:57] In 1 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800) [16:39:11] (03PS3) 10Muehlenhoff: etcd: Remove the use_pki_certs flag [puppet] - 10https://gerrit.wikimedia.org/r/978615 [16:39:17] (03CR) 10Muehlenhoff: etcd: Remove the use_pki_certs flag [puppet] - 10https://gerrit.wikimedia.org/r/978615 (owner: 10Muehlenhoff) [16:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:40:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211191 (owner: 10BryanDavis) [16:40:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211192 (owner: 10BryanDavis) [16:41:21] (03Merged) 10jenkins-bot: officewiki: Put indicators in title with vector-2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211191 (owner: 10BryanDavis) [16:41:23] (03Merged) 10jenkins-bot: officewiki: Enable page protection indicators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211192 (owner: 10BryanDavis) [16:41:52] sukhe: ^^ any recent changes to druid-public-coordinator? [16:41:54] !log bd808@deploy2002 Started scap sync-world: Backport for [[gerrit:1211191|officewiki: Put indicators in title with vector-2022]], [[gerrit:1211192|officewiki: Enable page protection indicators]] [16:42:02] /cc btullis [16:42:12] vgutierrez: nope. the last patch attempt failed so we reverted it [16:42:31] will check after the meeting [16:42:36] did you clear the alerts after that? 
[16:44:00] I am not sure if this is related to that so I will need to check [16:44:02] will do so [16:44:06] but no, I did not clear the alerts [16:44:22] !log bd808@deploy2002 bd808: Backport for [[gerrit:1211191|officewiki: Put indicators in title with vector-2022]], [[gerrit:1211192|officewiki: Enable page protection indicators]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:45:03] yeah.. that alert is 16 days old [16:45:38] !log bd808@deploy2002 bd808: Continuing with sync [16:46:07] but it's still being triggered at the moment [16:46:09] 2025-12-03T16:45:21.997079+00:00 config-master2001 confd[88705]: 2025-12-03T16:45:21Z config-master2001 /usr/bin/confd[88705]: ERROR "failed linting '/usr/local/bin/pybal-eval-check /srv/config-master/pybal/eqiad/.druid-public-coordinator1210973826' with 1 (0.02312159538269043s) [invalid]: server pool cannot be empty!\n\nupdating error mtime on [16:46:09] /var/run/confd-template/_srv_config-master_pybal_eqiad_druid-public-coordinator.err\n" [16:47:31] 06SRE, 10SRE-Access-Requests: Requesting access to the analytics-platform-eng-admins POSIX group for Sandra Ebele Nwoko - https://phabricator.wikimedia.org/T411648 (10BTullis) 03NEW [16:47:44] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Requesting access to the analytics-platform-eng-admins POSIX group for Sandra Ebele Nwoko - https://phabricator.wikimedia.org/T411648#11429236 (10BTullis) a:03BTullis [16:48:44] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Requesting access to the analytics-platform-eng-admins POSIX group for Sandra Ebele Nwoko - https://phabricator.wikimedia.org/T411648#11429244 (10Ahoelzl) Approved.
[16:49:41] !log bd808@deploy2002 Finished scap sync-world: Backport for [[gerrit:1211191|officewiki: Put indicators in title with vector-2022]], [[gerrit:1211192|officewiki: Enable page protection indicators]] (duration: 07m 47s) [16:49:51] (03PS1) 10Btullis: Add more approvers for analytics-platform-eng-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1214576 (https://phabricator.wikimedia.org/T411648) [16:49:55] RESOLVED: [5x] SystemdUnitFailed: dup-uefi.service on cirrussearch1119:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:51:16] (03PS1) 10Btullis: Add ebysans to the analytics-platform-eng-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1214578 (https://phabricator.wikimedia.org/T411648) [16:53:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214529 (https://phabricator.wikimedia.org/T411632) (owner: 10Krinkle) [16:53:18] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 13Patch-For-Review: Requesting access to the analytics-platform-eng-admins POSIX group for Sandra Ebele Nwoko - https://phabricator.wikimedia.org/T411648#11429266 (10BTullis) [16:53:33] Krinkle: ^^ sending that robots.txt fix out [16:53:55] (03CR) 10Ahoelzl: [V:03+1] Add more approvers for analytics-platform-eng-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1214576 (https://phabricator.wikimedia.org/T411648) (owner: 10Btullis) [16:54:08] (03Merged) 10jenkins-bot: robots.php: Fix undefined index 'enabled' on Wikinews and closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214529 (https://phabricator.wikimedia.org/T411632) (owner: 10Krinkle) [16:54:09] (03CR) 10Ahoelzl: [V:03+1] Add ebysans to the analytics-platform-eng-admins group [puppet] - 
10https://gerrit.wikimedia.org/r/1214578 (https://phabricator.wikimedia.org/T411648) (owner: 10Btullis) [16:54:40] !log bd808@deploy2002 Started scap sync-world: Backport for [[gerrit:1214529|robots.php: Fix undefined index 'enabled' on Wikinews and closed wikis (T411632)]] [16:54:43] T411632: PHP Warning: Undefined array key "enabled" - https://phabricator.wikimedia.org/T411632 [16:55:11] (03CR) 10Btullis: [C:03+2] Add more approvers for analytics-platform-eng-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1214576 (https://phabricator.wikimedia.org/T411648) (owner: 10Btullis) [16:55:18] (03CR) 10Btullis: [C:03+2] Add ebysans to the analytics-platform-eng-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1214578 (https://phabricator.wikimedia.org/T411648) (owner: 10Btullis) [16:56:50] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 13Patch-For-Review: Requesting access to the analytics-platform-eng-admins POSIX group for Sandra Ebele Nwoko - https://phabricator.wikimedia.org/T411648#11429277 (10BTullis) 05Open→03Resolved [16:57:07] !log bd808@deploy2002 bd808, krinkle: Backport for [[gerrit:1214529|robots.php: Fix undefined index 'enabled' on Wikinews and closed wikis (T411632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:58:17] !log bd808@deploy2002 bd808, krinkle: Continuing with sync [16:58:24] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11429297 (10Jhancock.wm) the four servers in codfw have had cables physically removed and deleted in netbox. [17:01:23] bd808: thx, want me to test or are you? 
[17:02:20] !log bd808@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214529|robots.php: Fix undefined index 'enabled' on Wikinews and closed wikis (T411632)]] (duration: 07m 40s) [17:02:23] T411632: PHP Warning: Undefined array key "enabled" - https://phabricator.wikimedia.org/T411632 [17:02:24] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981#11429343 (10Jhancock.wm) @Dwisehaupt two network connections have now been provisioned. lmk if you need anything else =) [17:02:41] Krinkle: I did a quick test that robots.txt still rendered before I sent it the rest of the way, but if you can watch to make sure the error stops that would be swell. [17:02:56] yep, error is gone on https://en.wikinews.org/robots.txt?_1235 after the change on mwdebug [17:03:03] no more logspam [17:04:07] PROBLEM - Host db1229 #page is DOWN: PING CRITICAL - Packet loss = 100% [17:05:36] checking [17:06:01] RECOVERY - Host db1229 #page is UP: PING WARNING - Packet loss = 50%, RTA = 284.91 ms [17:06:33] did it crash? 
[17:06:38] hmmh, can't reach the host even on the mgmt [17:06:42] I can [17:06:47] it rebooted [17:06:56] PROBLEM - MariaDB read only s2 on db1229 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:06:59] let's depool [17:07:07] (03Abandoned) 10Urbanecm: [Growth] Enable Add Link backend on a handful of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214568 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm) [17:07:08] prob hw issue [17:07:12] (03PS4) 10Urbanecm: [Growth] Sort the list of Add Link wikis alphabetically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214571 (https://phabricator.wikimedia.org/T410469) [17:07:42] (03CR) 10Urbanecm: "> https://integration.wikimedia.org/ci/job/operations-mw-config-php83-composer-diffConfig/76/console : FAILURE No change detected against " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214571 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm) [17:07:46] !log jynus@cumin1003 dbctl commit (dc=all): 'Depooldb1229', diff saved to https://phabricator.wikimedia.org/P86383 and previous config saved to /var/cache/conftool/dbconfig/20251203-170745-jynus.json [17:07:56] PROBLEM - mysqld processes on db1229 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [17:07:57] PROBLEM - MariaDB Replica SQL: s2 #page on db1229 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:07:58] PROBLEM - MariaDB Replica IO: s2 #page on db1229 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:08:09] mmm [17:08:11] I will ack or downtime [17:08:17] looks like a hw crash [17:08:21] yeah [17:08:23] and file a ticket [17:08:33] 10SRE-SLO, 10Citoid, 
10VisualEditor, 06Editing-team (Tracking): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11429363 (10elukey) @Mvolz ahhh ok thanks for the explanation! I rechecked the graph and it sh... [17:08:37] I am depooling [17:09:05] I did it already [17:09:07] see backlog [17:09:09] oh! [17:09:10] thanks! [17:10:38] there's a broken DIMM at B7 [17:10:49] I'll open a task for DC ops to get it swapped [17:11:01] thank you moritzm tag DBA if you can! [17:11:05] please add the info to 411652 [17:11:10] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1229.eqiad.wmnet with reason: crashed [17:11:13] moritzm: ^ [17:11:23] ah, thx [17:11:51] T411652 [17:11:51] T411652: db1229 crashed - https://phabricator.wikimedia.org/T411652 [17:11:52] (03PS1) 10Sbisson: CX3 Build 1.0.0+20251126 [extensions/ContentTranslation] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214580 (https://phabricator.wikimedia.org/T384485) [17:13:12] 10ops-eqiad, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11429385 (10MoritzMuehlenhoff) [17:14:17] jouncebot now [17:14:17] No deployments scheduled for the next 0 hour(s) and 45 minute(s) [17:14:22] jouncebot next [17:14:22] In 0 hour(s) and 45 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800) [17:14:22] In 0 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800) [17:15:01] Hi, any chance I can do an emergency backport for Content Translation right now? 
[17:15:23] 10ops-eqiad, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11429388 (10jcrespo) [17:15:45] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11429389 (10ssingh) >>! In T408892#11426444, @Papaul wrote: > @ssingh yes we have to depool the site, yes 10 AM CT Thanks, that works. Will send an invite. [17:16:40] 10ops-eqiad, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11429390 (10jcrespo) [17:16:54] (03PS1) 10Marostegui: db1229: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1214581 (https://phabricator.wikimedia.org/T411652) [17:17:01] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11429391 (10cmooney) >>! In T410989#11429297, @Jhancock.wm wrote: > the four servers in codfw have had cables physically removed and deleted in n... 
[17:17:36] (03CR) 10Marostegui: [C:03+2] db1229: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1214581 (https://phabricator.wikimedia.org/T411652) (owner: 10Marostegui) [17:17:38] 10ops-eqiad, 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11429394 (10cmooney) [17:17:48] (03PS5) 10Mstyles: OATHAuth: Expand 2FA to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664) [17:18:39] (03CR) 10Mstyles: OATHAuth: Expand 2FA to all users (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [17:19:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214580 (https://phabricator.wikimedia.org/T384485) (owner: 10Sbisson) [17:19:26] given db1229 is depooled, DC ops are looped in for the eventual hardware fix and notifications are now disabled, I'd resolve the page [17:20:09] 06SRE, 06collaboration-services, 10vrts, 10Znuny, and 2 others: No space left on device on VRTS host - https://phabricator.wikimedia.org/T411452#11429403 (10Dzahn) +1 - it seems the cleanup job is needed [17:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:23:09] (03CR) 10Scott French: [C:03+1] "Thanks for the additional discussion, Reuven." 
[puppet] - 10https://gerrit.wikimedia.org/r/1213604 (owner: 10RLazarus) [17:25:38] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:25:48] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:26:40] FIRING: ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#commons.wikimedia.org:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:26:42] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:26:46] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:28:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (2a02:ec80:700:fe0b::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:29:10] FIRING: [2x] BFDdown: BFD session down between cr2-magru and 195.200.68.152 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:29:38] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status 
[17:29:43] (03CR) 10Dzahn: "Thank you for the reviews. I like getting everything merged that is possible to merge without harm. I actually see a benefit in getting pa" [dns] - 10https://gerrit.wikimedia.org/r/1214177 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [17:29:46] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:30:12] PROBLEM - PyBal backends health check on lvs2011 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp2027.codfw.wmnet, cp2035.codfw.wmnet, cp2039.codfw.wmnet, cp2037.codfw.wmnet, cp2031.codfw.wmnet are marked down but pooled: textlb6_443: Servers cp2027.codfw.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:30:16] PROBLEM - SSH on lvs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:30:58] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [17:31:08] RECOVERY - SSH on lvs1016 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:31:09] (03Merged) 10jenkins-bot: CX3 Build 1.0.0+20251126 [extensions/ContentTranslation] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214580 (https://phabricator.wikimedia.org/T384485) (owner: 10Sbisson) [17:31:12] RECOVERY - PyBal backends health check on lvs2011 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:31:40] RESOLVED: ProbeDown: Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Runbook#commons.wikimedia.org:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:31:41] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1214580|CX3 Build 1.0.0+20251126 (T384485)]] [17:32:01] T384485: Recommendation API: Support pagination for single page collection recommendations - https://phabricator.wikimedia.org/T384485 [17:32:38] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:33:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:33:54] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:34:10] RESOLVED: [3x] BFDdown: BFD session down between cr2-magru and 195.200.68.152 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:34:36] !log sbisson@deploy2002 sbisson: Backport for [[gerrit:1214580|CX3 Build 1.0.0+20251126 (T384485)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:35:11] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [17:35:19] Multiple reports of enwiki being unreachable, various errors [17:35:28] AntiComposite: thanks, known. [17:35:58] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [17:36:44] !log sbisson@deploy2002 sbisson: Continuing with sync [17:37:46] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr2-magru.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:39:53] FIRING: [2x] DDoSDetected: FastNetMon has detected an attack on esams #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected [17:40:48] !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214580|CX3 Build 1.0.0+20251126 (T384485)]] (duration: 09m 07s) [17:40:51] T384485: Recommendation API: Support pagination for single page collection recommendations - https://phabricator.wikimedia.org/T384485 [17:40:54] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS trixie [17:41:44] FIRING: RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:42:46] FIRING: [2x] Primary inbound port 
utilisation over 80% #page: Device cr2-magru.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:43:16] (03PS1) 10Cparle: Feature flag has been removed from MW code, so remove it from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214584 (https://phabricator.wikimedia.org/T410908) [17:44:34] (03PS1) 10Santiago Faci: LabsService: Rename mpic-next domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805) [17:44:53] FIRING: [6x] DDoSDetected: FastNetMon has detected an attack on eqiad #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected [17:45:58] !log sukhe@cumin1003 START - Cookbook sre.network.cf [17:46:00] !log sukhe@cumin1003 END (PASS) - Cookbook sre.network.cf (exit_code=0) [17:46:08] !log sukhe@cumin1003 START - Cookbook sre.network.cf [17:46:09] !log sukhe@cumin1003 END (PASS) - Cookbook sre.network.cf (exit_code=0) [17:46:12] !log sukhe@cumin1003 START - Cookbook sre.network.cf [17:46:19] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.network.cf (exit_code=1) [17:46:29] !log sukhe@cumin1003 START - Cookbook sre.network.cf [17:46:30] !log sukhe@cumin1003 END (PASS) - Cookbook sre.network.cf (exit_code=0) [17:46:40] (03PS2) 10Santiago Faci: LabsService: Rename mpic-next domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805) [17:46:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:46:54] (03PS3) 10Santiago Faci: LabsService: Rename mpic-next domain 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805) [17:47:17] (03PS1) 10Majavah: Prepepdn eqiad/eqsin/drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1214586 [17:47:46] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr3-eqsin.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:47:46] (03PS1) 10Ssingh: sites: set prepend_as_out to true [homer/public] - 10https://gerrit.wikimedia.org/r/1214587 [17:48:05] (03CR) 10Ssingh: [C:03+1] Prepepdn eqiad/eqsin/drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1214586 (owner: 10Majavah) [17:48:10] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1214586 (owner: 10Majavah) [17:48:23] (03Abandoned) 10Ssingh: sites: set prepend_as_out to true [homer/public] - 10https://gerrit.wikimedia.org/r/1214587 (owner: 10Ssingh) [17:48:44] (03CR) 10Majavah: [C:03+2] Prepepdn eqiad/eqsin/drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1214586 (owner: 10Majavah) [17:49:53] RESOLVED: [6x] DDoSDetected: FastNetMon has detected an attack on eqiad #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected [17:50:47] (03PS1) 10Ladsgroup: findBadBlobs: Fix the --scan-to option [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214592 (https://phabricator.wikimedia.org/T351953) [17:50:58] (03PS1) 10Ladsgroup: findBadBlobs: Fix the --scan-to option [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214593 (https://phabricator.wikimedia.org/T351953) [17:50:59] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices1006.eqiad.wmnet with OS trixie [17:51:06] jouncebot: nowandnexr [17:51:07] jouncebot: nowandnext [17:51:08] No deployments scheduled for the next 0 hour(s) and 8 minute(s) 
[17:51:08] In 0 hour(s) and 8 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800) [17:51:08] In 0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800) [17:54:50] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:55:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudservices1006 (172.20.1.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:57:03] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [18:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800) [18:01:10] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [18:01:20] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [18:01:42] jouncebot: refresh [18:01:42] I refreshed my knowledge about deployments. [18:01:48] jouncebot: nowandnext [18:01:48] For the next 0 hour(s) and 58 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800) [18:01:48] For the next 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800) [18:01:48] In 2 hour(s) and 58 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T2100) [18:01:53] Whut. 
[18:02:04] :D [18:02:12] That's wrong. [18:02:20] "Wikifunctions Services UTC Afternoon" was three hours ago. [18:03:11] The time stamps on the wiki look right? [18:03:16] Oh! [18:04:11] jouncebot: refresh [18:04:12] I refreshed my knowledge about deployments. [18:04:16] jouncebot: nowandnext [18:04:16] For the next 0 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800) [18:04:16] In 2 hour(s) and 55 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T2100) [18:04:19] Better. [18:04:58] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating for cloudceph to codfw - jhancock@cumin1003" [18:05:02] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating for cloudceph to codfw - jhancock@cumin1003" [18:05:02] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:05:10] Is anything happening on the infra window? 
[18:05:17] It doesn't look like it :D [18:05:30] (03CR) 10Ladsgroup: [C:03+2] findBadBlobs: Fix the --scan-to option [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214592 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup) [18:05:35] (03CR) 10Ladsgroup: [C:03+2] findBadBlobs: Fix the --scan-to option [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214593 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup) [18:08:44] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1006.eqiad.wmnet with reason: host reimage [18:10:55] (03Merged) 10jenkins-bot: findBadBlobs: Fix the --scan-to option [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214592 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup) [18:11:00] (03Merged) 10jenkins-bot: findBadBlobs: Fix the --scan-to option [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214593 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup) [18:12:09] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1006.eqiad.wmnet with reason: host reimage [18:17:56] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11429625 (10Jclark-ctr) @MoritzMuehlenhoff Drive has been Replaced [18:19:54] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1214592|findBadBlobs: Fix the --scan-to option (T351953)]], [[gerrit:1214593|findBadBlobs: Fix the --scan-to option (T351953)]] [18:19:57] T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953 [18:20:17] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11429645 (10Jclark-ctr) ` NAME SIZE MODEL SERIAL PATH sda 894.3G Micron_5400_MTFDDAK960TGA 
24144807E580 /dev/sda ├─sda1... [18:22:14] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1214592|findBadBlobs: Fix the --scan-to option (T351953)]], [[gerrit:1214593|findBadBlobs: Fix the --scan-to option (T351953)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:22:37] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [18:24:28] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11429667 (10Jclark-ctr) ` [Wed Dec 3 18:16:03 2025] ata7.00: detaching (SCSI 6:0:0:0) [Wed Dec 3 18:16:03 2025] sd 6:0:0:0: [sdb] Synchronizing SCSI cache [Wed Dec 3 18:16:03 2025] sd 6:0:0:0: [sdb] Synch... [18:24:34] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11429670 (10Jclark-ctr) a:03Jclark-ctr [18:25:13] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11429672 (10KFrancis) Hi @andrea.denisse, Caro Medelius (cmedelius@wikimedia.org) is already a WMF employee. The NDA is covered under their employment agreement with the WMF. 
[18:25:23] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1020.eqiad.wmnet with reason: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad [18:26:42] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214592|findBadBlobs: Fix the --scan-to option (T351953)]], [[gerrit:1214593|findBadBlobs: Fix the --scan-to option (T351953)]] (duration: 06m 48s) [18:26:45] T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953 [18:30:53] (03CR) 10Dzahn: [C:03+2] "lgtm, deploying soon" [puppet] - 10https://gerrit.wikimedia.org/r/1214109 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy) [18:30:58] (03PS4) 10Santiago Faci: LabsService: Rename mpic-next domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805) [18:32:26] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11429707 (10MoritzMuehlenhoff) Thanks, I'll rebuild the software RAID tomorrow [18:32:50] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:33:25] jouncebot: nowandnext [18:33:26] For the next 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T1800) [18:33:26] In 2 hour(s) and 26 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T2100) [18:34:08] scap config does not have a php_version version variable anymore now. but it was removed in scap itself https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/1021 [18:34:49] out of abundance of caution.. 
mentioning it for next scap run [18:35:00] (03CR) 10Cathal Mooney: [C:03+2] lvs1020: move row C vlans to primary and add new C/D per-rack vlans [puppet] - 10https://gerrit.wikimedia.org/r/1207877 (https://phabricator.wikimedia.org/T405609) (owner: 10Cathal Mooney) [18:35:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudservices1006 (172.20.1.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:36:55] (03CR) 10CDobbins: sre.loadbalancer: patch to fix reboot action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 (owner: 10CDobbins) [18:37:02] 10ops-eqiad, 06SRE, 06DC-Ops: Reclaim components from decommed servers - https://phabricator.wikimedia.org/T411533#11429730 (10VRiley-WMF) 05Open→03Resolved Swapped 6 x 1.6TB with 1.9 TB SSDs Reclaimed 8 x 32 gig pc4-2666 4 x 750w power supplies 10 x 32 gig pc4-3200 I was unable to swap some memor... [18:37:03] (03PS8) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 [18:37:13] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1001.eqiad.wmnet with OS trixie [18:45:16] (03CR) 10Dzahn: admin/releases: deprecate shell user group releasers-mwcli (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1213587 (owner: 10Dzahn) [18:52:18] 06SRE, 06Infrastructure-Foundations, 10Mail: Emails to Google group no-reply@wikimedia.org are not being delivered - SMTP server issue? - https://phabricator.wikimedia.org/T411027#11429823 (10JKelsoteel-WMF) Hey @jhathaway - thanks for your input! I shared these points with Noah as well, and we were able to... [18:52:47] 06SRE, 06Infrastructure-Foundations, 10Mail: Emails to Google group no-reply@wikimedia.org are not being delivered - SMTP server issue? 
- https://phabricator.wikimedia.org/T411027#11429824 (10taavi) 05Open→03Declined [18:54:51] (03PS1) 10Cathal Mooney: lvs interfaces: fix error in quoting new vlan ids [puppet] - 10https://gerrit.wikimedia.org/r/1214611 (https://phabricator.wikimedia.org/T405609) [18:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:55:27] (03PS1) 10Dzahn: releases: delete now pointless classes for deprecated user groups [puppet] - 10https://gerrit.wikimedia.org/r/1214612 [18:55:59] (03CR) 10Dzahn: admin/releases: deprecate shell user group releasers-mwcli (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1213587 (owner: 10Dzahn) [18:56:21] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7782/console" [puppet] - 10https://gerrit.wikimedia.org/r/1214611 (https://phabricator.wikimedia.org/T405609) (owner: 10Cathal Mooney) [18:56:36] (03CR) 10BCornwall: [V:03+1 C:03+1] lvs interfaces: fix error in quoting new vlan ids [puppet] - 10https://gerrit.wikimedia.org/r/1214611 (https://phabricator.wikimedia.org/T405609) (owner: 10Cathal Mooney) [18:57:18] (03CR) 10Catrope: [C:03+1] OATHAuth: Expand 2FA to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [18:59:02] (03PS1) 10Andrew Bogott: cloudservices1006: use new yaml-based pdns-recursor config [puppet] - 10https://gerrit.wikimedia.org/r/1214613 (https://phabricator.wikimedia.org/T375217) [18:59:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213585 
(https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [18:59:04] (03PS1) 10Andrew Bogott: pdns-recursor: use yaml-based config in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1214614 (https://phabricator.wikimedia.org/T375217) [18:59:21] (03PS1) 10Santiago Faci: wmgLocalServices: Renamed `mpic` to `test-kitchen` local service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214615 (https://phabricator.wikimedia.org/T407805) [18:59:38] (03CR) 10Andrew Bogott: [C:03+2] cloudservices1006: use new yaml-based pdns-recursor config [puppet] - 10https://gerrit.wikimedia.org/r/1214613 (https://phabricator.wikimedia.org/T375217) (owner: 10Andrew Bogott) [19:00:05] (03CR) 10CI reject: [V:04-1] wmgLocalServices: Renamed `mpic` to `test-kitchen` local service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214615 (https://phabricator.wikimedia.org/T407805) (owner: 10Santiago Faci) [19:00:20] (03CR) 10Cathal Mooney: [C:03+2] lvs interfaces: fix error in quoting new vlan ids [puppet] - 10https://gerrit.wikimedia.org/r/1214611 (https://phabricator.wikimedia.org/T405609) (owner: 10Cathal Mooney) [19:00:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11429856 (10BCornwall) [19:04:06] (03PS2) 10Andrew Bogott: pdns-recursor: use yaml-based config in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1214614 (https://phabricator.wikimedia.org/T375217) [19:04:15] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1214613 (https://phabricator.wikimedia.org/T375217) (owner: 10Andrew Bogott) [19:04:40] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1214614 (https://phabricator.wikimedia.org/T375217) (owner: 10Andrew Bogott) [19:06:43] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host 
lvs1020.eqiad.wmnet with OS bullseye [19:06:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T410589)', diff saved to https://phabricator.wikimedia.org/P86387 and previous config saved to /var/cache/conftool/dbconfig/20251203-190644-ladsgroup.json [19:06:48] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [19:06:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11429884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@c... [19:11:37] (03PS4) 10Kgraessle: Enable revertrisk filters in thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438) [19:12:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207923 (https://phabricator.wikimedia.org/T409438) (owner: 10Kgraessle) [19:12:58] 10ops-eqiad, 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11429917 (10VRiley-WMF) a:03VRiley-WMF [19:14:56] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices1006.eqiad.wmnet with OS trixie [19:15:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:19:35] (03CR) 
10Hashar: [C:04-1] Ease configuration of the motd banner (032 comments) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194221 (owner: 10Hashar) [19:20:51] RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:20:59] (03PS5) 10Hashar: Ease configuration of the motd banner [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194221 [19:21:04] (03PS1) 10Cathal Mooney: Revert "Prepepdn eqiad/eqsin/drmrs" [homer/public] - 10https://gerrit.wikimedia.org/r/1214618 [19:21:22] (03CR) 10Hashar: [C:03+2] Ease configuration of the motd banner [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194221 (owner: 10Hashar) [19:21:51] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1020.eqiad.wmnet with reason: host reimage [19:21:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P86388 and previous config saved to /var/cache/conftool/dbconfig/20251203-192152-ladsgroup.json [19:22:10] !log disabling remote announcement of bgp prefixes [19:22:10] (03CR) 10Ssingh: [C:03+1] Revert "Prepepdn eqiad/eqsin/drmrs" [homer/public] - 10https://gerrit.wikimedia.org/r/1214618 (owner: 10Cathal Mooney) [19:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:15] (03Merged) 10jenkins-bot: Ease configuration of the motd banner [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1194221 (owner: 10Hashar) [19:22:17] !log cmooney@cumin1003 START - Cookbook sre.network.cf 
[19:22:19] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.cf (exit_code=0) [19:22:33] (03CR) 10Cathal Mooney: [C:03+2] Revert "Prepepdn eqiad/eqsin/drmrs" [homer/public] - 10https://gerrit.wikimedia.org/r/1214618 (owner: 10Cathal Mooney) [19:22:58] !log hashar@deploy2002 Started deploy [gerrit/gerrit@93bde2a]: Ease configuration of the motd banner [19:23:07] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@93bde2a]: Ease configuration of the motd banner (duration: 00m 09s) [19:23:46] (03Merged) 10jenkins-bot: Revert "Prepepdn eqiad/eqsin/drmrs" [homer/public] - 10https://gerrit.wikimedia.org/r/1214618 (owner: 10Cathal Mooney) [19:25:02] PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100% [19:28:00] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11429987 (10Jclark-ctr) Dell ticket opened Service request 219590203 [19:28:48] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1020.eqiad.wmnet with reason: host reimage [19:29:58] FIRING: [2x] NELHigh: Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [19:30:04] RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 105.40 ms [19:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:32:58] FIRING: [2x] NELByCountryHigh: Elevated Network Error Logging events (tcp.address_unreachable from IT) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - 
https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [19:37:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P86390 and previous config saved to /var/cache/conftool/dbconfig/20251203-193659-ladsgroup.json [19:37:58] FIRING: [5x] NELByCountryHigh: Elevated Network Error Logging events (tcp.address_unreachable from ES) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [19:38:46] (03CR) 10Ahmon Dancy: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1214109 (https://phabricator.wikimedia.org/T396166) (owner: 10Ahmon Dancy) [19:38:52] (03PS1) 10D3r1ck01: User: Log where the data was loaded when CAS update failed [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214620 (https://phabricator.wikimedia.org/T410652) [19:39:07] (03PS1) 10D3r1ck01: User: Log where the data was loaded when CAS update failed [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214621 (https://phabricator.wikimedia.org/T410652) [19:39:43] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95152299 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:40:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1214620 (https://phabricator.wikimedia.org/T410652) (owner: 10D3r1ck01) [19:40:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the 
[Wednesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214621 (https://phabricator.wikimedia.org/T410652) (owner: 10D3r1ck01) [19:42:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10Recommendation-API: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11430063 (10Jclark-ctr) a:05Jclark-ctr→03None [19:42:59] RESOLVED: [3x] NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from ES) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [19:44:43] RESOLVED: [6x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:44:58] RESOLVED: [2x] NELHigh: Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [19:51:26] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1020.eqiad.wmnet with OS bullseye [19:51:31] (03CR) 10Dzahn: "pinged in Slack for verification - CCing clinic duty - fyi touching admin/data.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1214448 (owner: 10Awight) [19:51:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430081 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
started by brett@cumin... [19:52:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T410589)', diff saved to https://phabricator.wikimedia.org/P86392 and previous config saved to /var/cache/conftool/dbconfig/20251203-195207-ladsgroup.json [19:52:11] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [19:53:54] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:55:21] (03CR) 10Dzahn: "I think we need to coordinate a bit on the plans now given that the CDN thing has picked up speed now and the remaining time." [puppet] - 10https://gerrit.wikimedia.org/r/1211549 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [19:55:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudservices1006 (172.20.1.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:55:46] (03CR) 10Dzahn: [C:03+1] gerrit: re-enable backups on gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1211551 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [19:56:54] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:57:06] (03CR) 10Dzahn: [C:03+1] "+1 to the idea of monitoring this (without being able to actually test it.. can we test it? 
like merge and then write a high but not too " [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [19:57:07] (03CR) 10Ssingh: [C:03+2] geo-resources: add gerrit-addrs resource [dns] - 10https://gerrit.wikimedia.org/r/1214177 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [19:57:17] !log sukhe@dns1004 START - running authdns-update [19:58:27] !log sukhe@dns1004 END - running authdns-update [19:58:49] (03CR) 10Dzahn: [C:04-1] "let's please revisit this after the new gerrit-lb has been setup - which is happening very soon" [dns] - 10https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [19:58:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:59:45] (03CR) 10Santiago Faci: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214615 (https://phabricator.wikimedia.org/T407805) (owner: 10Santiago Faci) [20:00:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudservices1006 (172.20.1.5) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:01:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430098 (10cmooney) [20:02:04] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1211551 does the opposite - are both patches still useful now?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [20:03:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:04:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [20:04:46] (03CR) 10Dzahn: [C:03+1] conftool-data: add tcp-proxy gerrit service [puppet] - 10https://gerrit.wikimedia.org/r/1214454 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [20:06:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430105 (10Jclark-ctr) [20:07:40] (03CR) 10Dzahn: "kind of duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1202842 but also better since it adds both needed services - but" [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [20:09:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [20:11:19] (03CR) 10Ssingh: [C:03+2] conftool-data: geodns: add gerrit-addrs [puppet] - 10https://gerrit.wikimedia.org/r/1214192 (https://phabricator.wikimedia.org/T365259) 
(owner: 10Ssingh) [20:14:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430132 (10cmooney) [20:16:42] (03CR) 10Ssingh: [C:03+2] dns.admin: add gerrit-addrs resource [cookbooks] - 10https://gerrit.wikimedia.org/r/1214179 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [20:19:13] (03PS4) 10Dzahn: service: add gerrit service to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1202842 (https://phabricator.wikimedia.org/T408532) [20:19:35] (03PS5) 10Dzahn: service: add gerrit-https service to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1202842 (https://phabricator.wikimedia.org/T408532) [20:20:33] (03CR) 10Dzahn: "since https://gerrit.wikimedia.org/r/c/operations/puppet/+/1202842 already has reviews and I reacted to them.. and I have also heard comme" [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [20:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:22:41] (03Merged) 10jenkins-bot: dns.admin: add gerrit-addrs resource [cookbooks] - 10https://gerrit.wikimedia.org/r/1214179 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [20:22:53] (03CR) 10Dzahn: "if you are ok with it I would rebase this on the other so it becomes just the gerrit-ssh part. 
that way we get them both, each of us did o" [puppet] - https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: Jelto) [20:23:45] (Abandoned) Santiago Faci: wmgLocalServices: Renamed `mpic` to `test-kitchen` local service [mediawiki-config] - https://gerrit.wikimedia.org/r/1214615 (https://phabricator.wikimedia.org/T407805) (owner: Santiago Faci) [20:25:49] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices1005.eqiad.wmnet with OS trixie [20:25:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:25:57] (CR) Andrew Bogott: [C:+2] pdns-recursor: use yaml-based config in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1214614 (https://phabricator.wikimedia.org/T375217) (owner: Andrew Bogott) [20:26:27] ops-eqiad, SRE, DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11430168 (VRiley-WMF) Open→Resolved These cables at eqiad have been physically removed and deleted in netbox. 
[20:26:33] (PS5) Santiago Faci: Rename `mpic` local service to `test-kitchen` [mediawiki-config] - https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805) [20:27:25] (PS1) Andriy.v: Limit thanks for new users at uk.wikipedia to 3 per hour [mediawiki-config] - https://gerrit.wikimedia.org/r/1214631 [20:27:34] (CR) CI reject: [V:-1] Rename `mpic` local service to `test-kitchen` [mediawiki-config] - https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805) (owner: Santiago Faci) [20:28:08] (PS6) Santiago Faci: Rename `mpic` local service to `test-kitchen` [mediawiki-config] - https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805) [20:29:42] (CR) Ssingh: "Hi Scott. I will follow up on Monday with a tentative plan. Sorry about the delay -- we have been busy with other stuff and this got sidet" [puppet] - https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: CDobbins) [20:29:54] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:30:48] (PS7) Santiago Faci: Rename `mpic` local service to `test-kitchen` because of the platform renaming [mediawiki-config] - https://gerrit.wikimedia.org/r/1214585 (https://phabricator.wikimedia.org/T407805) [20:31:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-d5-eqiad and cloudservices1005 (172.20.2.4) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:32:39] (CR) Scott French: [C:+1] Remove obsolete Hiera entries [puppet] - https://gerrit.wikimedia.org/r/1214556 (https://phabricator.wikimedia.org/T311407) (owner: Muehlenhoff) [20:32:58] (CR) SBassett: "Ok, sounds good, thanks." 
[puppet] - https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: CDobbins) [20:35:16] (CR) Scott French: [C:+1] "Thanks, Moritz!" [puppet] - https://gerrit.wikimedia.org/r/1214550 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [20:35:18] (CR) Scott French: [C:+1] Switch conf2004 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1214553 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [20:35:19] (CR) Scott French: [C:+1] Switch conf1007 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1214557 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [20:35:20] (CR) Scott French: [C:+1] Switch conf1008 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1214558 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [20:35:23] (CR) Scott French: [C:+1] Switch conf1009 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1214561 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [20:36:31] (CR) Dzahn: [C:+1] "got confirmation on Slack" [puppet] - https://gerrit.wikimedia.org/r/1214448 (owner: Awight) [20:36:32] (PS1) Daniel Kinzler: rest gateway: use the new x-trusted-request header [deployment-charts] - https://gerrit.wikimedia.org/r/1214633 (https://phabricator.wikimedia.org/T410379) [20:38:47] (PS2) Andriy.v: Limit thanks for new users at uk.wikipedia to 3 per hour [mediawiki-config] - https://gerrit.wikimedia.org/r/1214631 [20:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:40:37] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices1005.eqiad.wmnet with reason: host reimage [20:41:11] (PS3) 
Andriy.v: Limit thanks for new users at uk.wikipedia to 3 per hour [mediawiki-config] - https://gerrit.wikimedia.org/r/1214631 [20:42:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:42:58] (PS1) Andriy.v: Limit thanks for new users at uk.wikipedia to 3 per hour [mediawiki-config] - https://gerrit.wikimedia.org/r/1214636 [20:43:02] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs1020.eqiad.wmnet [20:43:03] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1020.eqiad.wmnet [20:43:28] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11430232 (BCornwall) [20:44:15] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices1005.eqiad.wmnet with reason: host reimage [20:47:13] !log aqu@deploy2002 Started deploy [analytics/refinery@6dfb3b8] (hadoop-test): Deploy spur hqls TEST [analytics/refinery@6dfb3b8b] [20:48:14] !log aqu@deploy2002 Finished deploy [analytics/refinery@6dfb3b8] (hadoop-test): Deploy spur hqls TEST [analytics/refinery@6dfb3b8b] (duration: 01m 01s) [20:48:31] (CR) Dzahn: [C:+2] Regenerate awight yubikey [puppet] - https://gerrit.wikimedia.org/r/1214448 (owner: Awight) [20:49:11] !log aqu@deploy2002 Started deploy [analytics/refinery@6dfb3b8]: Deploy spur hqls [analytics/refinery@6dfb3b8b] [20:50:44] SRE, SRE-Access-Requests: Updating RobH ssh pubkey file to add fido backing - https://phabricator.wikimedia.org/T411678 (RobH) NEW [20:50:50] (PS1) RobH: 
RobH yubikey ssh pubkey update [puppet] - https://gerrit.wikimedia.org/r/1214638 (https://phabricator.wikimedia.org/T411678) [20:51:40] !log aqu@deploy2002 Finished deploy [analytics/refinery@6dfb3b8]: Deploy spur hqls [analytics/refinery@6dfb3b8b] (duration: 02m 29s) [20:51:58] !log aqu@deploy2002 Started deploy [analytics/refinery@6dfb3b8] (thin): Deploy spur hqls THIN [analytics/refinery@6dfb3b8b] [20:52:08] (CR) RobH: [C:+2] RobH yubikey ssh pubkey update [puppet] - https://gerrit.wikimedia.org/r/1214638 (https://phabricator.wikimedia.org/T411678) (owner: RobH) [20:52:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:53:14] !log aqu@deploy2002 Finished deploy [analytics/refinery@6dfb3b8] (thin): Deploy spur hqls THIN [analytics/refinery@6dfb3b8b] (duration: 01m 16s) [20:54:22] SRE, SRE-Access-Requests: Requesting access to druid pageviews_hourly for astein - https://phabricator.wikimedia.org/T411679 (AStein-WMF) NEW [20:56:10] SRE, SRE-Access-Requests, Patch-For-Review: Updating RobH ssh pubkey file to add fido backing - https://phabricator.wikimedia.org/T411678#11430286 (RobH) Open→Resolved [20:56:22] SRE, SRE-Access-Requests: Requesting access to druid pageviews_hourly for astein - https://phabricator.wikimedia.org/T411679#11430288 (greg) As @AStein-WMF 's manager, I approve. 
[20:56:46] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1214142 (https://phabricator.wikimedia.org/T411517) (owner: Aaron Schulz) [20:56:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1214142 (https://phabricator.wikimedia.org/T411517) (owner: Aaron Schulz) [20:57:24] SRE, SRE-Access-Requests, Fundraising-Backlog: Requesting access to druid pageviews_hourly for astein - https://phabricator.wikimedia.org/T411679#11430304 (greg) [20:58:31] SRE, SRE-Access-Requests, Fundraising-Backlog: Requesting access to druid pageviews_hourly for astein - https://phabricator.wikimedia.org/T411679#11430310 (AStein-WMF) context slack thread: https://wikimedia.slack.com/archives/CSV483812/p1764777672488669 2 things: # this is fairly time sensitive... [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T2100). [21:00:05] aude, danisztls, maryum, xSavitar, and AaronSchulz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:23] o/ [21:00:27] o/ [21:02:25] my change is low risk and can be batched with others [21:02:26] i can deploy config patches [21:02:36] hi! I can also deploy my own with spiderpig [21:03:02] I can self-deploy as well. Mine is also low risk, just increasing surveys coverage. 
[21:03:09] ok, then i will do just mine [21:03:12] starting [21:03:54] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:04:17] (CR) TrainBranchBot: [C:+2] "Approved by aude@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: LorenMora) [21:04:18] aude, go for it. I can self-service when it's time. [21:04:24] ok [21:04:24] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1214642 [21:05:01] (Merged) jenkins-bot: [Legal Footer] Create config for adding legal footer [mediawiki-config] - https://gerrit.wikimedia.org/r/1208380 (https://phabricator.wikimedia.org/T410163) (owner: LorenMora) [21:05:34] !log aude@deploy2002 Started scap sync-world: Backport for [[gerrit:1208380|[Legal Footer] Create config for adding legal footer (T410163)]] [21:05:38] T410163: [Legal Footer] Create config and logic for adding legal footer - https://phabricator.wikimedia.org/T410163 [21:06:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-d5-eqiad and cloudservices1005 (172.20.2.4) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:08:06] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [21:08:11] !log aude@deploy2002 aude, lmora: Backport for [[gerrit:1208380|[Legal Footer] Create config for adding legal footer (T410163)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:08:25] let me know when I can go [21:08:51] checking our change [21:08:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:09:53] maryum: I can deploy yours together with mine if you want. [21:10:00] yep please go ahead [21:10:08] !log aude@deploy2002 aude, lmora: Continuing with sync [21:13:06] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [21:13:19] danisztls: you can do mine as well :) [21:13:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:14:13] !log aude@deploy2002 Finished scap sync-world: Backport for [[gerrit:1208380|[Legal Footer] Create config for adding legal footer (T410163)]] (duration: 08m 38s) [21:14:16] T410163: [Legal Footer] Create config and logic for adding legal footer - https://phabricator.wikimedia.org/T410163 [21:14:19] we're done [21:14:50] (PS1) Jforrester: Followup I81a2c4de77: Verify stats label values are not empty [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1214647 (https://phabricator.wikimedia.org/T411585) [21:15:14] (CR) Jforrester: "Proposing as a cherry-pick rather than waiting two weeks to find out if this fixes the logspam (given there's no train next 
week)." [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1214647 (https://phabricator.wikimedia.org/T411585) (owner: Jforrester) [21:15:22] (CR) TrainBranchBot: [C:+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1214494 (https://phabricator.wikimedia.org/T410918) (owner: DDesouza) [21:15:22] (CR) TrainBranchBot: [C:+2] "Approved by dani@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664) (owner: Mstyles) [21:16:17] (Merged) jenkins-bot: Increase coverage of 2025 Global Readers Survey (non-enwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/1214494 (https://phabricator.wikimedia.org/T410918) (owner: DDesouza) [21:16:34] (Merged) jenkins-bot: OATHAuth: Expand 2FA to all users [mediawiki-config] - https://gerrit.wikimedia.org/r/1213585 (https://phabricator.wikimedia.org/T399664) (owner: Mstyles) [21:17:04] !log dani@deploy2002 Started scap sync-world: Backport for [[gerrit:1214494|Increase coverage of 2025 Global Readers Survey (non-enwiki) (T410918)]], [[gerrit:1213585|OATHAuth: Expand 2FA to all users (T399664)]] [21:17:08] T410918: Deploy 2025 Global Readers Surveys (non-English) - https://phabricator.wikimedia.org/T410918 [21:17:09] T399664: Expand 2FA Opt-In Privileges - https://phabricator.wikimedia.org/T399664 [21:19:40] !log dani@deploy2002 dani, mstyles: Backport for [[gerrit:1214494|Increase coverage of 2025 Global Readers Survey (non-enwiki) (T410918)]], [[gerrit:1213585|OATHAuth: Expand 2FA to all users (T399664)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:20:12] maryum: can you test? [21:20:21] yes I can test now? [21:20:28] maryum: yes [21:20:36] AaronSchulz: sorry, I saw your message too late [21:23:10] (Abandoned) Andriy.v: Limit thanks for new users at uk.wikipedia to 3 per hour [mediawiki-config] - https://gerrit.wikimedia.org/r/1214631 (owner: Andriy.v) [21:24:11] maryum: should I continue with sync? [21:24:17] yes please [21:24:22] !log dani@deploy2002 dani, mstyles: Continuing with sync [21:28:21] !log dani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214494|Increase coverage of 2025 Global Readers Survey (non-enwiki) (T410918)]], [[gerrit:1213585|OATHAuth: Expand 2FA to all users (T399664)]] (duration: 11m 18s) [21:28:26] T410918: Deploy 2025 Global Readers Surveys (non-English) - https://phabricator.wikimedia.org/T410918 [21:28:26] T399664: Expand 2FA Opt-In Privileges - https://phabricator.wikimedia.org/T399664 [21:28:46] xSavitar: all yours [21:28:57] Thanks! 
[21:29:45] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11430447 (andrea.denisse) [21:29:53] (CR) TrainBranchBot: [C:+2] "Approved by derick@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1214620 (https://phabricator.wikimedia.org/T410652) (owner: D3r1ck01) [21:29:53] (CR) TrainBranchBot: [C:+2] "Approved by derick@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1214621 (https://phabricator.wikimedia.org/T410652) (owner: D3r1ck01) [21:30:48] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11430448 (andrea.denisse) [21:30:59] AaronSchulz, I can ping you once I'm done. Sounds good? [21:31:07] ok [21:31:55] ops-eqiad, DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T411684 (phaultfinder) NEW [21:32:12] SRE, SRE-Access-Requests, Fundraising-Backlog: Requesting access to druid pageviews_hourly for astein - https://phabricator.wikimedia.org/T411679#11430460 (andrea.denisse) Open→In progress p:Triage→High a:andrea.denisse [21:32:42] danisztls thanks so much! 
[21:34:18] (Merged) jenkins-bot: User: Log where the data was loaded when CAS update failed [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1214620 (https://phabricator.wikimedia.org/T410652) (owner: D3r1ck01) [21:34:24] (Merged) jenkins-bot: User: Log where the data was loaded when CAS update failed [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1214621 (https://phabricator.wikimedia.org/T410652) (owner: D3r1ck01) [21:34:59] !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1214620|User: Log where the data was loaded when CAS update failed (T410652)]], [[gerrit:1214621|User: Log where the data was loaded when CAS update failed (T410652)]] [21:35:02] T410652: RuntimeException: CAS update failed on user_touched. The version of the user to be saved is older than the current version. - https://phabricator.wikimedia.org/T410652 [21:35:11] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [21:37:49] !log derick@deploy2002 derick, d3r1ck01: Backport for [[gerrit:1214620|User: Log where the data was loaded when CAS update failed (T410652)]], [[gerrit:1214621|User: Log where the data was loaded when CAS update failed (T410652)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:38:29] Nothing to verify, will monitor logs after deployment [21:38:33] !log derick@deploy2002 derick, d3r1ck01: Continuing with sync [21:40:27] (PS2) Bking: opensearch-operator: watch the correct namespaces in CODFW [deployment-charts] - https://gerrit.wikimedia.org/r/1214562 (https://phabricator.wikimedia.org/T410956) [21:41:30] (CR) Bking: [C:+2] opensearch-operator: watch the correct namespaces in CODFW [deployment-charts] - https://gerrit.wikimedia.org/r/1214562 (https://phabricator.wikimedia.org/T410956) (owner: Bking) [21:42:32] !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214620|User: Log where the data was loaded when CAS update failed (T410652)]], [[gerrit:1214621|User: Log where the data was loaded when CAS update failed (T410652)]] (duration: 07m 33s) [21:42:35] T410652: RuntimeException: CAS update failed on user_touched. The version of the user to be saved is older than the current version. - https://phabricator.wikimedia.org/T410652 [21:42:54] AaronSchulz, over to you. I'm done! 
[21:43:15] * AaronSchulz goes [21:43:53] (CR) TrainBranchBot: [C:+2] "Approved by aaron@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1214142 (https://phabricator.wikimedia.org/T411517) (owner: Aaron Schulz) [21:44:38] (Merged) jenkins-bot: Update Math API title and project-specific /math/ endpoint stability policy [mediawiki-config] - https://gerrit.wikimedia.org/r/1214142 (https://phabricator.wikimedia.org/T411517) (owner: Aaron Schulz) [21:45:10] !log aaron@deploy2002 Started scap sync-world: Backport for [[gerrit:1214142|Update Math API title and project-specific /math/ endpoint stability policy (T411517)]] [21:45:13] T411517: Clean up Math API OpenAPI specs and remove data-parsoid route specs - https://phabricator.wikimedia.org/T411517 [21:47:26] !log aaron@deploy2002 aaron: Backport for [[gerrit:1214142|Update Math API title and project-specific /math/ endpoint stability policy (T411517)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:48:41] (Merged) jenkins-bot: opensearch-operator: watch the correct namespaces in CODFW [deployment-charts] - https://gerrit.wikimedia.org/r/1214562 (https://phabricator.wikimedia.org/T410956) (owner: Bking) [21:48:54] (PS1) Majavah: dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656 [21:49:23] (CR) CI reject: [V:-1] dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656 (owner: Majavah) [21:49:34] !log aaron@deploy2002 aaron: Continuing with sync [21:50:10] (PS2) Majavah: dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656 [21:52:15] (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1214656 (owner: Majavah) [21:53:35] !log aaron@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214142|Update Math API title and project-specific /math/ endpoint stability policy (T411517)]] (duration: 08m 25s) [21:53:38] T411517: Clean up Math API OpenAPI specs and remove data-parsoid route specs - https://phabricator.wikimedia.org/T411517 [21:54:01] (PS3) Majavah: dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656 [21:55:45] done [21:57:34] (PS1) Majavah: Revert "pdns-recursor: use yaml-based config in eqiad1" [puppet] - https://gerrit.wikimedia.org/r/1214657 [21:58:53] (PS2) Majavah: Revert "pdns-recursor: use yaml-based config in eqiad1" [puppet] - https://gerrit.wikimedia.org/r/1214657 [21:59:45] (CR) Majavah: [C:+2] Revert "pdns-recursor: use yaml-based config in eqiad1" [puppet] - https://gerrit.wikimedia.org/r/1214657 (owner: Majavah) [22:00:05] Deploy window Wikifunctions Services UTC Late 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T2200) [22:02:11] (PS1) Mstyles: OATHAuth: Remove wmgOATHAuthDisableRight [mediawiki-config] - https://gerrit.wikimedia.org/r/1214659 (https://phabricator.wikimedia.org/T399664) [22:03:20] (PS2) Mstyles: OATHAuth: Remove wmgOATHAuthDisableRight [mediawiki-config] - https://gerrit.wikimedia.org/r/1214659 (https://phabricator.wikimedia.org/T399664) [22:04:22] (PS1) Majavah: Revert^2 "pdns-recursor: use yaml-based config in eqiad1" [puppet] - https://gerrit.wikimedia.org/r/1214660 [22:05:53] (CR) Majavah: [C:+2] Revert^2 "pdns-recursor: use yaml-based config in eqiad1" [puppet] - https://gerrit.wikimedia.org/r/1214660 (owner: Majavah) [22:07:25] (PS4) Majavah: dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656 [22:08:07] SRE, SRE-Access-Requests, Fundraising-Backlog, Patch-For-Review: Requesting access to druid pageviews_hourly for astein - https://phabricator.wikimedia.org/T411679#11430945 (AStein-WMF) also tagging in @BTullis [22:08:31] !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [22:09:09] !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[22:11:46] (PS5) Majavah: dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656 [22:13:14] !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:13:42] (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1214656 (owner: Majavah) [22:14:01] !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:14:19] (PS3) Bking: opensearch-ipoid-test: Add environment-specific values files for TLS/ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1213586 (https://phabricator.wikimedia.org/T410956) [22:14:31] (CR) Bking: [C:+2] opensearch-ipoid-test: Add environment-specific values files for TLS/ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1213586 (https://phabricator.wikimedia.org/T410956) (owner: Bking) [22:14:33] (CR) Catrope: [C:+1] OATHAuth: Remove wmgOATHAuthDisableRight [mediawiki-config] - https://gerrit.wikimedia.org/r/1214659 (https://phabricator.wikimedia.org/T399664) (owner: Mstyles) [22:16:10] (Merged) jenkins-bot: opensearch-ipoid-test: Add environment-specific values files for TLS/ingress [deployment-charts] - https://gerrit.wikimedia.org/r/1213586 (https://phabricator.wikimedia.org/T410956) (owner: Bking) [22:16:42] (CR) Andrew Bogott: dnsrecursor: Use additional_forward_zones with new config format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1214656 (owner: Majavah) [22:17:30] (PS6) Majavah: dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656 (https://phabricator.wikimedia.org/T381608) [22:18:08] (CR) Majavah: dnsrecursor: Use additional_forward_zones with new 
config format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1214656 (https://phabricator.wikimedia.org/T381608) (owner: Majavah) [22:19:28] (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1214656 (https://phabricator.wikimedia.org/T381608) (owner: Majavah) [22:21:11] (CR) Andrew Bogott: [C:+1] dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656 (https://phabricator.wikimedia.org/T381608) (owner: Majavah) [22:21:20] (CR) Majavah: [V:+1 C:+2] dnsrecursor: Use additional_forward_zones with new config format [puppet] - https://gerrit.wikimedia.org/r/1214656 (https://phabricator.wikimedia.org/T381608) (owner: Majavah) [22:25:57] (CR) JHathaway: firewall: Declare resources for both providers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: Majavah) [22:31:27] (PS1) Ryan Kemper: hadoop.reboot-workers: make host override smarter [cookbooks] - https://gerrit.wikimedia.org/r/1214664 (https://phabricator.wikimedia.org/T411568) [22:33:15] !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:33:18] !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:36:35] SRE-Access-Requests: Add FIDO-backed SSH key for brennen - https://phabricator.wikimedia.org/T411730 (brennen) NEW [22:37:02] (PS1) Brennen Bearnes: admin: add fido backed ssh key for brennen [puppet] - https://gerrit.wikimedia.org/r/1214665 (https://phabricator.wikimedia.org/T411730) [22:47:19] (CR) Andrew Bogott: [C:+1] openstack: puppet: Remove support for X-Enc-Edit-Git [puppet] - https://gerrit.wikimedia.org/r/1214490 (owner: 
Majavah) [22:48:07] (PS5) Andrea Denisse: Add astein to analytics-privatedata-users. [puppet] - https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) [22:50:54] !log maintenance on https://codesearch.wmcloud.org/ - trying to fix disk space issue [22:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:10] !log maintenance on https://codesearch.wmcloud.org/ - trying to fix disk space issue - detaching volume to extend it [22:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:43] SRE, SRE-Access-Requests, Fundraising-Backlog, Patch-For-Review: Requesting access to druid pageviews_hourly for astein - https://phabricator.wikimedia.org/T411679#11431064 (Ahoelzl) Approved. [22:54:55] (CR) RLazarus: [C:+1] Add astein to analytics-privatedata-users. [puppet] - https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: Andrea Denisse) [22:55:04] (CR) Dzahn: [C:-1] "I don't think this needs shell access - it sounds like it's all about access to private data on dashboards - so that is level 1" [puppet] - https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: Andrea Denisse) [22:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:55:28] SRE, SRE-Access-Requests, Fundraising-Backlog, Patch-For-Review: Requesting access to analytics-privatedata-users for astein - https://phabricator.wikimedia.org/T411679#11431069 (Novem_Linguae) [22:55:46] ops-eqiad, DC-Ops: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731 (Jhancock.wm) NEW [22:56:24] (CR) Dzahn: [C:-1] "you can copy from one of the existing users with the line " 
ssh_keys: [] # Added with no SSH access, for membership in analytics-priva" [puppet] - 10https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: 10Andrea Denisse) [22:57:45] (03CR) 10Dzahn: [C:04-1] "generally yet another case of https://phabricator.wikimedia.org/T405517" [puppet] - 10https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: 10Andrea Denisse) [22:58:56] (03PS6) 10Andrea Denisse: Add astein to analytics-privatedata-users. [puppet] - 10https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251203T2300) [23:01:56] (03CR) 10Dzahn: [C:03+1] "yea, this should fix access on dashboards" [puppet] - 10https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: 10Andrea Denisse) [23:01:59] 06SRE, 10SRE-Access-Requests, 06Fundraising-Backlog, 13Patch-For-Review: Requesting access to analytics-privatedata-users for astein - https://phabricator.wikimedia.org/T411679#11431105 (10Dzahn) [23:02:00] 06SRE, 06Data-Platform-SRE: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11431106 (10Dzahn) [23:02:33] (03CR) 10Andrea Denisse: "Thanks Daniel, I've updated the patch." [puppet] - 10https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: 10Andrea Denisse) [23:02:34] (03CR) 10Btullis: [C:03+1] "As I understand it, the user requires shell level access to query druid programmatically from a stat host." 
[puppet] - 10https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: 10Andrea Denisse) [23:04:19] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11431113 (10bking) [23:05:05] (03PS7) 10Andrea Denisse: Add astein to analytics-privatedata-users. [puppet] - 10https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) [23:05:11] (03CR) 10Jasmine: [C:03+1] "Looks good, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1213604 (owner: 10RLazarus) [23:06:06] (03CR) 10Dzahn: "well, the ticket asks for " Specifically, i'm trying to programmatically access the data in this turnilo dash " and stat hosts or other th" [puppet] - 10https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: 10Andrea Denisse) [23:08:20] !log hard rebooting codesearch9.codesearch.eqiad1.wikimedia.cloud (T411728) [23:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:23] T411728: Codesearch down/unreachable (2025-12-03) - https://phabricator.wikimedia.org/T411728 [23:12:28] 06SRE, 06Data-Platform-SRE: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11431123 (10Dzahn) More examples: T411679 - requestor actively says they don't know the level - request gets approved regardless - discussion on actual code review if shell a... [23:13:26] Amir1: if that service is down that is a good thing :P [23:13:43] I step back, Have fun :P [23:13:46] Amir1: sounds weird. lol.. what I mean is.. 
I wanted to unmount the volume basically when that ticket came in [23:13:49] let me know if I can help on anything [23:14:01] yeah, I know, don't worry [23:14:10] have to unmount the volume to resize it [23:14:32] and have to resize it because there is still not enough space to make a larger volume than the existing one [23:14:39] without asking for more quota yet another time [23:14:49] I hate the resizefs part though [23:15:24] Amir1: rebooting it was great though, needed that :) [23:19:01] ok, shutting it down again for maintenance action.. [23:19:38] (03PS1) 10CDanis: support.arraynetworks.net should not trigger error alerts [puppet] - 10https://gerrit.wikimedia.org/r/1214670 (owner: 10Jdlrobson) [23:26:10] Amir1: this was actually less painful and time consuming than expected. double the size of /srv/ and with it there are plenty of inodes now. til next time [23:26:28] (only possible after the quota request) [23:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:31:33] (03PS3) 10Mstyles: OATHAuth: Remove wmgOATHAuthDisableRight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214659 (https://phabricator.wikimedia.org/T399664) [23:44:01] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to v1.35.6 - https://phabricator.wikimedia.org/T410975#11431189 (10RLazarus) Envoy 1.35.7 is about to come out, with security fixes: https://groups.google.com/g/envoy-announce/c/zr2OzwmJFqY None of these issues affect us urgently, but since we're early in the 1.35... 
[23:44:18] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975#11431190 (10RLazarus) [23:54:39] (03CR) 10RLazarus: [C:03+2] deployment_server: Write Envoy hieradata to YAML files for sophroid [puppet] - 10https://gerrit.wikimedia.org/r/1213604 (owner: 10RLazarus) [23:54:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:55:06] mutante: Thank you! [23:55:55] :) [23:59:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency