[00:01:52] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11117467 (VRiley-WMF) These have been added into Netbox
[00:07:25] !log Run systemctl reset-failed on disappeared nrpe2nodexp-disk_space.timer units (T395446)
[00:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:30] T395446: Evaluate which solution we could adopt as a drop-in replacement for NRPE (and start prototyping) - https://phabricator.wikimedia.org/T395446
[00:08:47] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1181811
[00:08:47] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1181811 (owner: TrainBranchBot)
[00:11:32] ugh, I can't get the host to shut down. This is the first time, they usually shut down when we don't want them to...
[00:16:06] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1244.eqiad.wmnet
[00:16:19] PROBLEM - MariaDB Replica Lag: s4 #page on db1244 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1948.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:16:20] * Josve05a urging myself not to do a "Have you tried turning it off and on again" joke
[00:16:40] !incidents
[00:16:41] 6702 (UNACKED) db1244 (paged)/MariaDB Replica Lag: s4 (paged)
[00:16:41] 6699 (RESOLVED) ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet esams)
[00:16:43] !ack 6702
[00:16:44] 6702 (ACKED) db1244 (paged)/MariaDB Replica Lag: s4 (paged)
[00:16:51] here
[00:16:56] o/
[00:17:05] all good, it's just T402871
[00:17:06] T402871: Switchover s4 master (db1244 -> db1160) - https://phabricator.wikimedia.org/T402871
[00:17:07] \o
[00:17:28] * swfrench-wmf thumbs up
[00:17:31] Amir1: downtime expired maybe?
[00:18:01] rzl: sigh, this is the cookbook being an idiot and removing down time when it shouldn't
[00:18:05] ahh
[00:18:10] awesome
[00:18:10] lovely
[00:18:42] this is not even the first time
[00:18:42] things just keep wanting to be up :)
[00:18:55] sorry for the noise
[00:19:02] no worries, thanks for all your work!
[00:19:06] I have so many more gray hairs since last week
[00:19:09] :(
[00:19:27] You'll look distinguished, like michael tilson thomas
[00:20:20] RECOVERY - MariaDB Replica Lag: s4 #page on db1244 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:20:34] <3
[00:20:41] the coolest one with gray hair is Steven Tyler
[00:22:45] Steven looks like Jason Momoa and Jack Sparrow had a child who aged in reverse
[00:24:23] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1244.eqiad.wmnet with reason: Maintenance
[00:24:47] lol fair but the songs are cool
[00:24:53] (Aerosmith)
[00:29:58] ops-codfw, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402875 (phaultfinder) NEW
[00:32:49] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1181811 (owner: TrainBranchBot)
[00:34:42] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[00:35:38] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2213.codfw.wmnet with reason: Maintenance
[00:36:19] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[00:39:30] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2205.codfw.wmnet with reason: Maintenance
[00:47:08] !log ladsgroup@cumin1002
DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[00:47:46] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2229.codfw.wmnet with reason: Maintenance
[00:48:18] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[00:49:05] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2220.codfw.wmnet with reason: Maintenance
[00:49:30] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[00:50:37] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2207.codfw.wmnet with reason: Maintenance
[00:55:12] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1244.eqiad.wmnet with reason: Maintenance
[00:59:45] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[00:59:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T391056)', diff saved to https://phabricator.wikimedia.org/P81747 and previous config saved to /var/cache/conftool/dbconfig/20250826-005952-ladsgroup.json
[00:59:58] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[01:01:54] (CR) Scott French: [C:+1] mathoid: Upgrade to envoy-future:1.26.8-2 for validation [deployment-charts] - https://gerrit.wikimedia.org/r/1181806 (https://phabricator.wikimedia.org/T402584) (owner: RLazarus)
[01:06:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T391056)', diff saved to https://phabricator.wikimedia.org/P81748 and previous config saved to
/var/cache/conftool/dbconfig/20250826-010618-ladsgroup.json
[01:06:23] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[01:08:05] (PS1) TrainBranchBot: Branch commit for wmf/1.45.0-wmf.16 [core] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1181816 (https://phabricator.wikimedia.org/T396377)
[01:08:07] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.45.0-wmf.16 [core] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1181816 (https://phabricator.wikimedia.org/T396377) (owner: TrainBranchBot)
[01:08:44] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1244.eqiad.wmnet with reason: Maintenance
[01:13:50] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11117537 (phaultfinder)
[01:16:51] (PS1) Andrew Bogott: Horizon: use dalmatian-versioned config template on codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1181817
[01:16:51] (PS1) Andrew Bogott: Horizon: use dalmatian-versioned config template on eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1181818
[01:18:52] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11117543 (phaultfinder)
[01:20:13] (CR) Andrew Bogott: [C:+2] Horizon: use dalmatian-versioned config template on codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1181817 (owner: Andrew Bogott)
[01:20:41] (Merged) jenkins-bot: Branch commit for wmf/1.45.0-wmf.16 [core] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1181816 (https://phabricator.wikimedia.org/T396377) (owner: TrainBranchBot)
[01:21:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P81749 and previous config saved to
/var/cache/conftool/dbconfig/20250826-012125-ladsgroup.json
[01:23:12] (CR) Andrew Bogott: [C:+2] Horizon: use dalmatian-versioned config template on eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1181818 (owner: Andrew Bogott)
[01:32:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:36:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:36:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P81750 and previous config saved to /var/cache/conftool/dbconfig/20250826-013633-ladsgroup.json
[01:43:13] PROBLEM - dump of x1 in codfw on backupmon1001 is CRITICAL: Last dump for x1 at codfw (db2197) taken on 2025-08-26 00:25:30 is 82 GiB, but the previous one was 68 GiB, a change of +20.8 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:45:30] (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1181819
[01:45:34] (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181820
[01:45:38] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1181821
[01:51:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T391056)', diff saved to https://phabricator.wikimedia.org/P81751 and previous config saved to
/var/cache/conftool/dbconfig/20250826-015141-ladsgroup.json
[01:51:46] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[01:56:32] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11117609 (Novem_Linguae) The user replied "calm down" instead of making the requested change. Not a great sign. Agree that maybe a sysadmin should just make t...
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T0200)
[02:04:11] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11117617 (Zache) >>! In T402749#11117609, @Novem_Linguae wrote: > The user replied "calm down" instead of making the requested change. Not a great sign. Agree...
[02:10:55] PROBLEM - dump of x1 in eqiad on backupmon1001 is CRITICAL: Last dump for x1 at eqiad (db1216) taken on 2025-08-26 00:00:06 is 82 GiB, but the previous one was 68 GiB, a change of +20.7 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:13:20] (PS1) RLazarus: deployment_server: mwscript_k8s refactor [puppet] - https://gerrit.wikimedia.org/r/1181824
[02:16:34] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!"
[puppet] - https://gerrit.wikimedia.org/r/1181791 (https://phabricator.wikimedia.org/T321808) (owner: Cwhite)
[02:17:12] !log on db2202 creating copy of enwiki.recentchanges for performance analysis T400696
[02:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:17:18] T400696: FY25-26 WE1.4.1 RecentChanges database performance improvements - https://phabricator.wikimedia.org/T400696
[02:20:48] (CR) RLazarus: "Bottom of the priority list for sure -- and with commensurate review time expected! -- it was just nagging at me. I decided not to factor " [puppet] - https://gerrit.wikimedia.org/r/1181824 (owner: RLazarus)
[02:40:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:41:43] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[02:42:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:45:23] !incidents
[02:45:23] 6703 (UNACKED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[02:45:24] 6704 (UNACKED) VarnishUnavailable global sre (varnish-upload thanos-rule)
[02:45:24] 6705 (UNACKED) HaproxyUnavailable cache_upload global sre (thanos-rule)
[02:45:24] 6702 (RESOLVED) db1244 (paged)/MariaDB Replica Lag: s4 (paged)
[02:45:24] 6699 (RESOLVED)
ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet esams)
[02:45:31] !ack 6703
[02:45:31] 6703 (ACKED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[02:45:33] !ack 6704
[02:45:34] 6704 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule)
[02:45:37] !ack 6705
[02:45:38] 6705 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule)
[02:45:57] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:46:43] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[02:47:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:58:41] !incidents
[02:58:41] 6705 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule)
[02:58:42] 6704 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule)
[02:58:42] 6703 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[02:58:42] 6702 (RESOLVED) db1244 (paged)/MariaDB Replica Lag: s4 (paged)
[02:58:42] 6699 (RESOLVED) ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet esams)
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T0300)
[03:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:14:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:15:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:19:01] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:19:57] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:19:57] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:21:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:24:51] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:24:51] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 4.316 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:52:13] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11117660 (Josve05a) Only an interface sysops will be able to edit the user's specific .js pages (not a mere regular sysops as myself), but unless they act the...
[03:54:26] SRE, Traffic, Patch-For-Review, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11117666 (DavidBrooks) You ask clients to respect HTTP code 429 Too Many Requests. Returning to AutoWikiBrowser: the current code will simply throw a fai...
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T0400)
[04:01:15] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.13 (duration: 01m 11s)
[04:19:01] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:21:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:25:13] PROBLEM - Backup freshness on backup1014 is CRITICAL: All failures: 2 (install4003, ...), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:49:55] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:49:55] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:50:55] (CR) Pppery: "Something very suspicious is going on here."
[puppet] - https://gerrit.wikimedia.org/r/1181820 (owner: Ncmonitor)
[04:54:51] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.138 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:54:51] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 5.380 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:08:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:18:55] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11117682 (phaultfinder)
[05:19:57] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:19:57] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:23:58] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11117683 (phaultfinder)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T0600)
[06:00:05] marostegui, Amir1, and federico3: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T0600).
[06:23:39] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:25:09] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:28:04] (PS1) Abijeet Patro: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1181830 (https://phabricator.wikimedia.org/T402496)
[06:34:51] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 5.582 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:34:51] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.661 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:42:22] (CR) Tiziano Fogli: [C:+1] opensearch: selectively enable cluster health check [puppet] - https://gerrit.wikimedia.org/r/1181791 (https://phabricator.wikimedia.org/T321808) (owner: Cwhite)
[06:45:27] (CR) Tiziano Fogli: [C:+1] cirrussearch: add disk space check overrides [alerts] - https://gerrit.wikimedia.org/r/1179178 (https://phabricator.wikimedia.org/T332764) (owner: Cwhite)
[06:47:50] (PS1) DCausse: SECURITY: declare PoolCounter settings for cirrusbuilddoc [mediawiki-config] - https://gerrit.wikimedia.org/r/1182023 (https://phabricator.wikimedia.org/T401220)
[06:49:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1182023 (https://phabricator.wikimedia.org/T401220) (owner: DCausse)
[06:49:57] PROBLEM - mailman archives on
lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:49:57] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:50:09] (PS3) Filippo Giunchedi: openstack: switch libvirt live migration uri to cloud-private hostnames [puppet] - https://gerrit.wikimedia.org/r/1181676 (https://phabricator.wikimedia.org/T355145)
[06:53:21] (PS1) Arnaudb: gerrit: lower thresholds to drop abusers quicker [puppet] - https://gerrit.wikimedia.org/r/1181990 (https://phabricator.wikimedia.org/T402611)
[06:54:32] (CR) Arnaudb: [C:+2] gerrit: lower thresholds to drop abusers quicker [puppet] - https://gerrit.wikimedia.org/r/1181990 (https://phabricator.wikimedia.org/T402611) (owner: Arnaudb)
[07:00:05] Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T0700).
[07:00:05] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:12] o/
[07:00:39] I can deploy
[07:01:13] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1182023 (https://phabricator.wikimedia.org/T401220) (owner: DCausse)
[07:02:07] (Merged) jenkins-bot: SECURITY: declare PoolCounter settings for cirrusbuilddoc [mediawiki-config] - https://gerrit.wikimedia.org/r/1182023 (https://phabricator.wikimedia.org/T401220) (owner: DCausse)
[07:02:31] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1182023|SECURITY: declare PoolCounter settings for cirrusbuilddoc (T401220)]]
[07:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:05:43] (PS1) Abijeet Patro: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1182031 (https://phabricator.wikimedia.org/T402496)
[07:08:27] !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1182023|SECURITY: declare PoolCounter settings for cirrusbuilddoc (T401220)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:08:48] testing
[07:14:14] (CR) Ayounsi: [C:+1] Update DHCP server in drmrs [homer/public] - https://gerrit.wikimedia.org/r/1181734 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[07:14:33] (CR) Ayounsi: [C:+1] Point DHCP server in drmrs to install6003 [puppet] - https://gerrit.wikimedia.org/r/1181733 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[07:14:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:15:04] (PS2) Slyngshede: P:cache::varnish::frontend user-agent rate limit cleanup [puppet] - https://gerrit.wikimedia.org/r/1181679 (https://phabricator.wikimedia.org/T400119)
[07:15:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:17:07] (PS3) Slyngshede: P:cache::varnish::frontend user-agent rate limit cleanup [puppet] - https://gerrit.wikimedia.org/r/1181679 (https://phabricator.wikimedia.org/T400119)
[07:20:25] (CR) Ayounsi: "Awesome thanks a lot, that's exactly what we needed !"
[homer/public] - https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: Cathal Mooney)
[07:20:42] (CR) Abijeet Patro: [C:+1] Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1182031 (https://phabricator.wikimedia.org/T402496) (owner: Abijeet Patro)
[07:23:29] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1182031 (https://phabricator.wikimedia.org/T402496) (owner: Abijeet Patro)
[07:24:47] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:24:47] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:26:34] dcausse: Let me know once backport is done. I'm deploying patch from abijeet after that.
[07:26:55] kart_: sure, sorry testing is taking a bit longer than expected
[07:27:04] no worries!
[07:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:29:44] (PS1) Vgutierrez: varnish: Add a user-agent to run.py HTTPS requests [puppet] - https://gerrit.wikimedia.org/r/1182033 (https://phabricator.wikimedia.org/T400119)
[07:30:34] (CR) Vgutierrez: [C:+1] "varnishtests are happy for both upload & text clusters" [puppet] - https://gerrit.wikimedia.org/r/1181679 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[07:30:49] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1182033 (https://phabricator.wikimedia.org/T400119) (owner: Vgutierrez)
[07:31:04] SRE, SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779#11117750 (FCeratto-WMF)
[07:31:09] (CR) Vgutierrez: [C:+2] varnish: Add a user-agent to run.py HTTPS requests [puppet] - https://gerrit.wikimedia.org/r/1182033 (https://phabricator.wikimedia.org/T400119) (owner: Vgutierrez)
[07:34:27] (PS1) Filippo Giunchedi: wmcs: alert on nova agents unavailable [alerts] - https://gerrit.wikimedia.org/r/1182034 (https://phabricator.wikimedia.org/T402778)
[07:39:57] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:39:57] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:41:37] (PS1) Brouberol: airflow: grant permissions to get/list events in the namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1182078
[07:42:42] !log dcausse@deploy1003 dcausse: Continuing with
sync [07:48:09] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182023|SECURITY: declare PoolCounter settings for cirrusbuilddoc (T401220)]] (duration: 45m 38s) [07:48:40] (03PS1) 10Arnaudb: gerrit: throttle update comment [puppet] - 10https://gerrit.wikimedia.org/r/1182082 (https://phabricator.wikimedia.org/T402847) [07:48:42] (03CR) 10Arnaudb: [C:03+2] gerrit: throttle update comment [puppet] - 10https://gerrit.wikimedia.org/r/1182082 (https://phabricator.wikimedia.org/T402847) (owner: 10Arnaudb) [07:49:26] (03CR) 10Slyngshede: [C:03+2] P:cache::varnish::frontend user-agent rate limit cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1181679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [07:49:40] dcausse Done? [07:49:47] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.351 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:49:47] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.458 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:49:48] kart_: no sorry [07:49:59] oh OK :) [07:50:18] I need to revert... :( [07:50:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:50:30] No problem. 
[07:50:34] (03PS1) 10DCausse: Revert "SECURITY: declare PoolCounter settings for cirrusbuilddoc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182083 [07:50:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182083 (owner: 10DCausse) [07:51:49] (03Merged) 10jenkins-bot: Revert "SECURITY: declare PoolCounter settings for cirrusbuilddoc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182083 (owner: 10DCausse) [07:52:04] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1182083|Revert "SECURITY: declare PoolCounter settings for cirrusbuilddoc"]] [07:52:10] (03CR) 10Vgutierrez: sre.loadbalancer: modify admin.py to accept 'reboot' action (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (owner: 10CDobbins) [07:53:13] (03CR) 10Kosta Harlan: [C:04-1] "Let me double-check with hCaptcha about this one before we merge it" [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó) [07:55:21] FIRING: [2x] PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:55:39] ^ this is probably me, revert in progress [07:56:40] (03CR) 10Kosta Harlan: [C:04-1] "Can we add a "Depends-On" or otherwise reference the relevant changes for DNS and backend.yaml here?" [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó) [07:57:29] !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1182083|Revert "SECURITY: declare PoolCounter settings for cirrusbuilddoc"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:57:51] !log dcausse@deploy1003 dcausse: Continuing with sync [08:00:04] andre and jnuche: Time to do the MediaWiki train - Utc-0 Version deploy. Don't look at me like that. 
You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T0800). [08:00:45] FIRING: [2x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [08:00:51] jnuche: OK to do one backport before train? [08:01:19] kart_: yes [08:01:33] train is waiting for some rebased patches anyway [08:01:58] cool. Waiting for revert from dcausse to finish. [08:02:58] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182083|Revert "SECURITY: declare PoolCounter settings for cirrusbuilddoc"]] (duration: 10m 53s) [08:04:07] kart_: give me one more minute to verify that the alerts are gone [08:04:26] sure [08:05:12] kart_: all good, sorry about that [08:05:21] RESOLVED: [2x] PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [08:05:27] no problem. 
[08:05:27] (03PS1) 10MVernon: Thanos: remove drained thanos-be2005 for disk controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1182084 (https://phabricator.wikimedia.org/T400876) [08:05:45] RESOLVED: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [08:06:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182031 (https://phabricator.wikimedia.org/T402496) (owner: 10Abijeet Patro) [08:08:02] (03Merged) 10jenkins-bot: Setup tracking for CentralNotice banners experiment for WE2.1.1 [extensions/WikimediaCampaignEvents] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182031 (https://phabricator.wikimedia.org/T402496) (owner: 10Abijeet Patro) [08:08:16] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1182031|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] [08:08:21] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [08:10:57] (03CR) 10Filippo Giunchedi: [C:03+2] openstack: switch libvirt live migration uri to cloud-private hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1181676 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [08:13:33] !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1182031|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:13:39] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [08:19:05] PROBLEM - nova-compute proc minimum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:19:11] sigh, that's me [08:19:18] dcaro: ^ [08:19:41] okok [08:19:53] an alerting/puppet race I think, I'm roll-restarting nova-compute and libvirt due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1181676?usp=dashboard [08:20:05] RECOVERY - nova-compute proc minimum on cloudvirt1061 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [08:20:08] I was thinking now that maybe it was one of those "I did not resolve the ack in victorops" thingies [08:20:10] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: KubernetesContainerReachingMemoryLimit (instance wikikube-worker1119.eqiad.wmnet) - https://phabricator.wikimedia.org/T402886 (10LSobanski) 03NEW [08:20:12] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1182084 (https://phabricator.wikimedia.org/T400876) (owner: 10MVernon) [08:20:41] 07sre-alert-triage, 06serviceops: Alert in need of triage: PuppetConstantChange (instance wikikube-worker-exp1001:9100) - https://phabricator.wikimedia.org/T402887 (10LSobanski) 03NEW [08:20:54] 07sre-alert-triage, 06serviceops: Alert in need of triage: PuppetConstantChange (instance wikikube-worker-exp2001:9100) - https://phabricator.wikimedia.org/T402888 (10LSobanski) 03NEW [08:21:07] (03PS1) 10Filippo Giunchedi: openstack: move nova-compute alerts to higher level [puppet] - 10https://gerrit.wikimedia.org/r/1182085 (https://phabricator.wikimedia.org/T402778) [08:21:09] hehe [08:21:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on 
build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:22:25] Still on testing.. [08:22:41] (03PS14) 10Arnaudb: gerrit: mod qos configuration [puppet] - 10https://gerrit.wikimedia.org/r/1181124 (https://phabricator.wikimedia.org/T402611) [08:27:23] (03PS3) 10Máté Szabó: hcaptcha: Add proxied CSP reporting endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1181675 [08:27:46] (03CR) 10Máté Szabó: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó) [08:29:41] !log kartik@deploy1003 abi, kartik: Continuing with sync [08:31:54] (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó) [08:32:22] (03CR) 10MVernon: [C:03+2] Thanos: remove drained thanos-be2005 for disk controller swap [puppet] - 10https://gerrit.wikimedia.org/r/1182084 (https://phabricator.wikimedia.org/T400876) (owner: 10MVernon) [08:32:49] (03CR) 10Kosta Harlan: [C:04-1] "I think this probably needs to go out together with I4cbe2ad4a8e8dc83bc6b2e07d47ec6b4d14c347a, so I'm going to place a -1 on it for now. (" [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó) [08:35:02] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182031|Setup tracking for CentralNotice banners experiment for WE2.1.1 (T402496)]] (duration: 26m 46s) [08:35:07] T402496: Tracking code for Scenarios 1 for WE2.1.1 - https://phabricator.wikimedia.org/T402496 [08:37:25] (03CR) 10Fabfur: [C:03+1] "TIL about errors=replace, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1181125 (https://phabricator.wikimedia.org/T402634) (owner: 10Vgutierrez) [08:38:10] kart_: all done with backporting? [08:38:56] andre_: yes. done. 
[08:39:03] alright, thanks [08:39:26] (03CR) 10David Caro: "Looks good, though I have some questions xd" [alerts] - 10https://gerrit.wikimedia.org/r/1182034 (https://phabricator.wikimedia.org/T402778) (owner: 10Filippo Giunchedi) [08:42:00] (03PS1) 10DCausse: Revert "NetworkSession: Only enable for private wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182088 (https://phabricator.wikimedia.org/T401220) [08:43:50] (03CR) 10Máté Szabó: "Probably easiest to just squash the patches then. I'll do that." [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó) [08:44:26] (03PS2) 10DCausse: Revert "NetworkSession: Only enable for private wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182088 (https://phabricator.wikimedia.org/T373826) [08:45:27] (03PS1) 10Fabfur: profile:cache: setting UA for haproxy tests [puppet] - 10https://gerrit.wikimedia.org/r/1182089 (https://phabricator.wikimedia.org/T400119) [08:45:59] (03PS4) 10Máté Szabó: hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181675 [08:46:43] (03Abandoned) 10Máté Szabó: hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó) [08:47:46] (03CR) 10CI reject: [V:04-1] profile:cache: setting UA for haproxy tests [puppet] - 10https://gerrit.wikimedia.org/r/1182089 (https://phabricator.wikimedia.org/T400119) (owner: 10Fabfur) [08:49:53] (03CR) 10JMeybohm: [V:03+2 C:03+2] "Merging as per IRC, @kharlan@wikimedia.org will validate functionality after the deploy" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [08:50:56] (03CR) 10Vgutierrez: [C:03+2] varnish: Fix UnicodeDecodeError in varnishlog output parsing [puppet] - 10https://gerrit.wikimedia.org/r/1181125 (https://phabricator.wikimedia.org/T402634) (owner: 10Vgutierrez) [08:51:05] (03CR) 10Máté Szabó: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó) [08:53:12] (03CR) 10Vgutierrez: profile:cache: setting UA for haproxy tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182089 (https://phabricator.wikimedia.org/T400119) (owner: 10Fabfur) [08:53:28] (03PS2) 10Fabfur: profile:cache: setting UA for haproxy tests [puppet] - 10https://gerrit.wikimedia.org/r/1182089 (https://phabricator.wikimedia.org/T400119) [08:54:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:57:41] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [08:59:18] 07sre-alert-triage, 06serviceops: Alert in need of triage: PuppetConstantChange (instance wikikube-worker-exp1001:9100) - https://phabricator.wikimedia.org/T402887#11118034 (10JMeybohm) a:03jijiki @jijiki could you take a look? 
I think this is alerting since you set the instances up in {T276994} [08:59:38] (03CR) 10Vgutierrez: [C:03+1] profile:cache: setting UA for haproxy tests [puppet] - 10https://gerrit.wikimedia.org/r/1182089 (https://phabricator.wikimedia.org/T400119) (owner: 10Fabfur) [09:00:04] (03CR) 10Muehlenhoff: [C:03+2] Assign installserver role to install6003 [puppet] - 10https://gerrit.wikimedia.org/r/1181732 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [09:00:34] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182091 (https://phabricator.wikimedia.org/T396377) [09:00:36] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182091 (https://phabricator.wikimedia.org/T396377) (owner: 10TrainBranchBot) [09:01:23] (03CR) 10Volans: "General approach LGTM, a logical question and some minor comments inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1181795 (owner: 10JHathaway) [09:01:32] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182091 (https://phabricator.wikimedia.org/T396377) (owner: 10TrainBranchBot) [09:03:34] !log aklapper@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.16 refs T396377 [09:03:40] T396377: 1.45.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T396377 [09:10:08] (03CR) 10Brouberol: [C:03+2] stat: deploy an analytics-ml keytab on each host [puppet] - 10https://gerrit.wikimedia.org/r/1180559 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [09:13:02] (03CR) 10Muehlenhoff: [C:03+2] Update DHCP server in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1181734 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [09:13:19] (03CR) 10Muehlenhoff: [C:03+2] Point DHCP server in drmrs to install6003 [puppet] - 10https://gerrit.wikimedia.org/r/1181733 
(https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [09:22:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11118155 (10MatthewVernon) [09:23:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11118157 (10MatthewVernon) @Jhancock.wm thanos-be2005 is now also ready to go; if you've not time to do 3 hosts today, please do the... [09:23:51] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11118158 (10phaultfinder) [09:24:58] (03CR) 10Hnowlan: [C:03+1] hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó) [09:25:56] (03PS1) 10Vgutierrez: varnish: Fix UnicodeDecodeError in varnish output parsing take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1182094 (https://phabricator.wikimedia.org/T402634) [09:28:12] (03PS2) 10Vgutierrez: varnish: Fix UnicodeDecodeError in varnish output parsing take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1182094 (https://phabricator.wikimedia.org/T402634) [09:28:51] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11118176 (10phaultfinder) [09:32:38] (03PS1) 10Brouberol: stat: group chown the analytics-ml to analytics-ml-users [puppet] - 10https://gerrit.wikimedia.org/r/1182095 (https://phabricator.wikimedia.org/T400902) [09:33:33] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6740/co" [puppet] - 10https://gerrit.wikimedia.org/r/1182095 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [09:34:28] (03CR) 10Fabfur: [C:03+1] varnish: 
Fix UnicodeDecodeError in varnish output parsing take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1182094 (https://phabricator.wikimedia.org/T402634) (owner: 10Vgutierrez) [09:34:47] (03CR) 10Vgutierrez: [C:03+2] varnish: Fix UnicodeDecodeError in varnish output parsing take #2 [puppet] - 10https://gerrit.wikimedia.org/r/1182094 (https://phabricator.wikimedia.org/T402634) (owner: 10Vgutierrez) [09:34:50] (03PS1) 10Tiziano Fogli: mirrormaker: add alerts directly in Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) [09:34:50] (03CR) 10Tiziano Fogli: "Requesting a first pass of review from the O11y team, as these checks are the first ones using prometheus::alert::rule outside the O11y pe" [puppet] - 10https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [09:35:36] 07sre-alert-triage, 06serviceops: Alert in need of triage: PuppetConstantChange (instance wikikube-worker-exp1001:9100) - https://phabricator.wikimedia.org/T402887#11118205 (10Clement_Goubert) I pushed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1181673?usp=dashboard yesterday but apparently that wasn... [09:35:43] 07sre-alert-triage, 06serviceops: Alert in need of triage: PuppetConstantChange (instance wikikube-worker-exp1001:9100) - https://phabricator.wikimedia.org/T402887#11118206 (10Clement_Goubert) p:05Triage→03Medium [09:41:46] (03CR) 10Muehlenhoff: [C:03+2] Point webproxy in drmrs to install6003 [dns] - 10https://gerrit.wikimedia.org/r/1181736 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [09:41:51] !log jmm@dns1004 START - running authdns-update [09:43:02] !log jmm@dns1004 END - running authdns-update [09:43:21] 06SRE, 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11118244 (10Miriam) Oh no I am so sorry, I thought I've already approved this! So sorry for the delay. I approve yes! 
[09:45:55] (03CR) 10Ozge: [C:03+1] stat: group chown the analytics-ml to analytics-ml-users [puppet] - 10https://gerrit.wikimedia.org/r/1182095 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [09:47:13] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install5002.wikimedia.org [09:47:55] (03CR) 10Fabfur: profile:cache: setting UA for haproxy tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182089 (https://phabricator.wikimedia.org/T400119) (owner: 10Fabfur) [09:48:06] (03CR) 10Fabfur: [C:03+2] profile:cache: setting UA for haproxy tests [puppet] - 10https://gerrit.wikimedia.org/r/1182089 (https://phabricator.wikimedia.org/T400119) (owner: 10Fabfur) [09:49:17] (03PS1) 10Clément Goubert: mw_experimental: More permission fixes [puppet] - 10https://gerrit.wikimedia.org/r/1182101 (https://phabricator.wikimedia.org/T402887) [09:49:34] !log aklapper@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.16 refs T396377 (duration: 45m 59s) [09:49:37] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182101 (https://phabricator.wikimedia.org/T402887) (owner: 10Clément Goubert) [09:49:39] T396377: 1.45.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T396377 [09:50:04] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182102 (https://phabricator.wikimedia.org/T396377) [09:50:06] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182102 (https://phabricator.wikimedia.org/T396377) (owner: 10TrainBranchBot) [09:51:00] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182102 (https://phabricator.wikimedia.org/T396377) (owner: 10TrainBranchBot) [09:51:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:52:23] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:52:51] (03CR) 10Clément Goubert: [C:03+2] mw_experimental: More permission fixes [puppet] - 10https://gerrit.wikimedia.org/r/1182101 (https://phabricator.wikimedia.org/T402887) (owner: 10Clément Goubert) [09:54:01] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [09:54:14] (03CR) 10Stevemunene: [C:03+1] airflow: grant permissions to get/list events in the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182078 (owner: 10Brouberol) [09:54:25] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [09:54:35] FIRING: JobUnavailable: Reduced availability for job squid in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:55:30] (03CR) 10Volans: "reply inline" [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [09:56:05] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install5002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:56:23] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install5002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [09:56:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:56:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install5002.wikimedia.org [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:56:37] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11118288 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install5002.wikimedia.org` - install5002.wikimedia.org (**PASS**) - Do... [09:57:08] (03PS1) 10Máté Szabó: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - 10https://gerrit.wikimedia.org/r/1182106 [09:57:54] (03PS2) 10Máté Szabó: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - 10https://gerrit.wikimedia.org/r/1182106 [09:58:29] (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (owner: 10Máté Szabó) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T1000) [10:00:12] ^^^ Note that I am still deploying the train to group0; should be done in ~10-15min if no revert is needed [10:01:20] (03CR) 10Kosta Harlan: [C:03+1] "We have approval for this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó) [10:01:54] (03CR) 10JMeybohm: [V:03+2 C:03+2] "LGTM, thanks for the update and informative commit message!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167888 (owner: 10Volans) [10:02:48] (03PS1) 10Ayounsi: asw1-b3-magru: remove sandbox firewall [homer/public] - 10https://gerrit.wikimedia.org/r/1182109 (https://phabricator.wikimedia.org/T402372) [10:03:56] 07sre-alert-triage, 06serviceops: Alert in need of triage: PuppetConstantChange (instance wikikube-worker-exp2001:9100) - https://phabricator.wikimedia.org/T402888#11118327 (10Clement_Goubert) Same issue as T402887 [10:04:49] andre_: ack [10:05:22] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.16 refs T396377 [10:05:28] T396377: 1.45.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T396377 [10:06:36] (03CR) 10Kosta Harlan: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (owner: 10Máté Szabó) [10:06:43] (03CR) 10Cathal Mooney: [C:03+1] asw1-b3-magru: remove sandbox firewall [homer/public] - 10https://gerrit.wikimedia.org/r/1182109 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [10:08:07] group0 looks okay so far, handing over to the MediaWiki infrastructure timeslot. 
Sorry for being a bit over time [10:08:11] claime, ^ [10:08:20] all good, ty [10:09:18] (03CR) 10Ayounsi: [C:03+2] asw1-b3-magru: remove sandbox firewall [homer/public] - 10https://gerrit.wikimedia.org/r/1182109 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [10:10:27] (03Merged) 10jenkins-bot: ServiceOps: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167888 (owner: 10Volans) [10:10:33] (03Merged) 10jenkins-bot: asw1-b3-magru: remove sandbox firewall [homer/public] - 10https://gerrit.wikimedia.org/r/1182109 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [10:11:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:42] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install6002.wikimedia.org [10:21:09] (03PS1) 10Santiago Faci: xLab: Deploy v0.8.4 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182112 (https://phabricator.wikimedia.org/T380592) [10:21:50] jmm@cumin2002 decommission (PID 3961681) is awaiting input [10:22:49] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11118402 (10MoritzMuehlenhoff) [10:23:39] RESOLVED: JobUnavailable: Reduced availability for job squid in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:26:31] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:28:25] 10SRE-swift-storage, 10MediaWiki-Uploading, 07Wikimedia-production-error: UploadChunkFileException: Error storing file in '{chunkPath}': backend-fail-internal; local-swift-codfw 
- https://phabricator.wikimedia.org/T395049#11118418 (10Aklapper) There are three more such entries today in Logstash, all around `... [10:28:39] FIRING: [2x] JobUnavailable: Reduced availability for job squid in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:30:34] !log jmm@dns1004 START - running authdns-update [10:31:32] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install6002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:31:42] !log jmm@dns1004 END - running authdns-update [10:31:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install6002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:31:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:31:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install6002.wikimedia.org [10:32:02] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11118446 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install6002.wikimedia.org` - install6002.wikimedia.org (**PASS**) - Do... 
[10:37:05] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db1244 gradually with 4 steps - Work done [10:39:56] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:39:56] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:44:52] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 5.618 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:44:52] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.785 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:45:20] (03CR) 10Bartosz Dziewoński: [C:03+1] Revert "NetworkSession: Only enable for private wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182088 (https://phabricator.wikimedia.org/T373826) (owner: 10DCausse) [10:49:34] RESOLVED: JobUnavailable: Reduced availability for job squid in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:51:34] (03PS1) 10Muehlenhoff: Remove obsolete stub keytabs [labs/private] - 10https://gerrit.wikimedia.org/r/1182114 (https://phabricator.wikimedia.org/T396487) [10:53:15] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete stub keytabs [labs/private] - 10https://gerrit.wikimedia.org/r/1182114 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [10:53:53] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11118539 (10MoritzMuehlenhoff) [10:58:23] (03PS3) 10Máté Szabó: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - 
10https://gerrit.wikimedia.org/r/1182106 [10:58:45] (03CR) 10Máté Szabó: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (owner: 10Máté Szabó) [10:58:55] (03PS4) 10Máté Szabó: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - 10https://gerrit.wikimedia.org/r/1182106 [11:00:57] (03CR) 10CI reject: [V:04-1] hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (owner: 10Máté Szabó) [11:02:17] (03PS5) 10Máté Szabó: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - 10https://gerrit.wikimedia.org/r/1182106 [11:03:56] (03PS6) 10Máté Szabó: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - 10https://gerrit.wikimedia.org/r/1182106 [11:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:08:06] (03CR) 10Volans: "Sorry for the delay, I did complete a full pass and left various comments. Feel free to ping me to discuss any of them in mode detail." 
[software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [11:08:18] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es2038 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1182116 (https://phabricator.wikimedia.org/T402912) [11:09:10] (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (owner: 10Máté Szabó) [11:09:58] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: Primary switchover es7 T402912 [11:10:03] T402912: Switchover es7 master (es2039 -> es2038) - https://phabricator.wikimedia.org/T402912 [11:10:12] (03PS1) 10Muehlenhoff: Rebuild against the new ffmpeg 5.1.7 security release in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1182117 [11:10:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Set es2038 with weight 0 T402912', diff saved to https://phabricator.wikimedia.org/P81758 and previous config saved to /var/cache/conftool/dbconfig/20250826-111015-ladsgroup.json [11:13:34] (03CR) 10Hnowlan: [C:03+1] Rebuild against the new ffmpeg 5.1.7 security release in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1182117 (owner: 10Muehlenhoff) [11:14:08] (03CR) 10Hnowlan: [C:03+2] hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó) [11:14:24] (03CR) 10Ladsgroup: [C:03+2] mariadb: Promote es2038 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1182116 (https://phabricator.wikimedia.org/T402912) (owner: 10Gerrit maintenance bot) [11:14:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [11:14:57] !log Starting es7 codfw failover from es2039 to es2038 - 
T402912 [11:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Promote es2038 to es7 primary T402912', diff saved to https://phabricator.wikimedia.org/P81759 and previous config saved to /var/cache/conftool/dbconfig/20250826-111630-ladsgroup.json [11:16:36] T402912: Switchover es7 master (es2039 -> es2038) - https://phabricator.wikimedia.org/T402912 [11:19:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool es2039 T402912', diff saved to https://phabricator.wikimedia.org/P81760 and previous config saved to /var/cache/conftool/dbconfig/20250826-111927-ladsgroup.json [11:22:28] (03CR) 10Muehlenhoff: [C:03+2] Rebuild against the new ffmpeg 5.1.7 security release in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1182117 (owner: 10Muehlenhoff) [11:22:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1244 gradually with 4 steps - Work done [11:24:11] (03CR) 10Elukey: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1181101 (https://phabricator.wikimedia.org/T291905) (owner: 10Vgutierrez) [11:25:10] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on es1039.eqiad.wmnet with reason: Glow up (T399927) [11:25:15] T399927: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927 [11:25:27] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on es2039.codfw.wmnet with reason: Glow up (T399927) [11:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:30:21] 10ops-codfw, 06SRE, 
06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11118696 (10Ladsgroup) es2039 is shut down and ready for you. I rather not move it for now. Let me know if I picked the wrong NIC interface. [11:37:46] Hi folks, I need to run 3 more DB queries in x1.wikishared to fix the same production error as yesterday. The queries are at the bottom of T402239#11118710 and really fast. I'd like to run them now unless told otherwise. [11:37:47] T402239: RuntimeException: Event should have only one address. - https://phabricator.wikimedia.org/T402239 [11:42:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182088 (https://phabricator.wikimedia.org/T373826) (owner: 10DCausse) [11:44:31] (03PS1) 10Muehlenhoff: humbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182119 [11:48:57] 07sre-alert-triage, 06serviceops: Alert in need of triage: PuppetConstantChange (instance wikikube-worker-exp1001:9100) - https://phabricator.wikimedia.org/T402887#11118756 (10Clement_Goubert) 05Open→03Resolved Permission fix looks like it resolved the alert. 
[11:49:29] 07sre-alert-triage, 06serviceops: Alert in need of triage: PuppetConstantChange (instance wikikube-worker-exp2001:9100) - https://phabricator.wikimedia.org/T402888#11118761 (10Clement_Goubert) →14Duplicate dup:03T402887 [11:49:30] 07sre-alert-triage, 06serviceops: Alert in need of triage: PuppetConstantChange (instance wikikube-worker-exp1001:9100) - https://phabricator.wikimedia.org/T402887#11118763 (10Clement_Goubert) [11:55:58] !log Running queries from T402239#11118710 in x1.wikishared to fix broken event addresses (again) [11:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:04] T402239: RuntimeException: Event should have only one address. - https://phabricator.wikimedia.org/T402239 [11:59:39] (03CR) 10David Caro: [C:03+1] "LGTM once the other is in" [puppet] - 10https://gerrit.wikimedia.org/r/1182085 (https://phabricator.wikimedia.org/T402778) (owner: 10Filippo Giunchedi) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T1200) [12:06:51] (03PS2) 10Filippo Giunchedi: wmcs: alert on nova agents unavailable [alerts] - 10https://gerrit.wikimedia.org/r/1182034 (https://phabricator.wikimedia.org/T402778) [12:07:01] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181687 (owner: 10PipelineBot) [12:07:04] (03CR) 10Filippo Giunchedi: "After a chat on IRC I have flipped to page, and we can/should revisit at the team meeting. 
I have also scoped the alert to nova-compute fo" [alerts] - 10https://gerrit.wikimedia.org/r/1182034 (https://phabricator.wikimedia.org/T402778) (owner: 10Filippo Giunchedi) [12:07:35] (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180933 (owner: 10PipelineBot) [12:07:44] (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180931 (owner: 10PipelineBot) [12:07:56] (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1180848 (owner: 10PipelineBot) [12:09:13] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181687 (owner: 10PipelineBot) [12:11:38] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:12:02] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:14:33] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:15:17] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:15:28] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:16:14] (03CR) 10Brouberol: [V:03+1 C:03+2] stat: group chown the analytics-ml to analytics-ml-users [puppet] - 10https://gerrit.wikimedia.org/r/1182095 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [12:16:15] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:18:01] (03CR) 10David Caro: [C:03+1] "LGTM (keeping current alerts more or less, after the meet we can see if we drop the page)" [alerts] - 10https://gerrit.wikimedia.org/r/1182034 (https://phabricator.wikimedia.org/T402778) (owner: 10Filippo Giunchedi) [12:21:25] (03PS1) 10Cyndywikime: [Growth]: Remove obsolete no-link-recommendation variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182125 
(https://phabricator.wikimedia.org/T402769) [12:32:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): decommission an-druid100[1-2] - https://phabricator.wikimedia.org/T402814#11118862 (10Jclark-ctr) [12:32:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.08.16 - 2025.09.05): decommission an-druid100[1-2] - https://phabricator.wikimedia.org/T402814#11118863 (10Jclark-ctr) 05Open→03Resolved [12:33:58] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11118868 (10Jhancock.wm) it's on site.pp and the pressed [12:34:44] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11118876 (10Jclark-ctr) a:03VRiley-WMF [12:35:40] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11118885 (10Jclark-ctr) [12:40:26] (03CR) 10Phuedx: [C:03+1] xLab: Deploy v0.8.4 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182112 (https://phabricator.wikimedia.org/T380592) (owner: 10Santiago Faci) [12:40:54] (03CR) 10Slyngshede: [V:03+1 C:03+1] "LGTM, the header is still set in Envoy and TrafficServer by some Lua scripts, but that is unrelated." [puppet] - 10https://gerrit.wikimedia.org/r/1181134 (owner: 10Vgutierrez) [12:47:07] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11118953 (10cmooney) >>! In T378828#11116548, @VRiley-WMF wrote: > @cmooney Thanks! The second link on cloudcephosd1045 in port 23 in cloudsw1-d5-eqiad. I also made a few changes to the cable... [12:48:48] !log stevemunene@cumin1003 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[12:49:57] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:49:57] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:50:16] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [12:52:12] (03CR) 10Hnowlan: [C:03+1] humbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182119 (owner: 10Muehlenhoff) [12:54:04] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dnscloudcephosd1052 - jclark@cumin1002" [12:54:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dnscloudcephosd1052 - jclark@cumin1002" [12:54:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:54:36] (03CR) 10Andrew Bogott: "If I understand this correctly, the service status is coming from a nova API call itself, right? 
Will we get some other kind of useful ale" [alerts] - 10https://gerrit.wikimedia.org/r/1182034 (https://phabricator.wikimedia.org/T402778) (owner: 10Filippo Giunchedi) [12:54:49] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 2.919 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:54:49] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 2.930 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:55:06] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:55:06] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1052.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:55:48] (03CR) 10Muehlenhoff: [C:03+2] humbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182119 (owner: 10Muehlenhoff) [12:56:55] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [12:57:04] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [12:59:24] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11118973 (10Andrew) 05Open→03Resolved These are all working now! Thanks @VRiley-WMF and @cmooney [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T1300). [13:00:05] MatmaRex and dcausse: A patch you scheduled for UTC afternoon backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:38] hi [13:01:08] i need someone to deploy things for me :) [13:01:10] o/ [13:01:44] I can deploy :) [13:01:58] MatmaRex: I had a look at your changes earlier – they can probably all be deployed together, right? [13:02:08] should be safe to assume the maintenance scripts won’t do anything on their own [13:02:08] yeah [13:02:32] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [13:02:49] ok then let’s go [13:03:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181782 (https://phabricator.wikimedia.org/T402602) (owner: 10Bartosz Dziewoński) [13:03:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181788 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [13:03:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181789 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:06:34] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [13:08:18] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11119026 (10Andrew) 05Open→03Resolved [13:09:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1052.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:11:49] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1052.eqiad.wmnet with OS bullseye [13:13:35] (03CR) 10Eevans: [C:03+1] 
image-suggestion: cleanup unused refs to service listener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171703 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [13:17:07] (03CR) 10Ssingh: dnsrecursor: add recursor.yml.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [13:18:34] (03CR) 10Ssingh: dnsrecursor: add recursor.yml.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [13:20:03] (03Merged) 10jenkins-bot: PHPSessionHandler: Better handle objects stored in the session [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181782 (https://phabricator.wikimedia.org/T402602) (owner: 10Bartosz Dziewoński) [13:20:05] (03Merged) 10jenkins-bot: Add maint script to fix global edit count of renamed users [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181788 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [13:20:07] (03Merged) 10jenkins-bot: Add maint script to fix wrong actors in local log entries for global renames [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181789 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [13:20:13] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [13:20:43] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1181782|PHPSessionHandler: Better handle objects stored in the session (T402602)]], [[gerrit:1181788|Add maint script to fix global edit count of renamed users (T313900)]], [[gerrit:1181789|Add maint script to fix wrong actors in local log entries for global renames (T398177)]] [13:20:50] T402602: Storing objects in session data causes unnecessary session writes, and emits spurious warnings with $wgPHPSessionHandling = 'warn' - 
https://phabricator.wikimedia.org/T402602 [13:20:51] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [13:20:51] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [13:21:56] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [13:22:11] (03PS1) 10MVernon: swift: add 4 new eqiad frontends ms-fe10[17-20] [puppet] - 10https://gerrit.wikimedia.org/r/1182142 (https://phabricator.wikimedia.org/T401448) [13:22:24] (03CR) 10Cathal Mooney: "Thanks a mil for the feedback! Some great points I'll get working on them, will likely ping you on some of them where I'm less sure. Che" [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:24:42] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:24:54] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [13:25:11] (03CR) 10Eevans: [C:03+1] Revert "NetworkSession: Only enable for private wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182088 (https://phabricator.wikimedia.org/T373826) (owner: 10DCausse) [13:25:53] (03CR) 10Eevans: [C:03+1] swift: add 4 new eqiad frontends ms-fe10[17-20] [puppet] - 10https://gerrit.wikimedia.org/r/1182142 (https://phabricator.wikimedia.org/T401448) (owner: 10MVernon) [13:26:32] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [13:26:59] !log lucaswerkmeister-wmde@deploy1003 matmarex, lucaswerkmeister-wmde: Backport for [[gerrit:1181782|PHPSessionHandler: Better handle objects stored in the session (T402602)]], [[gerrit:1181788|Add maint script to fix global edit count of renamed users (T313900)]], [[gerrit:1181789|Add maint script to fix wrong actors in local log entries for 
global renames (T398177)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:27:07] T402602: Storing objects in session data causes unnecessary session writes, and emits spurious warnings with $wgPHPSessionHandling = 'warn' - https://phabricator.wikimedia.org/T402602 [13:27:07] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [13:27:08] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [13:27:15] not much to test here [13:27:22] i can double-check that logins are not broken [13:27:26] sounds good [13:27:43] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:27:55] looks good [13:28:03] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2039 [13:28:13] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2039 [13:28:24] !log lucaswerkmeister-wmde@deploy1003 matmarex, lucaswerkmeister-wmde: Continuing with sync [13:28:26] alright, thanks [13:28:26] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host frmx2002 [13:28:41] was the merge slower than usual, btw? [13:28:43] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host frmx2002 [13:28:51] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11119108 (10phaultfinder) [13:29:06] I don’t think so? 
[13:29:16] to me it felt normally slow for a backport (as opposed to a config change) [13:29:19] certainly slower than yesterday… https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1181695/1#message-abd5f5b64427439318572bd5ad2e27054a08a7b7 vs https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1181782/2#message-495f44aa981247d37947790a4bc53a04850eb800 [13:29:32] from 1 minute to 15 minutes [13:29:50] (03CR) 10MVernon: [C:03+2] swift: add 4 new eqiad frontends ms-fe10[17-20] [puppet] - 10https://gerrit.wikimedia.org/r/1182142 (https://phabricator.wikimedia.org/T401448) (owner: 10MVernon) [13:29:52] well, yesterday was “Skipping remaining commands due to success cache hit” [13:29:58] (not sure where the cache entry came from tbh) [13:30:21] oh, right [13:30:27] yeah that explains it [13:30:32] i'm not sure either [13:30:33] looks like you got lucky and all slow builds hit the success cache yesterday [13:30:49] *looks* I guess it came from the main test build? [13:30:56] (03CR) 10Urbanecm: "code-wise, looks good, this is now blocked on the deployment date (likely Sept 02, but I asked in Slack to double check)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime) [13:31:09] yeah… i think it's supposed to be shared with test builds [13:31:13] not sure why that wouldn’t have been the case today, maybe something else got backported in the meantime [13:31:21] so if there were no other merges on that branch in the meantime, the cache is used [13:31:27] (03PS7) 10Máté Szabó: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - 10https://gerrit.wikimedia.org/r/1182106 [13:31:30] i prepared today's patch yesterday [13:31:38] (03CR) 10Urbanecm: [C:04-1] "actually...i just noticed one last thing, i'm sorry. see inline!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime) [13:31:39] but i prepared yesterday's patch like 15 minutes before the window [13:31:47] mhm [13:31:55] so… we should probably rebase or "recheck" the patches before the window [13:32:14] i haven't really thought about that before [13:33:37] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1181782|PHPSessionHandler: Better handle objects stored in the session (T402602)]], [[gerrit:1181788|Add maint script to fix global edit count of renamed users (T313900)]], [[gerrit:1181789|Add maint script to fix wrong actors in local log entries for global renames (T398177)]] (duration: 12m 54s) [13:33:39] actually, a change in any repository on the wmf branch would probably invalidate the cache, right? [13:33:45] T402602: Storing objects in session data causes unnecessary session writes, and emits spurious warnings with $wgPHPSessionHandling = 'warn' - https://phabricator.wikimedia.org/T402602 [13:33:45] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [13:33:45] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [13:33:51] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11119135 (10phaultfinder) [13:33:54] so https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaCampaignEvents/+/1182031 might have been the change to invalidate the cache (backported earlier this morning) [13:33:58] anyway [13:34:12] let’s run those maintenance scripts [13:34:13] (03CR) 10Urbanecm: [C:04-1] "also cross-linking my comment on 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1182125" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime) [13:34:42] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-fe[1017-1020].eqiad.wmnet with reason: reboot before bringing into service [13:35:01] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [13:35:25] yes, i think that's how it works [13:35:29] MatmaRex: any idea how long those scripts will take? [13:35:41] minutes [13:35:43] ok [13:35:51] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: CentralAuth:FixRenamedUserGlobalEditCount metawiki # T313900 (dry run) [13:36:09] * Lucas_WMDE chuckles at Hoo'sRenameTest1 [13:36:15] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11119143 (10Urbanecm_WMF) > Our goal is to block all traffic from unidentified clients and not coming from authorized actors, like toolsforge or our intern... 
[13:36:17] the first one is faster, since it's just one run, maybe it will be just seconds [13:36:36] the second one does about as much work as the first one, but for every wiki [13:36:40] so far it’s all “already correct” and “not found (maybe renamed again)” [13:36:55] yeah I need to look up how to foreachwikiindblist on k8s first [13:37:02] (also, how convenient that we have the sul dblist now ^^) [13:37:29] it probably has to scan all entries since like 2014, but the bug it fixes was only introduced in 202x (not sure when) [13:38:10] ok https://wikitech.wikimedia.org/wiki/Maintenance_scripts#Running_on_multiple_wikis_(the_safe_way) looks promising [13:38:30] “Usurped account 20141121” sounds like it’s still in 2014 [13:39:43] RECOVERY - Swift https backend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.445 second response time https://wikitech.wikimedia.org/wiki/Swift [13:39:43] RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.884 second response time https://wikitech.wikimedia.org/wiki/Swift [13:39:43] RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 1.387 second response time https://wikitech.wikimedia.org/wiki/Swift [13:39:43] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 1.402 second response time https://wikitech.wikimedia.org/wiki/Swift [13:39:47] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11119164 (10Jhancock.wm) @Ladsgroup es2039's cable port has been moved and all the nextbox entries have been updated. let us know if you need any further ass... 
[13:39:48] dcausse: if you want, you can probably start your deployment in the meantime [13:39:54] while I run the maintenance scripts [13:40:27] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-fe[1017-1020].eqiad.wmnet [13:40:28] Lucas_WMDE: ok thanks [13:40:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-fe[1017-1020].eqiad.wmnet [13:40:32] Lucas_WMDE: yes, that wikitech entry is The Right Way [13:40:36] yay [13:40:58] aha, starting to see “Would correct edit count” in the output [13:41:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182088 (https://phabricator.wikimedia.org/T373826) (owner: 10DCausse) [13:41:51] (03PS4) 10Cyndywikime: [Growth] enwiki: Deploy "Add a link" to 100% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) [13:41:54] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2151.codfw.wmnet with reason: Maintenance [13:42:00] (03PS1) 10Ayounsi: CI: check style (black + isort) [homer/public] - 10https://gerrit.wikimedia.org/r/1182145 [13:42:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T401906)', diff saved to https://phabricator.wikimedia.org/P81763 and previous config saved to /var/cache/conftool/dbconfig/20250826-134201-fceratto.json [13:42:06] MatmaRex: so far the visible difference is numbers I could count on my fingers fwiw ^^ [13:42:06] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [13:42:07] e.g. from 227353 to 227354 [13:42:11] or one was from 92 to 89 iirc [13:42:14] does that sound right? 
[13:42:22] (03Merged) 10jenkins-bot: Revert "NetworkSession: Only enable for private wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182088 (https://phabricator.wikimedia.org/T373826) (owner: 10DCausse) [13:42:26] (since the task says “doubles their edit count”) [13:42:34] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1052.eqiad.wmnet with reason: host reimage [13:42:43] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:42:45] Lucas_WMDE: yes, there are two reasons [13:42:46] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1182088|Revert "NetworkSession: Only enable for private wikis" (T373826)]] [13:42:51] T373826: NetworkSessionProvider / CirrusSearch Streaming Updater causing 'session' log spam and possibly Sessionstore (Kask) problems - https://phabricator.wikimedia.org/T373826 [13:43:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T401906)', diff saved to https://phabricator.wikimedia.org/P81764 and previous config saved to /var/cache/conftool/dbconfig/20250826-134311-fceratto.json [13:43:13] (03CR) 10Cyndywikime: "This patch is ready for review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime) [13:43:15] Lucas_WMDE: one is that if a user is renamed, then keeps making edits, then the total count won't be off by exactly 2x. 
i think people are often renamed while having very few edits [13:43:15] ok [13:43:42] Lucas_WMDE: the other is that the counters are sometimes just off because they're not properly transactional, so some updates are missed [13:44:09] Lucas_WMDE: and actually there may be a third reason related to the handling of hidden or deleted edits, but i did not dive into that [13:44:13] alright, thanks! [13:44:17] (first script is still running btw) [13:44:25] basically, edit counters are the worst [13:44:28] (and still *mostly* “already correct”) [13:44:37] !bash basically, edit counters are the worst [13:44:37] Lucas_WMDE: Stored quip at https://bash.toolforge.org/quip/mOuf5pgBffdvpiTrn5Fb [13:44:41] (almost !log’ged that lol) [13:44:47] heh [13:45:49] i guess that's a bit slower than i expected. so the second script may take longer to run [13:46:03] yeah feels like that’ll be a “rest of the day” affair [13:46:11] fortunately in k8s land it doesn’t rely on my connection staying alive ;) [13:46:32] > Edit count already correct for 'Previous username' [13:46:33] heh [13:47:18] logspam-watch looks like a promising dropoff for the session warnings btw [13:47:47] yeah last logstash message for that was 13:31:47.979 \o/ [13:48:06] nice [13:48:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1052.eqiad.wmnet with reason: host reimage [13:48:42] !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1182088|Revert "NetworkSession: Only enable for private wikis" (T373826)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:48:47] T373826: NetworkSessionProvider / CirrusSearch Streaming Updater causing 'session' log spam and possibly Sessionstore (Kask) problems - https://phabricator.wikimedia.org/T373826 [13:48:51] (03PS2) 10Ayounsi: CI: check style (black + isort) [homer/public] - 10https://gerrit.wikimedia.org/r/1182145 [13:48:51] (03PS1) 10Ayounsi: Format existing python files using black and isort [homer/public] - 10https://gerrit.wikimedia.org/r/1182147 [13:49:25] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [13:49:37] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11119221 (10Ladsgroup) Thank you. I started the replication and it looks like it's working fine (and quite fast too!!!!) [13:49:57] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:49:57] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:50:38] (03CR) 10Ayounsi: Format existing python files using black and isort (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1182147 (owner: 10Ayounsi) [13:51:27] !log dcausse@deploy1003 dcausse: Continuing with sync [13:53:47] (03CR) 10Volans: "replies inline" [software/homer] - 10https://gerrit.wikimedia.org/r/1180545 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:55:30] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11119259 (10Jhancock.wm) [13:55:44] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402875#11119263 (10Jhancock.wm) 05Open→03Declined [13:56:04] !log mvernon@cumin2002 
END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [13:56:15] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1017.eqiad.wmnet [13:56:16] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1017.eqiad.wmnet [13:56:17] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1017.eqiad.wmnet [13:56:18] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1017.eqiad.wmnet [13:56:19] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1018.eqiad.wmnet [13:56:20] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1018.eqiad.wmnet [13:56:21] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1018.eqiad.wmnet [13:56:21] (03CR) 10Volans: [C:03+1] "LGTM, minor nit inline [no need to re-review]" [homer/public] - 10https://gerrit.wikimedia.org/r/1182145 (owner: 10Ayounsi) [13:56:21] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1018.eqiad.wmnet [13:56:22] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1019.eqiad.wmnet [13:56:23] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1019.eqiad.wmnet [13:56:24] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1019.eqiad.wmnet [13:56:25] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1019.eqiad.wmnet [13:56:26] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1020.eqiad.wmnet [13:56:27] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1020.eqiad.wmnet [13:56:28] !log 
mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1020.eqiad.wmnet [13:56:28] mvernon@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [13:56:29] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1020.eqiad.wmnet [13:56:35] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182088|Revert "NetworkSession: Only enable for private wikis" (T373826)]] (duration: 13m 49s) [13:57:04] (03PS3) 10Vgutierrez: cache::haproxy: Use %rt instead of %ID for log sequence numbers [puppet] - 10https://gerrit.wikimedia.org/r/1180918 (https://phabricator.wikimedia.org/T401383) [13:57:07] T373826: NetworkSessionProvider / CirrusSearch Streaming Updater causing 'session' log spam and possibly Sessionstore (Kask) problems - https://phabricator.wikimedia.org/T373826 [13:57:20] (03CR) 10CI reject: [V:04-1] cache::haproxy: Use %rt instead of %ID for log sequence numbers [puppet] - 10https://gerrit.wikimedia.org/r/1180918 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [13:57:27] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe2017.codfw.wmnet [13:57:28] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe2017.codfw.wmnet [13:57:29] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe2017.codfw.wmnet [13:57:30] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe2017.codfw.wmnet [13:57:31] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe2018.codfw.wmnet [13:57:31] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe2018.codfw.wmnet [13:57:32] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe2018.codfw.wmnet [13:57:33] !log mvernon@cumin2002 conftool action : 
set/pooled=yes; selector: service=nginx,name=ms-fe2018.codfw.wmnet [13:57:34] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe2019.codfw.wmnet [13:57:35] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe2019.codfw.wmnet [13:57:36] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe2019.codfw.wmnet [13:57:37] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe2019.codfw.wmnet [13:57:38] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe2020.codfw.wmnet [13:57:38] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe2020.codfw.wmnet [13:57:39] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe2020.codfw.wmnet [13:57:40] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe2020.codfw.wmnet [13:57:46] (03PS1) 10Tiziano Fogli: nrpewrapper: correlate Prometheus "for:" duration with Icinga timing [puppet] - 10https://gerrit.wikimedia.org/r/1182148 (https://phabricator.wikimedia.org/T395446) [13:57:58] “Would correct edit count for 'Martin Urbanec': from 194067 to 194075” oh hey I know that name [13:57:58] (03CR) 10Volans: [C:03+1] "LGTM, nit inline" [homer/public] - 10https://gerrit.wikimedia.org/r/1182147 (owner: 10Ayounsi) [13:58:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P81765 and previous config saved to /var/cache/conftool/dbconfig/20250826-135818-fceratto.json [13:58:58] Lucas_WMDE: I'm done with my deploy [13:59:15] kask latencies look ok to me [13:59:30] !log UTC afternoon backport+config window done (maintenance scripts are ongoing and will probably take a while longer) [13:59:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:34] alright, thanks! [13:59:59] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11119288 (10Vgutierrez) >>! In T400119#11119143, @Urbanecm_WMF wrote: >> Our goal is to block all traffic from unidentified clients and not coming from aut... [14:02:07] (03PS4) 10Vgutierrez: cache::haproxy: Use %rt instead of %ID for log sequence numbers [puppet] - 10https://gerrit.wikimedia.org/r/1180918 (https://phabricator.wikimedia.org/T401383) [14:02:16] I have no idea how to judge how much progress the script has made btw [14:02:25] I can’t find any rename log entries for the user names it’s printing [14:02:47] (are rename log entries logged with the *old* user name as the target, and that’s why I’m finding nothing when I search for target User:Newname?) [14:03:15] (03Abandoned) 10Tiziano Fogli: kafka: port mirror maker alerts from icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1077986 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [14:04:07] (03CR) 10Scott French: [C:03+2] P:etcd::tlsproxy: fix notify behavior for PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/1164264 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [14:04:25] Lucas_WMDE: yes [14:04:34] ok [14:04:48] (03PS3) 10Ayounsi: CI: check style (black + isort) [homer/public] - 10https://gerrit.wikimedia.org/r/1182145 [14:04:49] (03PS2) 10Ayounsi: Format existing python files using black and isort [homer/public] - 10https://gerrit.wikimedia.org/r/1182147 [14:08:22] !log starting etcd cfssl-PKI migration in codfw - T352245 [14:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:27] PROBLEM - Host ms-be2081 is DOWN: PING CRITICAL - Packet loss = 100% [14:08:27] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 
[14:09:16] (03CR) 10Ayounsi: "yup 😊" [homer/public] - 10https://gerrit.wikimedia.org/r/1182145 (owner: 10Ayounsi) [14:09:20] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11119329 (10elukey) Back from holidays, I've ran the diff-testing tool again and the results are way better (sample-size 300): {F65915510} [14:09:21] (03CR) 10Ayounsi: [C:03+2] CI: check style (black + isort) [homer/public] - 10https://gerrit.wikimedia.org/r/1182145 (owner: 10Ayounsi) [14:09:23] (03CR) 10Scott French: [C:03+2] hieradata: pilot cfssl/pki for nginx on conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/1164298 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [14:09:31] (03CR) 10Volans: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1182145 (owner: 10Ayounsi) [14:09:31] MatmaRex: ok looks like we’re still in 2017 https://meta.wikimedia.org/w/index.php?title=Special:Log&logid=22950346 [14:09:46] if it is indeed going through chronologically [14:10:13] which also suggests the mismatches seen so far are probably just un-transactional-ness and not the bug you said was introduced in 202x [14:10:40] (03Merged) 10jenkins-bot: CI: check style (black + isort) [homer/public] - 10https://gerrit.wikimedia.org/r/1182145 (owner: 10Ayounsi) [14:10:46] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:11:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:11:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1052.eqiad.wmnet with OS bullseye [14:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:11:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:11:52] (03CR) 10Ayounsi: [C:03+2] Format existing python files using black and isort (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1182147 (owner: 10Ayounsi) [14:12:06] thanks Lucas_WMDE. i guess i underestimated the number of renames we've done in the early years [14:12:41] i could probably make the scripts a bit faster with better queries, buuuut it's probably faster overall to just let them finish [14:13:02] yeah, I don’t think it’s a huge problem [14:13:03] thanks for running them. i'm away for a bit now [14:13:05] (03Merged) 10jenkins-bot: Format existing python files using black and isort [homer/public] - 10https://gerrit.wikimedia.org/r/1182147 (owner: 10Ayounsi) [14:13:12] and once this one finishes you have a better estimate for whoever will do the non-dry run ^^ [14:13:15] ok cya [14:13:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P81767 and previous config saved to /var/cache/conftool/dbconfig/20250826-141325-fceratto.json [14:13:51] PROBLEM - etcd tlsproxy SSL conf2006.codfw.wmnet:4001 on conf2006 is CRITICAL: SSL CRITICAL - Certificate etcd-v3.codfw.wmnet valid until 2025-09-23 14:07:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cergen [14:14:13] ^^ swfrench-wmf expected? :) [14:14:27] vgutierrez: yes, thank you! 
[14:15:14] there's a race between when the expiry check reflects the newly configured threshold and when the cert changes [14:15:29] I wanted to make sure it "works" on the first one, but plan to downtime the other hosts we do [14:15:45] or rather, downtime the *service* on the other hosts we do [14:17:43] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:19:14] RECOVERY - Host ms-be2081 is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms [14:19:16] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:19:46] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:22:51] 06SRE, 06cloud-services-team, 10Cloud-VPS, 10DNS, 06Traffic: PDNS in cloud can return inconsistent answers - https://phabricator.wikimedia.org/T281700#11119387 (10ssingh) 05Open→03Resolved a:03ssingh Some quick notes: - We are running `pdns-recursor` 4.8 in production, with an upgrade to 5 in... [14:23:07] (03PS19) 10Arnaudb: gerrit: mod qos configuration [puppet] - 10https://gerrit.wikimedia.org/r/1181124 (https://phabricator.wikimedia.org/T402611) [14:23:27] moritzm: alright, everything looks good on conf2006. please go ahead with the restart there. 
[14:23:48] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [14:24:25] (03CR) 10Arnaudb: [C:03+2] gerrit: mod qos configuration [puppet] - 10https://gerrit.wikimedia.org/r/1181124 (https://phabricator.wikimedia.org/T402611) (owner: 10Arnaudb) [14:25:10] swfrench-wmf: and done, all nginx worker threads on 2006 are running the new binary [14:25:25] moritzm: great, thank you1 [14:25:41] vgutierrez: that will most likely have dropped that open connection [14:25:44] PROBLEM - Host ms-be2082 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:26:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:26:55] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [14:27:19] (03PS1) 10Arnaudb: Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1182156 [14:28:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T401906)', diff saved to https://phabricator.wikimedia.org/P81768 and previous config saved to /var/cache/conftool/dbconfig/20250826-142833-fceratto.json [14:28:34] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1180918 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [14:28:38] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [14:28:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2158.codfw.wmnet with reason: Maintenance [14:28:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T401906)', diff saved to https://phabricator.wikimedia.org/P81769 and previous config saved to 
/var/cache/conftool/dbconfig/20250826-142857-fceratto.json [14:29:10] . /usr/local/bin/mwscript: line 124: 8 Killed [14:29:11] wat [14:29:18] Hmm [14:29:23] Lucas_WMDE: job id? [14:29:25] did it just run for too long / too much output? [14:29:37] Too long no, possibly OOM? [14:29:37] not sure how to find the job id [14:29:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:29:44] ok no worries [14:29:45] comment should be T313900 (dry run) [14:29:45] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [14:30:04] (03CR) 10Arnaudb: [C:03+2] Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1182156 (owner: 10Arnaudb) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T1430) [14:30:23] swfrench-wmf: I so, on lvs4009 it impacted the upload-httpslb_443 and upload-httpslb6_443 control planes, those two got restarted and they picked conf2006 again [14:30:41] vgutierrez@lvs4009:~$ netstat |grep conf2006 [14:30:42] tcp6 0 0 lvs4009.ulsfo.wmn:38534 conf2006.codfw.wmn:4001 ESTABLISHED [14:30:42] tcp6 0 0 lvs4009.ulsfo.wmn:38524 conf2006.codfw.wmn:4001 ESTABLISHED [14:30:43] claime: mw-script.eqiad.ieyltg2e I believe [14:30:43] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1052 [14:31:08] Warning BackoffLimitExceeded 109s job-controller Job has reached the specified backoff limit [14:31:09] hmm [14:31:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T401906)', diff saved to https://phabricator.wikimedia.org/P81770 and previous config saved to /var/cache/conftool/dbconfig/20250826-143109-fceratto.json [14:31:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
cloudcephosd1052 [14:32:02] Reason: OOMKilled [14:32:08] error: cannot exec into a container in a completed pod; current phase is Failed [14:32:08] oh no [14:32:22] So it hit 1GB ram and got oomkilled [14:32:43] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:32:51] vgutierrez: thanks for confirming! although the restarts are kind of a pain, it's nice to see them force new TLS connections (and thus positively verify that the new cert is accepted) [14:33:09] well, I’ll leave a comment on the task and MatmaRex can look into it later [14:33:12] thanks claime [14:33:28] Lucas_WMDE: Send me the task, I'll add the pod status and a couple graphs [14:33:46] oh huh [14:34:03] claime: https://phabricator.wikimedia.org/T313900 [14:34:12] RECOVERY - Host ms-be2082 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [14:34:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:34:30] ty [14:34:52] vgutierrez: the next host is conf2004, which is the codfw pybal host. I'll me monitoring pybal logs, for both steps (both the certificate change and the restart). let me know if there's anything you'd like me to confirm beyond that. 
[14:34:56] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:36:11] swfrench-wmf: hmm usually for pybal we switch it to another host and restart pybal [14:36:18] RECOVERY - etcd tlsproxy SSL conf2006.codfw.wmnet:4001 on conf2006 is OK: SSL OK - Certificate etcd-v3.codfw.wmnet valid until 2025-09-23 14:07:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/PKI [14:36:31] swfrench-wmf: so I'd suggest swithching pybal to conf2006 [14:36:36] *switching even [14:37:10] ^ recovery expected given the puppet run on alert1002 that just happened [14:37:41] MatmaRex: I expect in this case we shouldn’t try the other maintenance script yet? or do you think it has a better chance of success? [14:37:43] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:38:02] PROBLEM - Host thanos-be2005 is DOWN: PING CRITICAL - Packet loss = 100% [14:38:22] I wouldn’t mind running it (the worst case scenario seems pretty harmless) but we can also save ourselves the CPU time if the expected result is the same ^^ [14:38:34] vgutierrez: ah, alright - the plan we'd put together [0] called for pybal restarts, but not moving them away from conf2004. we can change that if you think that's a better option. [14:38:35] [0] https://phabricator.wikimedia.org/T352245#10935894 [14:39:16] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:39:30] swfrench-wmf: yeah.. 
it should be fine [14:39:46] worse case scenario pybal will ignore etcd changes till it's restarted [14:39:49] (last famous words) [14:39:50] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 3.464 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:39:51] Lucas_WMDE: that depends on what's causing the memory use, and i don't know that yet. there's a good chance that the other script will work, e.g. if the problem is somewhere in the edit counting code, since that one doesn' [14:39:56] t interact with that [14:40:06] ok [14:40:28] claime: is it okay if I try running the other maintenance script then? [14:40:29] vgutierrez: cool, sounds good. I'll follow up here when the first step (certificate change) happens. [14:40:36] Lucas_WMDE: Sure [14:40:38] (I’m assuming an OOM in k8s is harmless outside that particular job) [14:40:39] ok [14:40:49] At worst it'll get killed the same when it hits 1GB :p [14:41:02] Yeah it is harmless [14:42:01] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: foreachwikiindblist sul CentralAuth:FixRenameUserLocalLogs --logwiki=metawiki # T398177 (dry run) [14:42:06] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [14:43:09] (03CR) 10Scott French: [C:03+2] hieradata: use cfssl/pki for nginx on all codfw configcluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1090585 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [14:43:13] ok I think this one died basically immediately [14:44:06] This one is a script error [14:44:11] Terminated 255 [14:44:32] aawiki [6527b4d902563e7d75f3cc40] [no req] Error: Call to a member function getDBkey() on null [14:45:36] yeah I pasted it in the task [14:45:37] huh. 
thanks [14:45:38] (https://phabricator.wikimedia.org/T398177) [14:46:05] i handled a dozen different error conditions, but i did not think that the username in the log could be invalid… [14:46:14] I wasn’t sure if it would keep running beyond aawiki but it doesn’t look like it [14:46:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P81772 and previous config saved to /var/cache/conftool/dbconfig/20250826-144616-fceratto.json [14:46:44] it wouldn't work on other wikis either, i think [14:47:02] anyway. i'll try to debug these, and get back to you tomorrow ;) thanks [14:47:08] sounds good :) [14:47:28] and thanks claime for jumping in :) [14:47:34] np :) [14:48:04] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Use %rt instead of %ID for log sequence numbers [puppet] - 10https://gerrit.wikimedia.org/r/1180918 (https://phabricator.wikimedia.org/T401383) (owner: 10Vgutierrez) [14:48:33] vgutierrez: moritzm: apologies for the delay. ISP issues ... back now. [14:48:47] no problem [14:50:23] vgutierrez: but, following up, if you do have particular concerns about the nginx restart vs. pybal, we can definitely separate that from the cfssl changes. [14:50:47] (03CR) 10Herron: [C:03+1] opensearch: selectively enable cluster health check [puppet] - 10https://gerrit.wikimedia.org/r/1181791 (https://phabricator.wikimedia.org/T321808) (owner: 10Cwhite) [14:50:54] (03PS4) 10Krinkle: varnish: Improve GeoIP to use cookie domain similar to prod [puppet] - 10https://gerrit.wikimedia.org/r/1168038 (https://phabricator.wikimedia.org/T99226) [14:50:54] as soon as you don't delay those restarts too much nope [14:50:57] *as long [14:52:11] (03CR) 10Krinkle: "Firefox stable now includes the PSL update to .beta.wmcloud.org, which means GeoIP cookies are invalid in the Beta Cluster today." 
[puppet] - 10https://gerrit.wikimedia.org/r/1168038 (https://phabricator.wikimedia.org/T99226) (owner: 10Krinkle) [14:52:12] vgutierrez: sounds good. I think we're ready for that now. [14:52:28] PROBLEM - etcd tlsproxy SSL conf2004.codfw.wmnet:4001 on conf2004 is CRITICAL: SSL CRITICAL - Certificate etcd-v3.codfw.wmnet valid until 2025-09-23 14:45:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cergen [14:52:38] ^ expected [14:53:07] moritzm: you should be good to start nginx on conf2004 now [14:53:34] RECOVERY - Host thanos-be2005 is UP: PING WARNING - Packet loss = 77%, RTA = 30.20 ms [14:53:45] done, all worker threads are reloaded for the new binary [14:53:45] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1182148 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [14:54:04] moritzm: ack, thanks! pybals I'm watching look happy so far [14:54:21] 2005 next or one of the eqiad ones first? [14:54:24] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:54:28] RECOVERY - etcd tlsproxy SSL conf2004.codfw.wmnet:4001 on conf2004 is OK: SSL OK - Certificate etcd-v3.codfw.wmnet valid until 2025-09-23 14:45:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/PKI [14:54:39] moritzm: 2005 will be next. I'll let you know when. [14:54:42] ok! 
[14:54:52] lvs2011 reconnecting as expected https://www.irccloud.com/pastebin/idDTipw0/ [14:55:03] look at that, pybal handled it better than I expected :) [14:55:09] :) [14:55:18] /31 [14:55:22] we got alerts for a mismatch of etcd connections [14:55:33] claime: please provide your local root password to continue [14:55:51] vgutierrez: ******* [14:56:12] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11119531 (10Jhancock.wm) ms-be2081 is complete ms-be2082 is complete thanos-be2005 is complete. @MatthewVernon let us know when you are ready for anothe... [14:56:12] claime: wrong password, it's too short [14:56:55] vgutierrez: those alerts were just transient and converged after connections were back up, right? [14:56:56] (03CR) 10Herron: [C:03+1] "nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [14:57:03] (i.e., that's what I see in icinga) [14:57:15] !log arnaudb@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on phab2002.codfw.wmnet,phab[1004-1005].eqiad.wmnet with reason: T402930 [14:57:20] T402930: Deploy Phabricator/Phorge 2025-08-26 - https://phabricator.wikimedia.org/T402930 [14:57:35] (03CR) 10Herron: [C:03+1] nrpewrapper: correlate Prometheus "for:" duration with Icinga timing [puppet] - 10https://gerrit.wikimedia.org/r/1182148 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [14:57:46] swfrench-wmf: yes.. 
we got a pybal alert that checks that the number of connections against confd servers on port 4001 matches the number of configured services [14:58:10] poorman o11y to double check that pybal doesn't lose track of any configured service [14:58:17] vgutierrez: I thought there was a global exception to password policies for hunter2 [14:59:23] vgutierrez: heh, indeed - I recall that one takes a little while to converge (e.g., after a pybal restart) [15:00:05] conf2005 is now done [15:00:05] jelto, arnoldokoth, and mutante: I, the Bot under the Fountain, call upon thee, The Deployer, to do SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T1500). [15:00:40] moritzm: you should be good to do conf2005 now. [15:00:59] vgutierrez: ^ conf2005 restart will happen momentarily [15:01:07] ack [15:01:15] I'll follow up with confd restarts etc. and we can call it a day for today :) [15:01:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P81773 and previous config saved to /var/cache/conftool/dbconfig/20250826-150123-fceratto.json [15:01:25] i.e., codfw is fully migrated [15:01:36] PROBLEM - etcd tlsproxy SSL conf2005.codfw.wmnet:4001 on conf2005 is CRITICAL: SSL CRITICAL - Certificate etcd-v3.codfw.wmnet valid until 2025-09-23 14:55:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cergen [15:01:45] ^ expected - will resolve shortly [15:02:11] swfrench-wmf: done, all nginx worker threads refreshed on 2005 [15:02:17] moritzm: ack, thanks!
[15:02:27] !log brennen@deploy1003 Started deploy [phabricator/deployment@27d2f0b]: deploy phab2002 for T402930 [15:02:38] T402930: Deploy Phabricator/Phorge 2025-08-26 - https://phabricator.wikimedia.org/T402930 [15:03:01] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2081.codfw.wmnet with OS bullseye [15:03:09] !log brennen@deploy1003 Finished deploy [phabricator/deployment@27d2f0b]: deploy phab2002 for T402930 (duration: 00m 42s) [15:03:12] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11119559 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2081.codfw.wmnet with OS bullseye [15:03:29] !log brennen@deploy1003 Started deploy [phabricator/deployment@27d2f0b]: deploy phab1004 for T402930 [15:03:34] (03CR) 10Krinkle: "With the change applied, it seems to now work in my limited testing:" [puppet] - 10https://gerrit.wikimedia.org/r/1168038 (https://phabricator.wikimedia.org/T99226) (owner: 10Krinkle) [15:04:07] !log brennen@deploy1003 Finished deploy [phabricator/deployment@27d2f0b]: deploy phab1004 for T402930 (duration: 00m 38s) [15:04:20] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 5.222 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - 
https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:04:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye [15:04:50] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11119566 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye [15:04:56] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:05:36] RECOVERY - etcd tlsproxy SSL conf2005.codfw.wmnet:4001 on conf2005 is OK: SSL OK - Certificate etcd-v3.codfw.wmnet valid until 2025-09-23 14:55:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/PKI [15:05:54] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye [15:06:03] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11119571 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye [15:06:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11119572 (10MatthewVernon) [15:07:12] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11119574 (10MatthewVernon) @Jhancock.wm Thanks! I'll ping you when the next ones are ready (but think 2-3 weeks from now most likely). 
[15:07:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:08:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:32] (03PS42) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [15:12:40] !log finished etcd cfssl-PKI migration in codfw - T352245 [15:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:45] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [15:14:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [15:16:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T401906)', diff saved to https://phabricator.wikimedia.org/P81774 and previous config saved to /var/cache/conftool/dbconfig/20250826-151630-fceratto.json [15:16:36] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [15:16:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2169.codfw.wmnet with reason: Maintenance [15:16:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T401906)', diff saved to 
https://phabricator.wikimedia.org/P81775 and previous config saved to /var/cache/conftool/dbconfig/20250826-151653-fceratto.json [15:17:43] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2081.codfw.wmnet with reason: host reimage [15:18:00] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [15:18:29] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2005.codfw.wmnet with reason: host reimage [15:19:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T401906)', diff saved to https://phabricator.wikimedia.org/P81776 and previous config saved to /var/cache/conftool/dbconfig/20250826-151905-fceratto.json [15:19:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:22:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2081.codfw.wmnet with reason: host reimage [15:25:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [15:26:44] (03PS1) 10Muehlenhoff: Remove install4002/5002/6002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1182162 (https://phabricator.wikimedia.org/T396487) [15:28:20] (03PS1) 10BCornwall: slo_template: update SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1182163 [15:28:39] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:28:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2005.codfw.wmnet with reason: 
host reimage [15:28:54] !log finished restart of all codfw-associated confds - T352245 [15:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:58] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [15:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:29:35] (03CR) 10Ssingh: [C:03+1] "Thank you!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1182163 (owner: 10BCornwall) [15:29:44] (03PS1) 10Ayounsi: Add black commit to .git-blame-ignore-revs [homer/public] - 10https://gerrit.wikimedia.org/r/1182164 [15:29:47] (03PS1) 10Muehlenhoff: homer: Update DHCP server in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1182165 (https://phabricator.wikimedia.org/T396487) [15:29:55] (03CR) 10BCornwall: [V:03+2 C:03+2] slo_template: update SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1182163 (owner: 10BCornwall) [15:30:23] 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11119711 (10Jclark-ctr) a:05Jclark-ctr→03None [15:30:39] (03PS1) 10Muehlenhoff: Point webproxy in codfw to install2005 [dns] - 10https://gerrit.wikimedia.org/r/1182166 (https://phabricator.wikimedia.org/T396487) [15:30:42] (03PS1) 10Muehlenhoff: Assign installserver role to install2005 [puppet] - 10https://gerrit.wikimedia.org/r/1182167 (https://phabricator.wikimedia.org/T396487) [15:30:43] (03PS1) 10Muehlenhoff: Update DHCP server in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1182168 (https://phabricator.wikimedia.org/T396487) [15:31:13] (03PS4) 10Krinkle: varnish: Remove legacy 
`^(lge?|sie|nec|sgh|pg)` mobile regex [puppet] - 10https://gerrit.wikimedia.org/r/1180228 (https://phabricator.wikimedia.org/T401595) [15:31:30] (03CR) 10Krinkle: "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1180228 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [15:32:58] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task ssw1-d1-eqiad - https://phabricator.wikimedia.org/T401238#11119746 (10Jclark-ctr) 05Open→03Resolved [15:34:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P81777 and previous config saved to /var/cache/conftool/dbconfig/20250826-153412-fceratto.json [15:34:22] (03PS1) 10Muehlenhoff: Update the proxied used by cloudcumin to install2005 [puppet] - 10https://gerrit.wikimedia.org/r/1182170 (https://phabricator.wikimedia.org/T396487) [15:34:51] (03CR) 10Muehlenhoff: [C:03+2] Remove install4002/5002/6002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1182162 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [15:36:40] 06SRE, 06cloud-services-team, 06DC-Ops: cloudcephosd10[48-52] service implementation - https://phabricator.wikimedia.org/T395910#11119794 (10Andrew) [15:36:46] (03PS1) 10Andrew Bogott: Make cloudcephosd1052 an osd node [puppet] - 10https://gerrit.wikimedia.org/r/1182172 (https://phabricator.wikimedia.org/T395910) [15:37:36] (03PS2) 10Muehlenhoff: Update the proxies used by cloudcumin to install2005 [puppet] - 10https://gerrit.wikimedia.org/r/1182170 (https://phabricator.wikimedia.org/T396487) [15:37:57] (03CR) 10Andrew Bogott: [C:03+2] Make cloudcephosd1052 an osd node [puppet] - 10https://gerrit.wikimedia.org/r/1182172 (https://phabricator.wikimedia.org/T395910) (owner: 10Andrew Bogott) [15:40:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2081.codfw.wmnet with OS bullseye [15:40:35] 10ops-codfw, 06SRE, 10SRE-swift-storage, 
06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11119823 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2081.codfw.wmnet with OS bullseye complete... [15:43:49] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:43:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2082.codfw.wmnet with OS bullseye [15:44:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11119864 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye complete... [15:46:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2005.codfw.wmnet with OS bullseye [15:46:18] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11119883 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye comp... 
[15:48:49] FIRING: [2x] PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:49:16] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.824 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:49:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P81778 and previous config saved to /var/cache/conftool/dbconfig/20250826-154920-fceratto.json [15:49:21] (03CR) 10FNegri: [C:03+1] Update the proxies used by cloudcumin to install2005 [puppet] - 10https://gerrit.wikimedia.org/r/1182170 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [15:49:46] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.214 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:49:56] something is wrong with the git pull for the homer repo on cumin1003 [15:50:11] '/usr/bin/git pull --quiet' returned 1 instead of one of [0] [15:50:36] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11119921 (10thcipriani) >>! In T400119#11089795, @Joe wrote: >>>! In T400119#11086977, @bd808 wrote: >>>>! In T400119#11084530, @Samwilson wrote: >>> ~~Wil... [15:51:16] sukhe: root@cumin1002.eqiad.wmnet: Permission denied (publickey). 
[15:51:31] (03PS1) 10MVernon: swift: re-add 3 codfw hosts, drain the next 3 [puppet] - 10https://gerrit.wikimedia.org/r/1182174 (https://phabricator.wikimedia.org/T354872) [15:52:00] (03PS1) 10Andrew Bogott: Increase contrack table size on cloudcephosds [puppet] - 10https://gerrit.wikimedia.org/r/1182175 (https://phabricator.wikimedia.org/T402480) [15:52:37] (03PS2) 10Andrew Bogott: Increase contrack table size on cloudcephosds [puppet] - 10https://gerrit.wikimedia.org/r/1182175 (https://phabricator.wikimedia.org/T402480) [15:52:42] ah no sorry [15:52:50] (03PS3) 10Andrew Bogott: Increase contrack table size on cloudcephosds [puppet] - 10https://gerrit.wikimedia.org/r/1182175 (https://phabricator.wikimedia.org/T402480) [15:53:10] (03PS4) 10Andrew Bogott: Increase contrack table size on cloudcephosds and mons [puppet] - 10https://gerrit.wikimedia.org/r/1182175 (https://phabricator.wikimedia.org/T402480) [15:53:13] sukhe: which repo? private, public, etc.. [15:53:15] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182175 (https://phabricator.wikimedia.org/T402480) (owner: 10Andrew Bogott) [15:53:38] (03CR) 10Ayounsi: [C:03+2] Add black commit to .git-blame-ignore-revs [homer/public] - 10https://gerrit.wikimedia.org/r/1182164 (owner: 10Ayounsi) [15:53:39] volans: sorry in a meeting [15:53:44] I only saw https://puppetboard.wikimedia.org/node/cumin1003.eqiad.wmnet because of the report earlier [15:53:45] !lof installing shadow security updates [15:53:49] FIRING: [3x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:54:39] sukhe: thx will take care of it [15:54:46] <3 [15:54:56] (03Merged) 10jenkins-bot: Add black commit to .git-blame-ignore-revs [homer/public] - 10https://gerrit.wikimedia.org/r/1182164 (owner: 10Ayounsi) [15:55:39] (03CR) 10David Caro: 
Increase contrack table size on cloudcephosds and mons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182175 (https://phabricator.wikimedia.org/T402480) (owner: 10Andrew Bogott) [15:57:06] (03PS5) 10Andrew Bogott: Increase contrack table size on cloudcephosds and mons [puppet] - 10https://gerrit.wikimedia.org/r/1182175 (https://phabricator.wikimedia.org/T402480) [15:57:14] (03CR) 10Andrew Bogott: Increase contrack table size on cloudcephosds and mons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182175 (https://phabricator.wikimedia.org/T402480) (owner: 10Andrew Bogott) [15:57:51] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182175 (https://phabricator.wikimedia.org/T402480) (owner: 10Andrew Bogott) [15:59:55] (03CR) 10Clare Ming: [C:03+2] xLab: Deploy v0.8.4 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182112 (https://phabricator.wikimedia.org/T380592) (owner: 10Santiago Faci) [16:00:05] jhathaway and moritzm: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:43] sukhe: all cumin hosts fixed, puppet runs happily now, thanks for noticing [16:00:53] <3 [16:00:58] (03PS5) 10Krinkle: Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595) [16:01:06] thanks -- I know you care about keeping them healthy and hence the ping :) [16:01:30] (03Merged) 10jenkins-bot: xLab: Deploy v0.8.4 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182112 (https://phabricator.wikimedia.org/T380592) (owner: 10Santiago Faci) [16:01:49] (03PS2) 10JHathaway: provision: poll for reboot via Redfish [cookbooks] - 10https://gerrit.wikimedia.org/r/1181795 [16:03:36] 06SRE, 06cloud-services-team, 06DC-Ops, 13Patch-For-Review: cloudcephosd10[48-52] service implementation - https://phabricator.wikimedia.org/T395910#11120025 (10Andrew) [16:03:41] (03CR) 10JHathaway: provision: poll for reboot via Redfish (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1181795 (owner: 10JHathaway) [16:03:49] FIRING: [3x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:04:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T401906)', diff saved to https://phabricator.wikimedia.org/P81779 and previous config saved to /var/cache/conftool/dbconfig/20250826-160427-fceratto.json [16:04:34] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [16:04:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2180.codfw.wmnet with reason: Maintenance [16:04:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T401906)', diff saved to 
https://phabricator.wikimedia.org/P81780 and previous config saved to /var/cache/conftool/dbconfig/20250826-160451-fceratto.json [16:06:46] (03CR) 10David Caro: [C:03+1] "LGTM btw. once passing the tests" [puppet] - 10https://gerrit.wikimedia.org/r/1182175 (https://phabricator.wikimedia.org/T402480) (owner: 10Andrew Bogott) [16:07:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T401906)', diff saved to https://phabricator.wikimedia.org/P81781 and previous config saved to /var/cache/conftool/dbconfig/20250826-160703-fceratto.json [16:08:07] 06SRE, 10envoy, 06serviceops, 06Traffic, 13Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11120051 (10MoritzMuehlenhoff) >>! In T402584#11115783, @RLazarus wrote: >>>! In T402584#11113754, @MoritzMuehlenhoff wrote: >> We also have 237 baremetal host... [16:11:29] (03CR) 10Ayounsi: [C:03+1] Update DHCP server in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1182168 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [16:11:34] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw (T352245) [16:11:39] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [16:12:02] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:12:08] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw (T352245) [16:14:19] ^ that `PyBal backends health check` (`k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled`) will likely turn critical again shortly (expected) [16:14:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring 
[16:14:56] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:15:21] ^ expected [16:16:01] (03PS6) 10Krinkle: Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595) [16:16:57] (03CR) 10Dreamy Jazz: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:17:09] !log installing libxslt security updates [16:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:24] (03CR) 10Dreamy Jazz: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:17:26] (03CR) 10Andrew Bogott: [C:04-1] "Cathal says "I don't think upping the limit or messing with the default conntrack timers is what is needed, it looks to me like one of tho" [puppet] - 10https://gerrit.wikimedia.org/r/1182175 (https://phabricator.wikimedia.org/T402480) (owner: 10Andrew Bogott) [16:18:23] (03PS9) 10Krinkle: [WIP] varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1180969 [16:19:45] !log phabricator - added FCeratto-WMF to acl*sre-team [16:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:56] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:21:10] (03CR) 10Dreamy Jazz: "Like 9f482e62a36a50d99e2cb2d3934ef31ce9d31036 we also need to add a line to `maintenance.pp` and `periodic_jobs.pp`, as otherwise this won" [puppet] - 
10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:22:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P81782 and previous config saved to /var/cache/conftool/dbconfig/20250826-162211-fceratto.json [16:22:55] (03PS1) 10MVernon: thanos - put thanos-be2005 back into rings [puppet] - 10https://gerrit.wikimedia.org/r/1182182 (https://phabricator.wikimedia.org/T400876) [16:23:12] 06SRE, 10SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779#11120168 (10FCeratto-WMF) [16:23:15] (03CR) 10Dreamy Jazz: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:23:49] FIRING: [2x] PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:24:19] (03CR) 10Dreamy Jazz: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:24:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 9.244 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:25:21] (03CR) 10BCornwall: [C:03+1] varnish: Remove legacy `^(lge?|sie|nec|sgh|pg)` mobile regex [puppet] - 10https://gerrit.wikimedia.org/r/1180228 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:26:06] (03CR) 10Dreamy Jazz: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:26:40] 
(03CR) 10BCornwall: [C:03+1] Point webproxy in codfw to install2005 [dns] - 10https://gerrit.wikimedia.org/r/1182166 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [16:28:05] (03CR) 10Dreamy Jazz: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:28:49] RESOLVED: [2x] PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:29:15] (03CR) 10BCornwall: [C:03+2] varnish: Remove legacy `^(lge?|sie|nec|sgh|pg)` mobile regex [puppet] - 10https://gerrit.wikimedia.org/r/1180228 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:29:46] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.357 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:29:49] (03CR) 10Dreamy Jazz: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:30:21] (03CR) 10Dreamy Jazz: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:30:29] (03CR) 10BCornwall: [V:03+1 C:03+2] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6751/co" [puppet] - 10https://gerrit.wikimedia.org/r/1180228 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:31:50] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:32:04] (03CR) 10Andrew Bogott: [C:04-1] "counterpoint: ceph docs suggest increasing it https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-osd/" [puppet] - 10https://gerrit.wikimedia.org/r/1182175 (https://phabricator.wikimedia.org/T402480) (owner: 10Andrew Bogott) [16:37:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P81783 and previous config saved to /var/cache/conftool/dbconfig/20250826-163718-fceratto.json [16:38:08] (03PS1) 10Papaul: Add bgg on mr1-ulsfo and temporary remove repalce ospf [homer/public] - 10https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) [16:39:26] (03CR) 10CI reject: [V:04-1] Add bgg on mr1-ulsfo and temporary remove repalce ospf [homer/public] - 10https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [16:39:44] (03PS2) 10Papaul: Add BGP on mr1-ulsfo and temporary remove repalce ospf [homer/public] - 10https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) [16:41:02] (03CR) 10CI reject: [V:04-1] Add BGP on mr1-ulsfo and temporary remove repalce ospf [homer/public] - 10https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [16:41:50] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:47:18] (03PS3) 10Papaul: Add BGP on mr1-ulsfo and temporary remove repalce ospf [homer/public] - 10https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) [16:47:19] 06SRE, 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11120289 (10FCeratto-WMF) @Miriam thanks! 
[16:47:50] 06SRE, 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11120290 (10FCeratto-WMF) [16:48:39] (03CR) 10CI reject: [V:04-1] Add BGP on mr1-ulsfo and temporary remove repalce ospf [homer/public] - 10https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [16:51:01] (03CR) 10BCornwall: [V:03+2 C:03+2] "0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1180228 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:52:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T401906)', diff saved to https://phabricator.wikimedia.org/P81784 and previous config saved to /var/cache/conftool/dbconfig/20250826-165226-fceratto.json [16:52:28] (03PS4) 10Papaul: Add BGP on mr1-ulsfo and temporary remove repalce ospf [homer/public] - 10https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) [16:52:31] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [16:52:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2193.codfw.wmnet with reason: Maintenance [16:52:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T401906)', diff saved to https://phabricator.wikimedia.org/P81785 and previous config saved to /var/cache/conftool/dbconfig/20250826-165248-fceratto.json [16:53:01] 06SRE, 10SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779#11120338 (10FCeratto-WMF) [16:53:20] (03PS5) 10Papaul: Add BGP on mr1-ulsfo and temporary remove replace ospf [homer/public] - 10https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) [16:54:48] (03CR) 10CI reject: [V:04-1] Add BGP on mr1-ulsfo and temporary remove 
replace ospf [homer/public] - 10https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [16:55:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T401906)', diff saved to https://phabricator.wikimedia.org/P81786 and previous config saved to /var/cache/conftool/dbconfig/20250826-165501-fceratto.json [16:55:06] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:55:56] (03PS5) 10Cyndywikime: [Growth] enwiki: Deploy "Add a link" to 100% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) [16:56:14] (03PS1) 10Ebernhardson: cirrus: Reduce galleries weight in search on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182186 (https://phabricator.wikimedia.org/T401590) [16:59:54] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2042.codfw.wmnet [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T1700) [17:00:09] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cp2042.codfw.wmnet [17:02:59] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp2042.codfw.wmnet [17:03:02] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cp2042.codfw.wmnet [17:03:21] o/ [17:03:39] I'll be deploying a mediawiki-config change as part of the infra window in a few minutes [17:03:40] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1182187 [17:05:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171703 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [17:06:26] (03Merged) 10jenkins-bot: image-suggestion: cleanup unused refs to service listener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171703 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [17:06:51] !log swfrench@deploy1003 Started scap sync-world: Backport for [[gerrit:1171703|image-suggestion: cleanup unused refs to service listener (T368096)]] [17:06:56] T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096 [17:09:44] (03PS1) 10Andrew Bogott: Keystone hooks: speed up domain creation [puppet] - 10https://gerrit.wikimedia.org/r/1182188 (https://phabricator.wikimedia.org/T398712) [17:09:45] (03PS1) 10Andrew Bogott: wmfkeystonehooks: format with Black [puppet] - 10https://gerrit.wikimedia.org/r/1182189 [17:09:46] (03PS1) 10Andrew Bogott: designatemakedomain.py: format with Black [puppet] - 10https://gerrit.wikimedia.org/r/1182190 [17:10:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P81787 and previous config saved to /var/cache/conftool/dbconfig/20250826-171008-fceratto.json [17:12:53] !log swfrench@deploy1003 eevans, swfrench: Backport for [[gerrit:1171703|image-suggestion: cleanup unused refs to service listener (T368096)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:12:59] T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096 [17:13:45] !log swfrench@deploy1003 eevans, swfrench: Continuing with sync [17:16:32] (03PS1) 10Ebernhardson: cirrus: Stop using auto_expand_replicas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182192 (https://phabricator.wikimedia.org/T402627) [17:19:06] !log swfrench@deploy1003 Finished scap sync-world: Backport for [[gerrit:1171703|image-suggestion: cleanup unused refs to service listener (T368096)]] (duration: 12m 15s) [17:19:11] T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096 [17:20:00] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:21:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:23:17] ^ spurious 503 when the httpbb check coincided with a deployment (manual run succeeded) [17:25:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P81788 and previous config saved to /var/cache/conftool/dbconfig/20250826-172516-fceratto.json [17:26:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:28:28] (03PS2) 10Dzahn: mariadb: replace legacy fact for memorysize [puppet] - 10https://gerrit.wikimedia.org/r/1180999 
[17:29:19] 06SRE, 06Infrastructure-Foundations, 10Puppet CI, 13Patch-For-Review: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480#11120500 (10BCornwall) If this is an "official" policy, should https://wikitech... [17:30:00] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:32:26] (03PS3) 10Dzahn: mariadb: replace legacy fact for memorysize [puppet] - 10https://gerrit.wikimedia.org/r/1180999 [17:32:49] (03CR) 10Dzahn: "for one reason" [puppet] - 10https://gerrit.wikimedia.org/r/1180999 (owner: 10Dzahn) [17:33:16] (03PS4) 10Dzahn: mariadb: replace legacy fact for memorysize [puppet] - 10https://gerrit.wikimedia.org/r/1180999 [17:33:51] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11120530 (10phaultfinder) [17:38:52] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11120560 (10phaultfinder) [17:39:45] (03CR) 10Dzahn: "same on trixie and bookworm:" [puppet] - 10https://gerrit.wikimedia.org/r/1180999 (owner: 10Dzahn) [17:40:17] (03CR) 10Dzahn: "see my test example in comments. 
it does behave differently on trixie, it must be the ruby version since puppet version stays the same" [puppet] - 10https://gerrit.wikimedia.org/r/1180999 (owner: 10Dzahn) [17:40:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T401906)', diff saved to https://phabricator.wikimedia.org/P81789 and previous config saved to /var/cache/conftool/dbconfig/20250826-174023-fceratto.json [17:40:29] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [17:40:39] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2197.codfw.wmnet with reason: Maintenance [17:40:59] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2214.codfw.wmnet with reason: Maintenance [17:41:02] (03CR) 10Scott French: [C:03+1] "Nice! This "deflates" `start` in a way that makes it a lot more readable, and agreed that "how much `args` do you need?" 
sets a good bound" [puppet] - 10https://gerrit.wikimedia.org/r/1181824 (owner: 10RLazarus) [17:41:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T401906)', diff saved to https://phabricator.wikimedia.org/P81790 and previous config saved to /var/cache/conftool/dbconfig/20250826-174106-fceratto.json [17:42:20] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [17:42:30] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [17:43:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T401906)', diff saved to https://phabricator.wikimedia.org/P81791 and previous config saved to /var/cache/conftool/dbconfig/20250826-174319-fceratto.json [17:43:31] !log ammarpad@deploy1003 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=mediawikiwiki 'API:Main page' 'API:Action API' Ammarpad '--reason=per [[:phab:T402800]]' # T402800 [17:43:36] T402800: Request to move translatable page: mw:API:Main_page - https://phabricator.wikimedia.org/T402800 [17:44:39] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on people1005.eqiad.wmnet with reason: T402596 [17:44:44] T402596: upgrade people servers to trixie - https://phabricator.wikimedia.org/T402596 [17:46:54] (03CR) 10Dzahn: [V:03+1 C:03+2] lists::monitoring: increase check_interval to 5 minutes [puppet] - 10https://gerrit.wikimedia.org/r/1181213 (owner: 10Dzahn) [17:56:36] 06SRE, 10envoy, 06serviceops, 06Traffic, 13Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11120660 (10Dzahn) [17:57:59] 06SRE: provide envoyproxy package on trixie - https://phabricator.wikimedia.org/T402668#11120682 (10Dzahn) [17:58:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P81793 and previous 
config saved to /var/cache/conftool/dbconfig/20250826-175827-fceratto.json [17:59:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [17:59:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:02:50] (03PS2) 10Ebernhardson: cirrus: Drop absented periodic_job (part 2) [puppet] - 10https://gerrit.wikimedia.org/r/1169210 [18:04:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [18:06:29] (03PS7) 10Bernard Wang: Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) [18:09:27] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1170433/6764/" [puppet] - 10https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [18:11:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:13:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P81794 and previous config saved to /var/cache/conftool/dbconfig/20250826-181334-fceratto.json [18:14:26] PROBLEM - mailman list info on lists1004 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:14:56] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:19:54] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 7.868 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:22:01] (03PS8) 10Bernard Wang: Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) [18:22:17] 06SRE, 06Traffic, 10Wikidata, 10Wikidata-Query-Service: Find a solution for SPARQL federation that is blocked by stricter user agent policy enforcement - https://phabricator.wikimedia.org/T402959 (10Lydia_Pintscher) 03NEW [18:22:51] (03CR) 10CI reject: [V:04-1] Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) (owner: 10Bernard Wang) [18:23:32] (03PS9) 10Bernard Wang: Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) [18:24:39] (03CR) 10CI reject: [V:04-1] Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) (owner: 10Bernard Wang) [18:25:38] 10ops-eqiad, 06cloud-services-team, 06DC-Ops: KernelErrors Server cloudcephosd1052 logged kernel errors - https://phabricator.wikimedia.org/T402938#11120820 (10Jclark-ctr) [18:25:59] (03PS10) 10Bernard Wang: Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) [18:28:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 
(T401906)', diff saved to https://phabricator.wikimedia.org/P81795 and previous config saved to /var/cache/conftool/dbconfig/20250826-182842-fceratto.json [18:28:48] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [18:28:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2217.codfw.wmnet with reason: Maintenance [18:29:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T401906)', diff saved to https://phabricator.wikimedia.org/P81796 and previous config saved to /var/cache/conftool/dbconfig/20250826-182905-fceratto.json [18:29:20] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.444 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:29:51] 06SRE, 06Traffic, 13Patch-For-Review: apt-staging: add headers to prevent CDN caching - https://phabricator.wikimedia.org/T402284#11120837 (10ssingh) >>! In T402284#11101087, @fnegri wrote: > @Dzahn fine with me, but if there's an easy way to keep e.g. a 5-minute cache it could be nice to have. I'll let @Joe... 
[18:31:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T401906)', diff saved to https://phabricator.wikimedia.org/P81797 and previous config saved to /var/cache/conftool/dbconfig/20250826-183118-fceratto.json [18:33:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:35:07] (03PS1) 10Jforrester: Provide abstractwiki-rust-1.85 based on Trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182201 [18:38:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:39:33] (03CR) 10Dzahn: [V:03+1 C:03+2] "confirmed it was a noop on all 3 servers" [puppet] - 10https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [18:42:16] 06SRE, 06Traffic, 13Patch-For-Review: apt-staging: add headers to prevent CDN caching - https://phabricator.wikimedia.org/T402284#11120914 (10MoritzMuehlenhoff) >>! In T402284#11120837, @ssingh wrote: >>>! In T402284#11101087, @fnegri wrote: >> @Dzahn fine with me, but if there's an easy way to keep e.g. a 5... 
[18:42:28] (03CR) 10RLazarus: [C:03+2] mathoid: Upgrade to envoy-future:1.26.8-2 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181806 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [18:44:06] (03Merged) 10jenkins-bot: mathoid: Upgrade to envoy-future:1.26.8-2 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181806 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [18:44:16] (03CR) 10Ssingh: [C:03+1] "We have a confirmation from @mmuhlenhoff@wikimedia.org on the task as well." [puppet] - 10https://gerrit.wikimedia.org/r/1180234 (https://phabricator.wikimedia.org/T402284) (owner: 10Dzahn) [18:46:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P81798 and previous config saved to /var/cache/conftool/dbconfig/20250826-184625-fceratto.json [18:47:17] (03CR) 10Eevans: [C:03+1] thanos - put thanos-be2005 back into rings [puppet] - 10https://gerrit.wikimedia.org/r/1182182 (https://phabricator.wikimedia.org/T400876) (owner: 10MVernon) [18:49:00] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1182202 [18:49:35] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/mathoid: apply [18:50:06] (03CR) 10Eevans: [C:03+1] swift: re-add 3 codfw hosts, drain the next 3 [puppet] - 10https://gerrit.wikimedia.org/r/1182174 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [18:50:52] (03CR) 10Ahmon Dancy: [C:03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1182202 (owner: 10Ahmon Dancy) [18:51:45] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1182202 (owner: 10Ahmon Dancy) [18:51:49] RESOLVED: PuppetFailure: Puppet 
has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:54:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:56:33] (03PS1) 10Ssingh: dumps: remove link to spam domain [puppet] - 10https://gerrit.wikimedia.org/r/1182203 [18:59:45] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/mathoid: apply [19:01:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P81799 and previous config saved to /var/cache/conftool/dbconfig/20250826-190133-fceratto.json [19:02:54] (03CR) 10BCornwall: [C:03+2] dumps: remove link to spam domain [puppet] - 10https://gerrit.wikimedia.org/r/1182203 (owner: 10Ssingh) [19:03:03] (03CR) 10BCornwall: [C:03+1] dumps: remove link to spam domain [puppet] - 10https://gerrit.wikimedia.org/r/1182203 (owner: 10Ssingh) [19:03:23] (03CR) 10Ssingh: [C:03+2] dumps: remove link to spam domain [puppet] - 10https://gerrit.wikimedia.org/r/1182203 (owner: 10Ssingh) [19:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:04:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:04:56] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:09:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:14:20] (03CR) 10Ayounsi: [C:03+1] homer: Update DHCP server in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1182165 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [19:14:50] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 3.860 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:16:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T401906)', diff saved to https://phabricator.wikimedia.org/P81800 and previous config saved to /var/cache/conftool/dbconfig/20250826-191640-fceratto.json [19:16:46] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [19:16:57] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2224.codfw.wmnet with reason: Maintenance [19:17:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T401906)', diff saved to https://phabricator.wikimedia.org/P81801 and previous config saved to 
/var/cache/conftool/dbconfig/20250826-191702-fceratto.json [19:18:56] !log dancy@deploy1003 Installing scap version "4.210.0" for 2 host(s) [19:19:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T401906)', diff saved to https://phabricator.wikimedia.org/P81802 and previous config saved to /var/cache/conftool/dbconfig/20250826-191915-fceratto.json [19:19:26] (03PS1) 10Bking: opensearch-cluster: add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182205 (https://phabricator.wikimedia.org/T397246) [19:20:01] (03Abandoned) 10Bking: opensearch-cluster: add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182205 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [19:20:44] !log dancy@deploy1003 Installation of scap version "4.210.0" completed for 2 hosts [19:21:14] !log dancy@deploy1003 Started scap sync-world: Testing T402508 fix [19:21:18] T402508: Exception while building "next" image - https://phabricator.wikimedia.org/T402508 [19:22:16] * swfrench-wmf is around and excited to see how this goes [19:23:25] 06SRE, 06Infrastructure-Foundations, 10netops: Investigate using BGP addpath for unicast IBGP spine/leaf pods - https://phabricator.wikimedia.org/T402640#11121128 (10ayounsi) If I understand correctly we currently get some "per rack" load balancing, where `E3` might randomly prefer `E1` but servers in `E4` m... 
[19:24:16] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:25:04] (03PS1) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [19:26:25] (03PS5) 10Bking: Introduce opensearch-operator-crds chart (1/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1173947 (https://phabricator.wikimedia.org/T397246) [19:29:05] (03PS1) 10Ahmon Dancy: Revert "scap.cfg.erb: Disable build_mw_next_container_image" [puppet] - 10https://gerrit.wikimedia.org/r/1182207 [19:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:33:06] dancy: is now a good time to merge your revert to reenable `build_mw_next_container_image`? 
[19:33:19] in a few minutes please [19:33:27] ack, sounds good :) [19:34:03] ah, I misread the backscroll - I see your initial test is still running [19:34:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P81803 and previous config saved to /var/cache/conftool/dbconfig/20250826-193422-fceratto.json [19:39:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:39:56] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:41:47] (03PS13) 10Bking: opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) [19:43:11] (03PS14) 10Bking: opensearch-operator: Add chart for review (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) [19:44:21] swfrench-wmf: Ready! [19:44:28] (03PS2) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [19:44:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:45:21] dancy: ah, great! 
[19:45:45] (03CR) 10Scott French: [C:03+2] Revert "scap.cfg.erb: Disable build_mw_next_container_image" [puppet] - 10https://gerrit.wikimedia.org/r/1182207 (owner: 10Ahmon Dancy) [19:46:48] (03CR) 10Dzahn: "+1 - fwiw the "dvd.wikimedia.org" link right above that does not seem to exist" [puppet] - 10https://gerrit.wikimedia.org/r/1182203 (owner: 10Ssingh) [19:47:43] dancy: I'll go ahead and run the agent on deploy1003 to pick up the change? [19:47:51] Yes please [19:48:36] * swfrench-wmf is doing [19:48:47] ETA ~ 5 minutes or so before the change is live [19:49:03] (03PS3) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [19:49:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P81804 and previous config saved to /var/cache/conftool/dbconfig/20250826-194930-fceratto.json [19:49:48] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 2.205 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:50:45] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [19:51:32] (03CR) 10Ssingh: [C:03+2] dumps: remove link to spam domain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182203 (owner: 10Ssingh) [19:53:18] dancy: done! 
[19:53:40] (03PS1) 10RLazarus: envoy-future: Sync dockerfile changes from envoy image to envoy-future [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182209 (https://phabricator.wikimedia.org/T402584) [19:54:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:58:41] (03CR) 10RLazarus: [V:03+2] "Tested with local docker-pkg, and an envoy.yaml using /var/run/envoy/admin.sock as we do in prod." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182209 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [19:59:13] (03PS1) 10RLazarus: mathoid: Upgrade to envoy-future:1.26.8-3 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182210 (https://phabricator.wikimedia.org/T402584) [19:59:22] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.784 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:59:57] (03CR) 10Cwhite: [C:03+2] opensearch: selectively enable cluster health check [puppet] - 10https://gerrit.wikimedia.org/r/1181791 (https://phabricator.wikimedia.org/T321808) (owner: 10Cwhite) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:02:21] I'll use some of the backport window to continue scap testing [20:04:34] PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [20:04:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T401906)', diff saved to https://phabricator.wikimedia.org/P81805 and previous config saved to /var/cache/conftool/dbconfig/20250826-200437-fceratto.json [20:04:44] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [20:05:59] (03CR) 10RLazarus: [C:03+2] deployment_server: mwscript_k8s refactor [puppet] - 10https://gerrit.wikimedia.org/r/1181824 (owner: 10RLazarus) [20:09:40] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 124.66 ms [20:12:50] (03PS15) 10Bking: opensearch-operator: Add chart for review (2/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) [20:14:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:16:14] !log dancy@deploy1003 Finished scap sync-world: Testing T402508 fix (duration: 55m 01s) [20:16:20] T402508: Exception while building "next" image - https://phabricator.wikimedia.org/T402508 [20:16:36] !log dancy@deploy1003 Started scap sync-world: Testing T402508 fix (phase 2) [20:21:51] (03CR) 10Dzahn: dumps: remove link to spam domain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182203 (owner: 10Ssingh) [20:24:22] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.806 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:26:15] (03PS4) 10Bking: opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [20:26:41] (03PS5) 10Bking: 
opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) [20:28:14] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add chart for review (3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182206 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [20:28:22] 10ops-esams, 06SRE, 06DC-Ops: Inbound errors on interface cr1-esams:xe-0/0/8 (Transit: Arelion (IC-381309) {#30386}) - https://phabricator.wikimedia.org/T393213#11121371 (10RobH) 05Open→03In progress a:03cmooney @cmooney, So we have this faulty link on: ` xe-0/0/8 up up Transit: Arelion... [20:30:01] !log dancy@deploy1003 Finished scap sync-world: Testing T402508 fix (phase 2) (duration: 13m 25s) [20:30:06] T402508: Exception while building "next" image - https://phabricator.wikimedia.org/T402508 [20:34:28] (03PS1) 10Ahmon Dancy: wmf-config/InitialiseSettings-dev.php: Disable wmgUseContentTranslation [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1182212 [20:34:50] (03CR) 10Ahmon Dancy: [C:03+2] wmf-config/InitialiseSettings-dev.php: Disable wmgUseContentTranslation [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1182212 (owner: 10Ahmon Dancy) [20:35:39] (03Merged) 10jenkins-bot: wmf-config/InitialiseSettings-dev.php: Disable wmgUseContentTranslation [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1182212 (owner: 10Ahmon Dancy) [20:41:15] (03CR) 10Scott French: [C:03+1] envoy-future: Sync dockerfile changes from envoy image to envoy-future [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182209 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [20:42:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:47:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:52:08] (03PS1) 10Dzahn: zuul: fix syntax issue in nodepool.yaml config [puppet] - 10https://gerrit.wikimedia.org/r/1182214 (https://phabricator.wikimedia.org/T401614) [20:52:14] (03CR) 10RLazarus: [V:03+2 C:03+2] envoy-future: Sync dockerfile changes from envoy image to envoy-future [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182209 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [20:52:23] (03CR) 10CI reject: [V:04-1] zuul: fix syntax issue in nodepool.yaml config [puppet] - 10https://gerrit.wikimedia.org/r/1182214 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [20:52:33] (03CR) 10Dzahn: [C:03+2] zuul: fix syntax issue in nodepool.yaml config [puppet] - 10https://gerrit.wikimedia.org/r/1182214 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [20:52:42] (03PS2) 10Dzahn: zuul: fix syntax issue in nodepool.yaml config [puppet] - 10https://gerrit.wikimedia.org/r/1182214 (https://phabricator.wikimedia.org/T401614) [20:56:57] (03CR) 10Dzahn: [C:03+2] zuul: fix syntax issue in nodepool.yaml config [puppet] - 10https://gerrit.wikimedia.org/r/1182214 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [20:59:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T2100) [21:02:23] (03PS6) 10Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [21:11:07] (03CR) 10CI reject: [V:04-1] Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [21:12:22] Got a few deploys today. [21:14:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:15:14] (03PS1) 10Jdlrobson: Consolidate search config to match Minerva [skins/Vector] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182215 (https://phabricator.wikimedia.org/T397084) [21:15:17] (03PS1) 10Jdlrobson: Add support for typeahead search options in config [skins/MinervaNeue] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182216 (https://phabricator.wikimedia.org/T402051) [21:16:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) (owner: 10Bernard Wang) [21:16:59] (03Merged) 10jenkins-bot: Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) (owner: 10Bernard Wang) [21:17:30] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1179750|Update vector search config with new wgVectorTypeahead (T397084)]] [21:17:36] T397084: Clean up and consolidate typeahead search config across Minerva and Vector - https://phabricator.wikimedia.org/T397084 [21:19:13] (03PS1) 10Dzahn: zuul: 
provide zookeeper-tls config values in the correct format [puppet] - 10https://gerrit.wikimedia.org/r/1182217 (https://phabricator.wikimedia.org/T401614) [21:19:48] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.231 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:22:14] (03PS2) 10Dduvall: profile::buildkitd: Standalone buildkitd profile [puppet] - 10https://gerrit.wikimedia.org/r/1181198 (https://phabricator.wikimedia.org/T390119) [21:22:14] (03PS2) 10Dduvall: deployment_server: buildkitd for MediaWiki image builds [puppet] - 10https://gerrit.wikimedia.org/r/1181199 (https://phabricator.wikimedia.org/T392526) [21:22:58] (03CR) 10Dduvall: "Fixed" [puppet] - 10https://gerrit.wikimedia.org/r/1181198 (https://phabricator.wikimedia.org/T390119) (owner: 10Dduvall) [21:23:35] !log jdlrobson@deploy1003 bwang, jdlrobson: Backport for [[gerrit:1179750|Update vector search config with new wgVectorTypeahead (T397084)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:23:40] T397084: Clean up and consolidate typeahead search config across Minerva and Vector - https://phabricator.wikimedia.org/T397084 [21:24:36] bwang: we are up and ready to test on debug servers [21:24:40] ok [21:26:09] (03PS1) 10Ryan Kemper: opensearch-k8s: allow setting vm.max_map_count [puppet] - 10https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) [21:26:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:26:26] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) (owner: 10Ryan Kemper) [21:26:40] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) (owner: 10Ryan Kemper) [21:30:24] (03PS2) 10Ryan Kemper: opensearch-k8s: allow setting vm.max_map_count [puppet] - 10https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) [21:31:02] (03PS7) 10Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [21:31:18] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) (owner: 10Ryan Kemper) [21:32:14] (debugging an issue before syncing) [21:33:15] (03PS3) 10Ryan Kemper: opensearch-k8s: allow setting vm.max_map_count [puppet] - 10https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) [21:33:29] (03PS2) 10Dzahn: zuul: provide zookeeper-tls config in correct format, add ca cert [puppet] - 10https://gerrit.wikimedia.org/r/1182217 (https://phabricator.wikimedia.org/T401614) [21:33:33] 
(03PS4) 10Ryan Kemper: opensearch-k8s: allow setting vm.max_map_count [puppet] - 10https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) [21:33:44] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) (owner: 10Ryan Kemper) [21:33:45] (03CR) 10CI reject: [V:04-1] zuul: provide zookeeper-tls config in correct format, add ca cert [puppet] - 10https://gerrit.wikimedia.org/r/1182217 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [21:34:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:34:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) (owner: 10Ryan Kemper) [21:34:39] (03PS3) 10Dzahn: zuul: provide zookeeper-tls config in correct format, add ca cert [puppet] - 10https://gerrit.wikimedia.org/r/1182217 (https://phabricator.wikimedia.org/T401614) [21:34:53] (03CR) 10Scott French: [C:03+1] mathoid: Upgrade to envoy-future:1.26.8-3 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182210 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [21:34:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:36:27] !log jdlrobson@deploy1003 Sync cancelled. [21:36:34] cancelling due to an issue. 
[21:38:05] (03CR) 10Dzahn: [C:03+2] zuul: provide zookeeper-tls config in correct format, add ca cert [puppet] - 10https://gerrit.wikimedia.org/r/1182217 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [21:39:22] (03CR) 10CI reject: [V:04-1] Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [21:39:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 8.616 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:39:54] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 5.658 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:40:36] (03PS1) 10Eevans: Revert "data-gateway: enable debug logging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182220 [21:41:02] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6765/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [21:41:07] (03CR) 10RLazarus: [C:03+2] mathoid: Upgrade to envoy-future:1.26.8-3 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182210 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [21:43:01] (03Merged) 10jenkins-bot: mathoid: Upgrade to envoy-future:1.26.8-3 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182210 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [21:43:28] (03PS8) 10Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [21:44:44] (03PS43) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 
(https://phabricator.wikimedia.org/T381608) [21:45:40] (03PS1) 10Jdlrobson: Explicitly define enwiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182221 (https://phabricator.wikimedia.org/T397084) [21:46:04] (03PS2) 10Jdlrobson: Explicitly define enwiki wgVectorTypeahead config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182221 (https://phabricator.wikimedia.org/T397084) [21:46:12] (03CR) 10Scott French: [C:03+1] Revert "data-gateway: enable debug logging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182220 (owner: 10Eevans) [21:46:57] (attempting a follow up) [21:48:31] (03CR) 10Bernard Wang: [C:03+1] Explicitly define enwiki wgVectorTypeahead config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182221 (https://phabricator.wikimedia.org/T397084) (owner: 10Jdlrobson) [21:49:13] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6766/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [21:49:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/MinervaNeue] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182216 (https://phabricator.wikimedia.org/T402051) (owner: 10Jdlrobson) [21:49:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182221 (https://phabricator.wikimedia.org/T397084) (owner: 10Jdlrobson) [21:50:35] (03Merged) 10jenkins-bot: Explicitly define enwiki wgVectorTypeahead config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182221 (https://phabricator.wikimedia.org/T397084) (owner: 10Jdlrobson) [21:50:53] (03Merged) 10jenkins-bot: Add support for typeahead search options in config [skins/MinervaNeue] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182216 
(https://phabricator.wikimedia.org/T402051) (owner: 10Jdlrobson) [21:51:19] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1182216|Add support for typeahead search options in config (T402051)]], [[gerrit:1182221|Explicitly define enwiki wgVectorTypeahead config (T397084)]] [21:51:25] T402051: Wikidata shows thumbnails in search suggestions on mobile - https://phabricator.wikimedia.org/T402051 [21:51:26] T397084: Clean up and consolidate typeahead search config across Minerva and Vector - https://phabricator.wikimedia.org/T397084 [21:52:01] (03CR) 10CI reject: [V:04-1] Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [21:52:32] (03PS1) 10Dzahn: zuul::main: add zookeeper server IP as parameter, do DNS lookup [puppet] - 10https://gerrit.wikimedia.org/r/1182223 (https://phabricator.wikimedia.org/T401614) [21:53:33] (03PS2) 10Dzahn: zuul::main: add zookeeper server IP as parameter, do DNS lookup [puppet] - 10https://gerrit.wikimedia.org/r/1182223 (https://phabricator.wikimedia.org/T401614) [21:54:40] (03PS44) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [21:55:52] (03PS3) 10Dzahn: zuul::main: add zookeeper server IP as parameter, do DNS lookup [puppet] - 10https://gerrit.wikimedia.org/r/1182223 (https://phabricator.wikimedia.org/T401614) [21:55:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:55:56] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6767/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [21:57:37] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1182216|Add support for typeahead search options in config (T402051)]], [[gerrit:1182221|Explicitly define enwiki wgVectorTypeahead config (T397084)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:57:44] T402051: Wikidata shows thumbnails in search suggestions on mobile - https://phabricator.wikimedia.org/T402051 [21:57:44] T397084: Clean up and consolidate typeahead search config across Minerva and Vector - https://phabricator.wikimedia.org/T397084 [21:58:28] (03PS4) 10Dzahn: zuul::main: add zookeeper server IP as parameter, do DNS lookup [puppet] - 10https://gerrit.wikimedia.org/r/1182223 (https://phabricator.wikimedia.org/T401614) [22:00:02] (03CR) 10BryanDavis: "This removed the traffic management layer that we have been using to manage unwanted bot traffic in Beta Cluster. T393487" [puppet] - 10https://gerrit.wikimedia.org/r/1175991 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [22:00:35] !log jdlrobson@deploy1003 jdlrobson: Continuing with sync [22:00:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:01:09] (03PS6) 10Papaul: Add BGP on mr1-ulsfo and temporary remove replace ospf [homer/public] - 10https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) [22:01:14] (03PS1) 10Xcollazo: dumps: remove dead links. 
[puppet] - 10https://gerrit.wikimedia.org/r/1182225 (https://phabricator.wikimedia.org/T402976) [22:02:32] (03CR) 10CI reject: [V:04-1] Add BGP on mr1-ulsfo and temporary remove replace ospf [homer/public] - 10https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [22:03:20] (03PS7) 10Papaul: Add BGP on mr1-ulsfo and temporary remove replace ospf [homer/public] - 10https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) [22:04:09] (03PS5) 10Dzahn: zuul::main: add zookeeper server IP as parameter, do DNS lookup [puppet] - 10https://gerrit.wikimedia.org/r/1182223 (https://phabricator.wikimedia.org/T401614) [22:04:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:04:31] (03CR) 10CDobbins: [V:03+1] dnsrecursor: add recursor.yml.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [22:04:55] (03CR) 10CI reject: [V:04-1] Add BGP on mr1-ulsfo and temporary remove replace ospf [homer/public] - 10https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [22:04:58] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:05:59] (03PS1) 10Bernard Wang: Fix VectorTypeahead config to avoid + [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182227 [22:07:00] running over my window sorry y'all [22:07:27] let me know if you need the deploy conch [22:07:54] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182216|Add support for typeahead search options in config (T402051)]], [[gerrit:1182221|Explicitly define enwiki wgVectorTypeahead config (T397084)]] (duration: 16m 35s) [22:08:00] T402051: Wikidata shows thumbnails in search suggestions on mobile - 
https://phabricator.wikimedia.org/T402051
[22:08:01] T397084: Clean up and consolidate typeahead search config across Minerva and Vector - https://phabricator.wikimedia.org/T397084
[22:08:12] Jdlrobson: I'll mess with prod a little when you're done, but no rush :)
[22:09:03] (PS1) Krinkle: tests: Add test for wmfApplyEtcdDBConfig() [mediawiki-config] - https://gerrit.wikimedia.org/r/1182228
[22:09:03] (PS1) Krinkle: etcd: Add x1 and x3 as alias for extension1 and extension3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1182229
[22:09:49] ok beginning last backport now
[22:09:50] (CR) CI reject: [V:-1] tests: Add test for wmfApplyEtcdDBConfig() [mediawiki-config] - https://gerrit.wikimedia.org/r/1182228 (owner: Krinkle)
[22:09:53] will let you know rzl when done
[22:09:53] (CR) CI reject: [V:-1] etcd: Add x1 and x3 as alias for extension1 and extension3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1182229 (owner: Krinkle)
[22:10:01] thanks!
[22:10:05] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1182227 (owner: Bernard Wang)
[22:10:46] So bwang I think the result we are expecting for the above is thumbnails disappearing on search results on https://test.m.wikipedia.org/wiki/Main_Page - is that correct?
[22:10:51] (Merged) jenkins-bot: Fix VectorTypeahead config to avoid + [mediawiki-config] - https://gerrit.wikimedia.org/r/1182227 (owner: Bernard Wang)
[22:10:53] If so we should be in a good place to leave it
[22:11:18] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1182227|Fix VectorTypeahead config to avoid +]]
[22:11:19] correct
[22:11:55] (PS2) Krinkle: tests: Add test for wmfApplyEtcdDBConfig() [mediawiki-config] - https://gerrit.wikimedia.org/r/1182228
[22:11:55] (PS2) Krinkle: etcd: Add x1 and x3 as alias for extension1 and extension3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1182229
[22:12:02] 🤞
[22:14:39] bwang: it's not quite ready but debug servers are looking promising
[22:16:24] (CR) Dzahn: [C:+1] dumps: remove dead links. [puppet] - https://gerrit.wikimedia.org/r/1182225 (https://phabricator.wikimedia.org/T402976) (owner: Xcollazo)
[22:17:14] !log jdlrobson@deploy1003 bwang, jdlrobson: Backport for [[gerrit:1182227|Fix VectorTypeahead config to avoid +]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:17:27] ^ bwang
[22:18:23] It looks good to me
[22:18:28] Tested on office and test wiki
[22:18:34] And en looks good
[22:18:35] !log jdlrobson@deploy1003 bwang, jdlrobson: Continuing with sync
[22:19:16] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:19:48] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54681 bytes in 0.129 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:19:55] (CR) Jdlrobson: [C:+1] "We should be good to backport this tomorrow ."
[skins/Vector] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1182215 (https://phabricator.wikimedia.org/T397084) (owner: Jdlrobson)
[22:23:47] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182227|Fix VectorTypeahead config to avoid +]] (duration: 12m 28s)
[22:23:57] ok rzl all yours!
[22:24:25] Jdlrobson: thanks!
[22:25:08] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/mathoid: apply
[22:29:31] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/mathoid: apply
[22:30:44] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1182223/6770/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1182223 (https://phabricator.wikimedia.org/T401614) (owner: Dzahn)
[22:30:53] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mathoid: apply
[22:31:29] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mathoid: apply
[22:33:00] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mathoid: apply
[22:33:31] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply
[22:35:10] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool es2039* gradually with 4 steps - Work done
[22:37:28] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[22:38:00] ^ $ helmfile -e eqiad -i apply -l name=pinkunicorn --set mesh.image_name=envoy_future --set mesh.image_version=1.26.8-3 --context=5
[22:39:58] envoy-future, rather :)
[22:40:01] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[22:40:07] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[22:40:28] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[22:43:46] (CR) Dzahn: [C:+2] zuul::main: add zookeeper server IP as parameter, do DNS lookup [puppet] - https://gerrit.wikimedia.org/r/1182223 (https://phabricator.wikimedia.org/T401614) (owner: Dzahn)
[22:45:44] envoy-future 1.26.8 looks good on both mw-debug and mathoid -- I'm going to restore mw-debug to repo values then start promoting regular envoy to the new version
[22:49:35] (only one config field deprecation warning in the logs, too)
[22:49:54] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[22:50:15] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[22:57:39] SRE, envoy, serviceops, Traffic: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11121811 (RLazarus) Validated on mathoid and mw-debug (mathoid still on envoy-future, mw-debug back on 1.23 for now). One config warning in the logs from mw-debug: ` [2025-08-26...
[22:58:59] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11121812 (Josve05a) For preservation of information the user blanked both their gadget script and their common.js following a request on their user talk page....
[23:00:03] (03PS3) 10Scott French: trafficserver: rename mw-php-migration to mw-next-routing [puppet] - 10https://gerrit.wikimedia.org/r/1154900 (https://phabricator.wikimedia.org/T391421) [23:00:06] (03PS4) 10Scott French: trafficserver: generalize mw-next-routing.lua and prep for PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1154901 (https://phabricator.wikimedia.org/T391421) [23:01:15] !log reprepro -C main includedeb bullseye-wikimedia /srv/wikimedia/pool/component/envoy-future/e/envoyproxy/envoyproxy_1.26.8-1_amd64.deb # T402584 [23:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:20] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [23:01:54] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [23:02:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2157 (T402925)', diff saved to https://phabricator.wikimedia.org/P81809 and previous config saved to /var/cache/conftool/dbconfig/20250826-230201-ladsgroup.json [23:02:06] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [23:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:10:59] (03CR) 10Cwhite: [C:03+1] nrpewrapper: 
correlate Prometheus "for:" duration with Icinga timing [puppet] - 10https://gerrit.wikimedia.org/r/1182148 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [23:11:32] 07Puppet, 06SRE, 10Readers Essential Work 2025 (Simplify MobileFrontend): Certain mobile devices are (possibly) not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#11121841 (10Jdlrobson-WMF) [23:16:42] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T402925)', diff saved to https://phabricator.wikimedia.org/P81811 and previous config saved to /var/cache/conftool/dbconfig/20250826-231641-ladsgroup.json [23:16:46] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [23:19:49] (03CR) 10Cwhite: mirrormaker: add alerts directly in Prometheus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [23:20:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2039* gradually with 4 steps - Work done [23:20:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:31:49] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P81813 and previous 
config saved to /var/cache/conftool/dbconfig/20250826-233148-ladsgroup.json [23:34:41] 06SRE, 10envoy, 06serviceops, 06Traffic: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11121920 (10RLazarus) More deprecation warnings from the API Gateway (started locally after modifying charts/api-gateway/values-devel.yaml to use envoy-future: ` [source/common/pro... [23:38:24] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1182240 [23:38:24] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1182240 (owner: 10TrainBranchBot) [23:46:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P81814 and previous config saved to /var/cache/conftool/dbconfig/20250826-234656-ladsgroup.json [23:52:48] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1182240 (owner: 10TrainBranchBot) [23:53:57] (03CR) 10Dduvall: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181198 (https://phabricator.wikimedia.org/T390119) (owner: 10Dduvall) [23:55:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency