[00:02:14] (03CR) 10Dzahn: [C:03+2] lists: add NEL headers to apache [puppet] - 10https://gerrit.wikimedia.org/r/1177455 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn) [00:08:02] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1177526 [00:08:02] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1177526 (owner: 10TrainBranchBot) [00:09:08] (03CR) 10BCornwall: [C:03+1] create th.wikimedia.org for Wikimedia Thailand [dns] - 10https://gerrit.wikimedia.org/r/1177522 (https://phabricator.wikimedia.org/T400001) (owner: 10Dzahn) [00:10:00] (03CR) 10Dzahn: [C:03+2] "this file is not actually used.. wtf" [puppet] - 10https://gerrit.wikimedia.org/r/1177455 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn) [00:10:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P81002 and previous config saved to /var/cache/conftool/dbconfig/20250812-001022-ladsgroup.json [00:10:30] (03PS1) 10Dzahn: Revert "lists: add NEL headers to apache" [puppet] - 10https://gerrit.wikimedia.org/r/1177527 [00:10:44] (03CR) 10Dzahn: [C:03+2] create th.wikimedia.org for Wikimedia Thailand [dns] - 10https://gerrit.wikimedia.org/r/1177522 (https://phabricator.wikimedia.org/T400001) (owner: 10Dzahn) [00:11:02] !log dzahn@dns1004 START - running authdns-update [00:11:43] (03CR) 10Dzahn: [C:03+2] Revert "lists: add NEL headers to apache" [puppet] - 10https://gerrit.wikimedia.org/r/1177527 (owner: 10Dzahn) [00:12:03] !log dzahn@dns1004 END - running authdns-update [00:15:48] (03CR) 10Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [00:17:24] (03PS5) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) 
[00:17:29] (03CR) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [00:25:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T400854)', diff saved to https://phabricator.wikimedia.org/P81003 and previous config saved to /var/cache/conftool/dbconfig/20250812-002530-ladsgroup.json [00:25:34] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [00:25:46] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [00:25:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T400854)', diff saved to https://phabricator.wikimedia.org/P81004 and previous config saved to /var/cache/conftool/dbconfig/20250812-002553-ladsgroup.json [00:28:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T400854)', diff saved to https://phabricator.wikimedia.org/P81005 and previous config saved to /var/cache/conftool/dbconfig/20250812-002841-ladsgroup.json [00:31:07] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1177526 (owner: 10TrainBranchBot) [00:43:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P81006 and previous config saved to /var/cache/conftool/dbconfig/20250812-004349-ladsgroup.json [00:45:10] 06SRE, 06Traffic-Icebox: Consider using vmod_var instead of temporary headers in VCL - https://phabricator.wikimedia.org/T198620#11076327 (10BCornwall) 05Open→03Invalid This work has actually been ongoing and is tracked in T373550. Closing as a duplicate. 
[00:45:23] 06SRE, 06Traffic-Icebox: Consider using vmod_var instead of temporary headers in VCL - https://phabricator.wikimedia.org/T198620#11076332 (10BCornwall) →14Duplicate dup:03T373550 [00:50:44] (03CR) 10RLazarus: [C:03+1] mediawiki: Completely remove the purge parser cache cron [puppet] - 10https://gerrit.wikimedia.org/r/1175165 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [00:58:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P81007 and previous config saved to /var/cache/conftool/dbconfig/20250812-005856-ladsgroup.json [01:04:23] (03CR) 10Dzahn: [C:03+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [01:07:51] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.14 [core] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1177545 (https://phabricator.wikimedia.org/T396375) [01:07:54] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.14 [core] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1177545 (https://phabricator.wikimedia.org/T396375) (owner: 10TrainBranchBot) [01:10:22] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [01:10:58] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [01:11:08] RECOVERY - Disk space on deploy1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [01:14:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T400854)', diff saved to https://phabricator.wikimedia.org/P81008 and previous config saved to 
/var/cache/conftool/dbconfig/20250812-011403-ladsgroup.json [01:14:08] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [01:14:20] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [01:14:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T400854)', diff saved to https://phabricator.wikimedia.org/P81009 and previous config saved to /var/cache/conftool/dbconfig/20250812-011427-ladsgroup.json [01:18:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T400854)', diff saved to https://phabricator.wikimedia.org/P81010 and previous config saved to /var/cache/conftool/dbconfig/20250812-011817-ladsgroup.json [01:19:55] (03CR) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [01:20:07] (03CR) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [01:21:36] (03PS6) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [01:21:44] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.14 [core] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1177545 (https://phabricator.wikimedia.org/T396375) (owner: 10TrainBranchBot) [01:22:12] (03CR) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [01:22:13] (03PS1) 10Samwilson: InitialiseSettings-labs.php: Fix typo in wgWikisourceEnableBulkOCR [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177548 (https://phabricator.wikimedia.org/T400281) [01:24:44] (03Abandoned) 10Samwilson: InitialiseSettings-labs.php: Fix typo in 
wgWikisourceEnableBulkOCR [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177548 (https://phabricator.wikimedia.org/T400281) (owner: 10Samwilson) [01:26:34] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1044.eqiad.wmnet with OS bookworm [01:26:41] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11076380 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1044.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104... [01:33:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P81011 and previous config saved to /var/cache/conftool/dbconfig/20250812-013325-ladsgroup.json [01:48:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P81012 and previous config saved to /var/cache/conftool/dbconfig/20250812-014833-ladsgroup.json [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T0200) [02:03:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T400854)', diff saved to https://phabricator.wikimedia.org/P81013 and previous config saved to /var/cache/conftool/dbconfig/20250812-020341-ladsgroup.json [02:03:45] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [02:03:56] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1192.eqiad.wmnet with reason: Maintenance [02:04:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T400854)', diff saved to 
https://phabricator.wikimedia.org/P81014 and previous config saved to /var/cache/conftool/dbconfig/20250812-020403-ladsgroup.json [02:05:27] (03CR) 10Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [02:06:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T400854)', diff saved to https://phabricator.wikimedia.org/P81015 and previous config saved to /var/cache/conftool/dbconfig/20250812-020653-ladsgroup.json [02:09:12] (03PS7) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [02:12:33] (03PS1) 10BCornwall: ncmonitor: Ignore cricketinfo sites [puppet] - 10https://gerrit.wikimedia.org/r/1177557 [02:12:58] (03CR) 10CI reject: [V:04-1] ncmonitor: Ignore cricketinfo sites [puppet] - 10https://gerrit.wikimedia.org/r/1177557 (owner: 10BCornwall) [02:13:46] (03PS2) 10BCornwall: ncmonitor: Ignore cricketinfo sites [puppet] - 10https://gerrit.wikimedia.org/r/1177557 [02:15:19] (03CR) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177470 (owner: 10Ncmonitor) [02:22:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P81016 and previous config saved to /var/cache/conftool/dbconfig/20250812-022201-ladsgroup.json [02:27:05] (03CR) 10Pppery: [C:03+1] ncmonitor: Ignore cricketinfo sites [puppet] - 10https://gerrit.wikimedia.org/r/1177557 (owner: 10BCornwall) [02:36:38] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [02:37:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P81017 and previous config saved to /var/cache/conftool/dbconfig/20250812-023709-ladsgroup.json [02:52:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T400854)', diff saved to https://phabricator.wikimedia.org/P81018 and previous config saved to /var/cache/conftool/dbconfig/20250812-025216-ladsgroup.json [02:52:21] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [02:52:32] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1203.eqiad.wmnet with reason: Maintenance [02:52:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T400854)', diff saved to https://phabricator.wikimedia.org/P81019 and previous config saved to /var/cache/conftool/dbconfig/20250812-025239-ladsgroup.json [02:54:32] 06SRE, 06Traffic, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11076470 (10Midleading) Please be more clear about the UA policy enforced here. I am always setting the `Api-User-Agent` header in my code with my Wikipedia username inside, but... 
[02:55:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T400854)', diff saved to https://phabricator.wikimedia.org/P81020 and previous config saved to /var/cache/conftool/dbconfig/20250812-025522-ladsgroup.json [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T0300) [03:01:21] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:09:32] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:10:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P81021 and previous config saved to /var/cache/conftool/dbconfig/20250812-031029-ladsgroup.json [03:17:26] 10SRE-SLO, 06SRE Observability, 10Abstract Wikipedia team (26Q1 (Jul–Sep)), 07Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11076476 (10RLazarus) Thanks @ecarg! I should be able to help with this. A couple of questions, each of them hopeful... 
[03:25:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P81022 and previous config saved to /var/cache/conftool/dbconfig/20250812-032537-ladsgroup.json [03:40:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T400854)', diff saved to https://phabricator.wikimedia.org/P81023 and previous config saved to /var/cache/conftool/dbconfig/20250812-034045-ladsgroup.json [03:40:49] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [03:41:00] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1209.eqiad.wmnet with reason: Maintenance [03:41:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T400854)', diff saved to https://phabricator.wikimedia.org/P81024 and previous config saved to /var/cache/conftool/dbconfig/20250812-034107-ladsgroup.json [03:43:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T400854)', diff saved to https://phabricator.wikimedia.org/P81025 and previous config saved to /var/cache/conftool/dbconfig/20250812-034353-ladsgroup.json [03:59:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P81026 and previous config saved to /var/cache/conftool/dbconfig/20250812-035900-ladsgroup.json [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T0400) [04:04:24] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.11 (duration: 04m 19s) [04:09:51] 10ops-codfw, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401651 
(10phaultfinder) 03NEW [04:14:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P81027 and previous config saved to /var/cache/conftool/dbconfig/20250812-041408-ladsgroup.json [04:29:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T400854)', diff saved to https://phabricator.wikimedia.org/P81028 and previous config saved to /var/cache/conftool/dbconfig/20250812-042915-ladsgroup.json [04:29:20] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [04:29:31] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1214.eqiad.wmnet with reason: Maintenance [04:29:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T400854)', diff saved to https://phabricator.wikimedia.org/P81029 and previous config saved to /var/cache/conftool/dbconfig/20250812-042937-ladsgroup.json [04:32:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T400854)', diff saved to https://phabricator.wikimedia.org/P81030 and previous config saved to /var/cache/conftool/dbconfig/20250812-043212-ladsgroup.json [04:47:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P81031 and previous config saved to /var/cache/conftool/dbconfig/20250812-044719-ladsgroup.json [05:02:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P81032 and previous config saved to /var/cache/conftool/dbconfig/20250812-050227-ladsgroup.json [05:08:10] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:17:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T400854)', diff saved to https://phabricator.wikimedia.org/P81033 and previous config saved to /var/cache/conftool/dbconfig/20250812-051735-ladsgroup.json [05:17:39] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [05:17:50] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1226.eqiad.wmnet with reason: Maintenance [05:17:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T400854)', diff saved to https://phabricator.wikimedia.org/P81034 and previous config saved to /var/cache/conftool/dbconfig/20250812-051757-ladsgroup.json [05:20:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T400854)', diff saved to https://phabricator.wikimedia.org/P81035 and previous config saved to /var/cache/conftool/dbconfig/20250812-052037-ladsgroup.json [05:35:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P81036 and previous config saved to /var/cache/conftool/dbconfig/20250812-053544-ladsgroup.json [05:41:58] PROBLEM - Disk space on an-worker1145 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 155528 MB (4% inode=99%): /var/lib/hadoop/data/f 155831 MB (4% inode=99%): /var/lib/hadoop/data/m 153241 MB (4% inode=99%): /var/lib/hadoop/data/e 159793 MB (4% inode=99%): /var/lib/hadoop/data/c 151986 MB (4% inode=99%): /var/lib/hadoop/data/b 159424 MB (4% inode=99%): /var/lib/hadoop/data/l 154051 MB (4% inode=99%): /var/lib/hadoop/data [05:41:58] 9 MB (4% inode=99%): /var/lib/hadoop/data/g 150136 MB (3% inode=99%): /var/lib/hadoop/data/j 160361 MB (4% inode=99%): /var/lib/hadoop/data/d 
159898 MB (4% inode=99%): /var/lib/hadoop/data/h 157962 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1145&var-datasource=eqiad+prometheus/ops [05:50:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P81037 and previous config saved to /var/cache/conftool/dbconfig/20250812-055052-ladsgroup.json [05:56:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T399249)', diff saved to https://phabricator.wikimedia.org/P81038 and previous config saved to /var/cache/conftool/dbconfig/20250812-055623-fceratto.json [05:56:28] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T0600) [06:00:05] marostegui, Amir1, and federico3: gettimeofday() says it's time for Primary database switchover. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T0600) [06:06:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T400854)', diff saved to https://phabricator.wikimedia.org/P81039 and previous config saved to /var/cache/conftool/dbconfig/20250812-060559-ladsgroup.json [06:06:04] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [06:06:16] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [06:06:58] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2152.codfw.wmnet with reason: Maintenance [06:07:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T400854)', diff saved to https://phabricator.wikimedia.org/P81040 and previous config saved to /var/cache/conftool/dbconfig/20250812-060705-ladsgroup.json [06:07:17] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1044.eqiad.wmnet with OS bookworm [06:07:26] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11076699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1044.eqiad.wmnet with OS bookworm [06:08:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T400854)', diff saved to https://phabricator.wikimedia.org/P81041 and previous config saved to /var/cache/conftool/dbconfig/20250812-060857-ladsgroup.json [06:11:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P81042 and previous config saved to /var/cache/conftool/dbconfig/20250812-061130-fceratto.json [06:13:10] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:18:32] (03CR) 10Tchanders: [C:03+1] Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [06:22:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [06:24:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P81043 and previous config saved to /var/cache/conftool/dbconfig/20250812-062405-ladsgroup.json [06:24:56] (03CR) 10Muehlenhoff: [C:03+2] Remove ldap-admins from cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/1169633 (owner: 10Muehlenhoff) [06:25:33] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Bump the version numbers for Java images based on latest security releases [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1177424 (owner: 10Muehlenhoff) [06:26:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P81044 and previous config saved to /var/cache/conftool/dbconfig/20250812-062638-fceratto.json [06:32:19] (03CR) 10Muehlenhoff: [C:03+2] No longer apply the eventlogging-admins access group to perf and deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/1170111 (https://phabricator.wikimedia.org/T238230) (owner: 10Muehlenhoff) [06:32:42] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell 
SCP reboot policy FORCED [06:32:49] (03PS3) 10Muehlenhoff: Stop applying maps-admins to maps Bookworm roles [puppet] - 10https://gerrit.wikimedia.org/r/1170124 (https://phabricator.wikimedia.org/T381565) [06:33:45] (03PS14) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [06:36:38] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [06:37:40] RECOVERY - Disk space on an-worker1139 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1139&var-datasource=eqiad+prometheus/ops [06:38:09] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [06:38:29] (03PS15) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [06:39:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P81045 and previous config saved to /var/cache/conftool/dbconfig/20250812-063913-ladsgroup.json [06:39:44] vriley@cumin1002 provision (PID 1498264) is awaiting input [06:41:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T399249)', diff saved to https://phabricator.wikimedia.org/P81046 and previous config saved to /var/cache/conftool/dbconfig/20250812-064146-fceratto.json [06:41:51] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [06:42:02] !log fceratto@cumin1002 DONE 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2172.codfw.wmnet with reason: Maintenance [06:42:07] (03CR) 10Slyngshede: [C:03+1] Record LDAP access for thiemowmde [puppet] - 10https://gerrit.wikimedia.org/r/1177414 (https://phabricator.wikimedia.org/T400374) (owner: 10Muehlenhoff) [06:42:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T399249)', diff saved to https://phabricator.wikimedia.org/P81047 and previous config saved to /var/cache/conftool/dbconfig/20250812-064209-fceratto.json [06:42:39] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for thiemowmde [puppet] - 10https://gerrit.wikimedia.org/r/1177414 (https://phabricator.wikimedia.org/T400374) (owner: 10Muehlenhoff) [06:44:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [06:46:45] 06SRE, 10LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11076744 (10QChris) Seeing this task getting moved on some boards ... I've signed the NDA some days ago. @KFrancis Is there anything missing from me? 
[06:46:56] (03PS1) 10Slyngshede: data.yaml: Add tracking entry for caro [puppet] - 10https://gerrit.wikimedia.org/r/1177757 [06:51:43] vriley@cumin1002 provision (PID 1498264) is awaiting input [06:54:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T400854)', diff saved to https://phabricator.wikimedia.org/P81048 and previous config saved to /var/cache/conftool/dbconfig/20250812-065420-ladsgroup.json [06:54:25] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [06:54:36] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2154.codfw.wmnet with reason: Maintenance [06:54:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T400854)', diff saved to https://phabricator.wikimedia.org/P81049 and previous config saved to /var/cache/conftool/dbconfig/20250812-065443-ladsgroup.json [06:56:52] (03PS16) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) [06:57:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T400854)', diff saved to https://phabricator.wikimedia.org/P81050 and previous config saved to /var/cache/conftool/dbconfig/20250812-065726-ladsgroup.json [06:57:56] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1177757 (owner: 10Slyngshede) [07:00:05] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T0700). Please do the needful. [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:01:21] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:03:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [07:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:06:16] (03CR) 10Slyngshede: [C:03+2] data.yaml: Add tracking entry for caro [puppet] - 10https://gerrit.wikimedia.org/r/1177757 (owner: 10Slyngshede) [07:09:32] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:12:34] (03PS3) 10Slyngshede: data.yaml: add users as ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1175839 (https://phabricator.wikimedia.org/T400374) [07:12:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P81051 and previous config saved to /var/cache/conftool/dbconfig/20250812-071234-ladsgroup.json [07:13:06] (03CR) 10Majavah: [C:03+1] "are you sure this is trixie-specific and not a general fix for T394304? 
either way, +1 for having this at least on trixie" [puppet] - 10https://gerrit.wikimedia.org/r/1177450 (https://phabricator.wikimedia.org/T401584) (owner: 10Andrew Bogott) [07:14:47] (03CR) 10Majavah: Add Trixie images (032 comments) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255) (owner: 10Majavah) [07:16:07] (03CR) 10Muehlenhoff: "ready for review, the only difference in PCC compared to the ERB is that the SPDX header no longer appears in the user-visible sshd config" [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [07:17:08] !log hashar@deploy1003 Started deploy [integration/docroot@77c4765]: build: Updating mediawiki/mediawiki-phan-config to 0.17.0 [07:17:21] !log hashar@deploy1003 Finished deploy [integration/docroot@77c4765]: build: Updating mediawiki/mediawiki-phan-config to 0.17.0 (duration: 00m 13s) [07:27:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P81052 and previous config saved to /var/cache/conftool/dbconfig/20250812-072742-ladsgroup.json [07:29:20] 06SRE, 06Traffic, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11076853 (10TheDJ) @Midleading you are always supposed to have a user-agent. Api-user-agent is just for situations where you are unable to MODIFY that agent to provide additiona... 
[07:30:32] RECOVERY - Disk space on an-worker1128 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1128&var-datasource=eqiad+prometheus/ops [07:42:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T400854)', diff saved to https://phabricator.wikimedia.org/P81053 and previous config saved to /var/cache/conftool/dbconfig/20250812-074249-ladsgroup.json [07:42:54] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [07:43:05] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2161.codfw.wmnet with reason: Maintenance [07:43:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T400854)', diff saved to https://phabricator.wikimedia.org/P81054 and previous config saved to /var/cache/conftool/dbconfig/20250812-074312-ladsgroup.json [07:45:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T400854)', diff saved to https://phabricator.wikimedia.org/P81055 and previous config saved to /var/cache/conftool/dbconfig/20250812-074556-ladsgroup.json [07:49:09] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2141.codfw.wmnet with reason: Maintenance [07:49:25] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2145.codfw.wmnet with reason: Maintenance [07:49:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T391056)', diff saved to https://phabricator.wikimedia.org/P81056 and previous config saved to /var/cache/conftool/dbconfig/20250812-074932-fceratto.json [07:49:36] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [07:50:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2145 (T391056)', diff saved to https://phabricator.wikimedia.org/P81057 and previous config saved to /var/cache/conftool/dbconfig/20250812-075041-fceratto.json [07:58:20] RECOVERY - Host ms-fe2017 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [08:01:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P81058 and previous config saved to /var/cache/conftool/dbconfig/20250812-080104-ladsgroup.json [08:05:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P81059 and previous config saved to /var/cache/conftool/dbconfig/20250812-080549-fceratto.json [08:07:25] (03CR) 10Btullis: [C:03+2] Remove the 52 decommissioning hosts from the analytics_hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1177353 (https://phabricator.wikimedia.org/T397172) (owner: 10Btullis) [08:16:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P81060 and previous config saved to /var/cache/conftool/dbconfig/20250812-081611-ladsgroup.json [08:17:40] (03PS2) 10Ladsgroup: mediawiki: Completely remove the purge parser cache cron [puppet] - 10https://gerrit.wikimedia.org/r/1175165 (https://phabricator.wikimedia.org/T398806) [08:17:46] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mediawiki: Completely remove the purge parser cache cron [puppet] - 10https://gerrit.wikimedia.org/r/1175165 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [08:20:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P81061 and previous config saved to /var/cache/conftool/dbconfig/20250812-082056-fceratto.json [08:27:00] Hi! We want to run a maintenance script to add wikidata support for a new language wikipedia. 
Let us know if this is a bad time, otherwise we will proceed (#wikidata-for-wikimedia-projects at WMDE) https://phabricator.wikimedia.org/T399789 [08:29:11] (03PS1) 10MVernon: Prepare ms-fe20[17-20] for production use [puppet] - 10https://gerrit.wikimedia.org/r/1177940 (https://phabricator.wikimedia.org/T401225) [08:31:15] (03PS1) 10Vgutierrez: haproxy: Store client headers before normalization [puppet] - 10https://gerrit.wikimedia.org/r/1177941 (https://phabricator.wikimedia.org/T400270) [08:31:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T400854)', diff saved to https://phabricator.wikimedia.org/P81062 and previous config saved to /var/cache/conftool/dbconfig/20250812-083119-ladsgroup.json [08:31:23] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [08:31:34] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2163.codfw.wmnet with reason: Maintenance [08:31:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T400854)', diff saved to https://phabricator.wikimedia.org/P81063 and previous config saved to /var/cache/conftool/dbconfig/20250812-083141-ladsgroup.json [08:31:52] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11077061 (10MatthewVernon) Thanks! 
[08:34:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T400854)', diff saved to https://phabricator.wikimedia.org/P81064 and previous config saved to /var/cache/conftool/dbconfig/20250812-083426-ladsgroup.json [08:35:04] !log ladsgroup@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [08:35:26] !log ladsgroup@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [08:36:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T391056)', diff saved to https://phabricator.wikimedia.org/P81065 and previous config saved to /var/cache/conftool/dbconfig/20250812-083603-fceratto.json [08:36:08] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [08:36:31] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2146.codfw.wmnet with reason: Maintenance [08:36:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T391056)', diff saved to https://phabricator.wikimedia.org/P81066 and previous config saved to /var/cache/conftool/dbconfig/20250812-083637-fceratto.json [08:37:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T391056)', diff saved to https://phabricator.wikimedia.org/P81067 and previous config saved to /var/cache/conftool/dbconfig/20250812-083746-fceratto.json [08:38:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [08:41:08] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1177941 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [08:41:54] 06SRE, 06serviceops-radar: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11077095 (10Zabe) p:05Unbreak!→03High Ok it went down a bit, but in [[ 
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=deploy1003&var-datasource=000000026&var-cl... [08:44:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [08:46:13] (03CR) 10Federico Ceratto: [C:03+1] "I checked that the hostnames match the description." [puppet] - 10https://gerrit.wikimedia.org/r/1177940 (https://phabricator.wikimedia.org/T401225) (owner: 10MVernon) [08:46:26] !log suzannewood@deploy1003:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https [08:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:34] (03Abandoned) 10Federico Ceratto: zarcillo: allow egress to gerrit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170574 (https://phabricator.wikimedia.org/T389663) (owner: 10Federico Ceratto) [08:48:46] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis rkiwiki, minwikibooks, zghwiktionary, madwikisource, tlwikisource in section s5 [08:49:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P81068 and previous config saved to /var/cache/conftool/dbconfig/20250812-084933-ladsgroup.json [08:52:11] (03PS2) 10Vgutierrez: haproxy: Store client headers before normalization [puppet] - 10https://gerrit.wikimedia.org/r/1177941 (https://phabricator.wikimedia.org/T400270) [08:52:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Checking sanitization for wikis rkiwiki, minwikibooks, zghwiktionary, madwikisource, tlwikisource in section s5 [08:52:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P81069 and previous config saved to /var/cache/conftool/dbconfig/20250812-085254-fceratto.json 
[08:53:57] (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1177941 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [08:54:32] (03CR) 10Urbanecm: [C:03+1] "there is no objection for sufficient time, let's do this. LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174035 (https://phabricator.wikimedia.org/T400755) (owner: 10Zabe) [08:54:43] (03CR) 10Urbanecm: [C:03+1] "Agreed, added my LGTM. Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174035 (https://phabricator.wikimedia.org/T400755) (owner: 10Zabe) [08:54:53] jouncebot: nowandnext [08:54:53] No deployments scheduled for the next 1 hour(s) and 5 minute(s) [08:54:53] In 1 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1000) [08:55:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174035 (https://phabricator.wikimedia.org/T400755) (owner: 10Zabe) [08:55:23] (03CR) 10Vgutierrez: [C:03+2] haproxy: Store client headers before normalization [puppet] - 10https://gerrit.wikimedia.org/r/1177941 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez) [08:55:31] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11077149 (10JAllemandou) [08:56:11] (03Merged) 10jenkins-bot: Remove centralauth-unmerge from stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174035 (https://phabricator.wikimedia.org/T400755) (owner: 10Zabe) [08:56:41] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1174035|Remove centralauth-unmerge from stewards (T400755)]] [08:56:45] T400755: Remove the stewards ability to delete/unmerge global accounts - https://phabricator.wikimedia.org/T400755 
[08:58:43] suzannewoodWMDE2: Hi, next time, could you run it using mwscript-k8s? The tool supports running on multiple wikis and runs on kubernetes. https://wikitech.wikimedia.org/wiki/Maintenance_scripts#Running_on_multiple_wikis_(the_safe_way) The "old" foreachwiki/mwscript wrappers WILL be deprecated completely soon. [08:58:53] !log urbanecm@deploy1003 zabe, urbanecm: Backport for [[gerrit:1174035|Remove centralauth-unmerge from stewards (T400755)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:59:26] Please spread the word and/or update docs as well, as I've seen one of your colleagues from WMDE run a script with foreachwiki yesterday as well, but wasn't fast enough to tell them before they left the channel [09:00:09] !log urbanecm@deploy1003 zabe, urbanecm: Continuing with sync [09:01:27] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11077173 (10JAllemandou) [09:02:32] (03CR) 10Urbanecm: [C:03+2] Add CommunityConfigurationExample to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173924 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm) [09:03:29] (03Merged) 10jenkins-bot: Add CommunityConfigurationExample to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173924 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm) [09:04:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P81070 and previous config saved to /var/cache/conftool/dbconfig/20250812-090441-ladsgroup.json [09:05:47] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1174035|Remove centralauth-unmerge from stewards (T400755)]] (duration: 09m 05s) [09:05:50] T400755: Remove the stewards ability to 
delete/unmerge global accounts - https://phabricator.wikimedia.org/T400755 [09:06:13] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1173924|Add CommunityConfigurationExample to extension-list (T372049)]] [09:06:17] T372049: Enable CommunityConfiguration Example in one beta wiki - https://phabricator.wikimedia.org/T372049 [09:06:40] claime thanks for flagging I'll check out the docs and let my teammates know [09:07:09] joelyrookewmde: thanks a bunch! [09:08:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P81071 and previous config saved to /var/cache/conftool/dbconfig/20250812-090802-fceratto.json [09:11:21] (03PS1) 10Vgutierrez: Match all headers in HAProxy using a variable [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1177945 [09:16:43] (03CR) 10Vgutierrez: [V:03+2 C:03+2] Match all headers in HAProxy using a variable [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1177945 (owner: 10Vgutierrez) [09:17:21] @claime yes, thanks, we will update our documentation [09:17:36] !log manually insert 'SecurePoll' into zhwiki.content_models # T401641 [09:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:40] T401641: MediaWiki\Storage\NameTableAccessException: No insert possible but primary DB didn't give us a record for 'SecurePoll' in 'content_models' - https://phabricator.wikimedia.org/T401641 [09:17:40] tyvm! 
[09:17:47] !log vgutierrez@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "use a var to match all headers on haproxy - vgutierrez@cumin1002" [09:17:48] !log vgutierrez@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: use a var to match all headers on haproxy - vgutierrez@cumin1002 [09:18:21] !log Finished populateSitesTable for 'zghwiktionary'  https://phabricator.wikimedia.org/T399789 [09:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:38] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: use a var to match all headers on haproxy - vgutierrez@cumin1002 [09:18:39] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "use a var to match all headers on haproxy - vgutierrez@cumin1002" [09:19:34] (03CR) 10Dreamy Jazz: [C:04-1] "Per internal channels, we probably need to hide the onboarding dialog on wikis where temporary accounts would never be present." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [09:19:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T400854)', diff saved to https://phabricator.wikimedia.org/P81072 and previous config saved to /var/cache/conftool/dbconfig/20250812-091948-ladsgroup.json [09:19:53] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [09:20:04] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2164.codfw.wmnet with reason: Maintenance [09:20:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T400854)', diff saved to https://phabricator.wikimedia.org/P81073 and previous config saved to /var/cache/conftool/dbconfig/20250812-092011-ladsgroup.json [09:22:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T400854)', diff saved to https://phabricator.wikimedia.org/P81074 and previous config saved to /var/cache/conftool/dbconfig/20250812-092255-ladsgroup.json [09:23:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T391056)', diff saved to https://phabricator.wikimedia.org/P81075 and previous config saved to /var/cache/conftool/dbconfig/20250812-092310-fceratto.json [09:23:15] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [09:23:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2153.codfw.wmnet with reason: Maintenance [09:23:30] We have now finished running scripts (#wikidata-for-wikimedia-projects at WMDE) [09:23:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T391056)', diff saved to https://phabricator.wikimedia.org/P81076 and previous config saved to /var/cache/conftool/dbconfig/20250812-092334-fceratto.json 
[09:24:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T391056)', diff saved to https://phabricator.wikimedia.org/P81077 and previous config saved to /var/cache/conftool/dbconfig/20250812-092443-fceratto.json [09:25:02] (03CR) 10MVernon: [C:03+2] Prepare ms-fe20[17-20] for production use [puppet] - 10https://gerrit.wikimedia.org/r/1177940 (https://phabricator.wikimedia.org/T401225) (owner: 10MVernon) [09:25:48] 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11077261 (10Clement_Goubert) a:03Clement_Goubert [09:28:21] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-fe[2017-2020].codfw.wmnet with reason: reboot [09:29:43] 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11077281 (10Clement_Goubert) The `/srv/docker/overlay2` is taking up 177GB because we keep 7 days of images, which is probably way overkill. I'll run a prune keeping the last 3 days and will update the relevan... [09:31:43] 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11077286 (10Clement_Goubert) ` # /usr/bin/docker image prune --all --force --filter until=72h [...] 
Total reclaimed space: 102.2GB ` [09:33:41] (03PS1) 10Brouberol: Release new version of the external-services-networkpolicy base template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177947 (https://phabricator.wikimedia.org/T395126) [09:33:43] (03PS1) 10Brouberol: modules/external-services: support custom annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177948 (https://phabricator.wikimedia.org/T395126) [09:33:44] (03PS1) 10Brouberol: datahub: update vendored modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177949 (https://phabricator.wikimedia.org/T395126) [09:33:47] (03PS1) 10Brouberol: datahub: deploy the external services networkpolicies before the setup jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177950 (https://phabricator.wikimedia.org/T395126) [09:34:08] (03PS1) 10Clément Goubert: deployment_server: Prune old images every 3 days [puppet] - 10https://gerrit.wikimedia.org/r/1177951 (https://phabricator.wikimedia.org/T401647) [09:34:28] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 7.661 second response time https://wikitech.wikimedia.org/wiki/Swift [09:34:28] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 8.167 second response time https://wikitech.wikimedia.org/wiki/Swift [09:34:28] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 8.659 second response time https://wikitech.wikimedia.org/wiki/Swift [09:35:04] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.302 second response time https://wikitech.wikimedia.org/wiki/Swift [09:36:10] (03CR) 10Dreamy Jazz: [C:04-1] "I think we can solve this by setting `$wgDefaultUserOptions['checkuser-temporary-accounts-onboarding-dialog-seen'] = true;` for the wikis " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) 
[09:38:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P81078 and previous config saved to /var/cache/conftool/dbconfig/20250812-093803-ladsgroup.json [09:38:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [09:39:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P81079 and previous config saved to /var/cache/conftool/dbconfig/20250812-093951-fceratto.json [09:42:08] (03PS1) 10Btullis: Re-add the decommisioned hadoop worker hosts to excluded_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1177955 (https://phabricator.wikimedia.org/T397160) [09:43:27] (03CR) 10Brouberol: [C:03+1] Re-add the decommisioned hadoop worker hosts to excluded_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1177955 (https://phabricator.wikimedia.org/T397160) (owner: 10Btullis) [09:43:36] (03CR) 10Btullis: [C:03+2] Re-add the decommisioned hadoop worker hosts to excluded_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1177955 (https://phabricator.wikimedia.org/T397160) (owner: 10Btullis) [09:44:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [09:47:59] uhh... `The connection to the server kubemaster.svc.codfw.wmnet:6443 was refused - did you specify the right host or port`, that doesn't seem right [09:49:03] Huh [09:49:16] urbanecm: scap saying that? 
[09:49:23] correct [09:49:32] weirdly enough, it did not abort the deployment [09:49:35] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173924|Add CommunityConfigurationExample to extension-list (T372049)]] (duration: 43m 21s) [09:49:39] T372049: Enable CommunityConfiguration Example in one beta wiki - https://phabricator.wikimedia.org/T372049 [09:49:53] (03PS1) 10Hashar: admin: allow systemctl status for MediaWiki train [puppet] - 10https://gerrit.wikimedia.org/r/1177958 [09:50:39] claime: https://phabricator.wikimedia.org/P81080 are the logs [09:51:30] let me know if i should re-deploy my patch or do something else [09:51:40] (03PS6) 10STran: Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) [09:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:53:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P81081 and previous config saved to /var/cache/conftool/dbconfig/20250812-095310-ladsgroup.json [09:54:15] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11077334 (10MoritzMuehlenhoff) [09:54:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P81082 and previous config saved to /var/cache/conftool/dbconfig/20250812-095458-fceratto.json [09:55:30] !log systemctl start pretrain # T396375 [09:55:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:34] T396375: 1.45.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T396375 [09:55:40] Aborting: git is not clean: /srv/patches [09:55:41] of course [09:55:52] (03CR) 10STran: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [09:56:04] urbanecm: looks like it may just be the progress lookup failing [09:56:05] hashar: do note my scap got very confused midway [09:56:10] so i have no idea if my patch is cleanly applied [09:56:22] yeah I am not entirely sure what is going on [09:56:23] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-fe[2017-2020].codfw.wmnet [09:56:23] It should be, checking deployments [09:56:26] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-fe[2017-2020].codfw.wmnet [09:56:34] cause `/srv/patches` is clean as far as I can tell [09:57:09] I am missing the time when we did a scp to a NFS share and had instant deployment/outage :b [09:57:11] urbanecm: no diff in codfw deployments [09:57:16] So I think it's good [09:57:21] (03PS1) 10Muehlenhoff: Add maps201[1-4] to site.pp and preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1177959 (https://phabricator.wikimedia.org/T400637) [09:57:23] okay, sounds good then. ty for checking! 
[09:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:58:01] urbanecm: I think what may have happened is the codfw kube apiserver restarted during scap [09:58:14] I'll check the wikikube-ctrl nodes [09:59:02] yeah that's exactly that [09:59:03] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1000) [10:01:20] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [10:01:33] !log systemctl start pretrain # T396375 [10:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:38] T396375: 1.45.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T396375 [10:01:46] that one happens over night usually [10:02:11] that is to rebuild all images from scratch which takes a while [10:02:28] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:04:43] urbanecm: certificates for the apiservers got updated, so they restarted [10:05:32] I think all the calls that failed failed at the same time because they were hitting the same apiserver, then it actually hit one that was ok [10:06:29] (03CR) 10Muehlenhoff: [C:03+2] Add maps201[1-4] to site.pp and preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1177959 (https://phabricator.wikimedia.org/T400637) (owner: 10Muehlenhoff) [10:07:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [10:08:03] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1017.eqiad.wmnet [10:08:04] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1017.eqiad.wmnet [10:08:05] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1017.eqiad.wmnet [10:08:05] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1017.eqiad.wmnet [10:08:06] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1018.eqiad.wmnet [10:08:07] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1018.eqiad.wmnet [10:08:07] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1018.eqiad.wmnet [10:08:08] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1018.eqiad.wmnet [10:08:09] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1019.eqiad.wmnet [10:08:09] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1019.eqiad.wmnet [10:08:10] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1019.eqiad.wmnet [10:08:11] !log mvernon@cumin2002 conftool action : set/pooled=yes; 
selector: service=nginx,name=ms-fe1019.eqiad.wmnet [10:08:12] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1020.eqiad.wmnet [10:08:12] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1020.eqiad.wmnet [10:08:13] !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1020.eqiad.wmnet [10:08:14] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1020.eqiad.wmnet [10:08:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T400854)', diff saved to https://phabricator.wikimedia.org/P81083 and previous config saved to /var/cache/conftool/dbconfig/20250812-100817-ladsgroup.json [10:08:31] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [10:08:33] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2166.codfw.wmnet with reason: Maintenance [10:08:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T400854)', diff saved to https://phabricator.wikimedia.org/P81084 and previous config saved to /var/cache/conftool/dbconfig/20250812-100840-ladsgroup.json [10:10:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T391056)', diff saved to https://phabricator.wikimedia.org/P81085 and previous config saved to /var/cache/conftool/dbconfig/20250812-101006-fceratto.json [10:10:11] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [10:10:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2170.codfw.wmnet with reason: Maintenance [10:10:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T391056)', diff saved to https://phabricator.wikimedia.org/P81086 and previous config saved to 
/var/cache/conftool/dbconfig/20250812-101029-fceratto.json [10:11:15] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 12m 12s) [10:11:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T400854)', diff saved to https://phabricator.wikimedia.org/P81087 and previous config saved to /var/cache/conftool/dbconfig/20250812-101123-ladsgroup.json [10:12:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T391056)', diff saved to https://phabricator.wikimedia.org/P81088 and previous config saved to /var/cache/conftool/dbconfig/20250812-101238-fceratto.json [10:15:31] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1177365 (owner: 10L10n-bot) [10:18:54] !log systemctl start train-presync # T396375 [10:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:58] T396375: 1.45.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T396375 [10:19:40] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177965 (https://phabricator.wikimedia.org/T396375) [10:19:42] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177965 (https://phabricator.wikimedia.org/T396375) (owner: 10TrainBranchBot) [10:20:39] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177965 (https://phabricator.wikimedia.org/T396375) (owner: 10TrainBranchBot) [10:21:15] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.14 refs T396375 [10:26:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P81089 and previous config saved to 
/var/cache/conftool/dbconfig/20250812-102631-ladsgroup.json [10:27:42] that is train-presync that rebuilds all images from scratch :b [10:27:46] I got confused earlier [10:27:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P81090 and previous config saved to /var/cache/conftool/dbconfig/20250812-102746-fceratto.json [10:27:54] I am going to have lunch while it is going on [10:28:51] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11077422 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None >>! In T400637#11075140, @wiki_willy wrote: > Hi @MoritzMuehlenhoff - are you able to help confirm... [10:29:10] (03PS1) 10Hnowlan: rest-gateway: use simplified list of rest.php APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177966 (https://phabricator.wikimedia.org/T400132) [10:36:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11077435 (10MoritzMuehlenhoff) [10:36:38] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [10:36:39] (03CR) 10Hnowlan: [C:03+1] "lgtm, one nit/hmm" [puppet] - 10https://gerrit.wikimedia.org/r/1177951 (https://phabricator.wikimedia.org/T401647) (owner: 10Clément Goubert) [10:37:33] (03PS2) 10Clément Goubert: deployment_server: Prune old images every 3 days [puppet] - 10https://gerrit.wikimedia.org/r/1177951 (https://phabricator.wikimedia.org/T401647) [10:37:35] (03CR) 10Clément Goubert: deployment_server: Prune old images every 3 days (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177951 (https://phabricator.wikimedia.org/T401647) (owner: 10Clément 
Goubert) [10:40:32] (03CR) 10Clément Goubert: [C:03+2] deployment_server: Prune old images every 3 days [puppet] - 10https://gerrit.wikimedia.org/r/1177951 (https://phabricator.wikimedia.org/T401647) (owner: 10Clément Goubert) [10:41:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P81091 and previous config saved to /var/cache/conftool/dbconfig/20250812-104138-ladsgroup.json [10:42:17] 06SRE, 06serviceops, 13Patch-For-Review: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11077453 (10Clement_Goubert) 05Open→03Resolved Old images will now be pruned every 3 days, and disk space is at manageable levels [10:42:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P81092 and previous config saved to /var/cache/conftool/dbconfig/20250812-104254-fceratto.json [10:46:43] (03PS1) 10Muehlenhoff: Add maps101[1-4] to site.pp and preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1177972 (https://phabricator.wikimedia.org/T400638) [10:46:59] (03CR) 10CI reject: [V:04-1] Add maps101[1-4] to site.pp and preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1177972 (https://phabricator.wikimedia.org/T400638) (owner: 10Muehlenhoff) [10:49:17] (03PS2) 10Muehlenhoff: Add maps101[1-4] to site.pp and preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1177972 (https://phabricator.wikimedia.org/T400638) [10:52:23] (03CR) 10Muehlenhoff: [C:03+2] Add maps101[1-4] to site.pp and preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1177972 (https://phabricator.wikimedia.org/T400638) (owner: 10Muehlenhoff) [10:55:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11077507 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None >>! 
In T400638#11075133, @wiki_willy wrote: > Hi @MoritzMuehlenhoff - are you able to confirm the... [10:56:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T400854)', diff saved to https://phabricator.wikimedia.org/P81093 and previous config saved to /var/cache/conftool/dbconfig/20250812-105646-ladsgroup.json [10:56:50] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2167.codfw.wmnet with reason: Maintenance [10:56:52] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [10:56:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T400854)', diff saved to https://phabricator.wikimedia.org/P81094 and previous config saved to /var/cache/conftool/dbconfig/20250812-105657-ladsgroup.json [10:58:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T391056)', diff saved to https://phabricator.wikimedia.org/P81095 and previous config saved to /var/cache/conftool/dbconfig/20250812-105801-fceratto.json [10:58:06] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [10:58:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2173.codfw.wmnet with reason: Maintenance [10:58:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T391056)', diff saved to https://phabricator.wikimedia.org/P81096 and previous config saved to /var/cache/conftool/dbconfig/20250812-105824-fceratto.json [10:59:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T391056)', diff saved to https://phabricator.wikimedia.org/P81097 and previous config saved to /var/cache/conftool/dbconfig/20250812-105933-fceratto.json [10:59:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T400854)', diff saved to 
https://phabricator.wikimedia.org/P81098 and previous config saved to /var/cache/conftool/dbconfig/20250812-105941-ladsgroup.json [11:01:21] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:04:20] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.14 refs T396375 (duration: 43m 06s) [11:04:25] T396375: 1.45.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T396375 [11:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:05:52] (03PS3) 10Giuseppe Lavagetto: haproxy: create the x-provenance map even when not using etcd [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) [11:06:02] (03PS2) 10Clément Goubert: rest-gateway: use simplified list of rest.php APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177966 (https://phabricator.wikimedia.org/T400132) (owner: 10Hnowlan) [11:06:47] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6557/console" [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [11:08:03] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6558/console" [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) 
[11:09:32] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:10:58] (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] haproxy: create the x-provenance map even when not using etcd [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [11:14:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P81099 and previous config saved to /var/cache/conftool/dbconfig/20250812-111440-fceratto.json [11:14:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P81100 and previous config saved to /var/cache/conftool/dbconfig/20250812-111449-ladsgroup.json [11:15:21] 06SRE, 06SRE Observability: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://phabricator.wikimedia.org/T401671 (10MoritzMuehlenhoff) 03NEW [11:15:24] (03PS2) 10Majavah: P:wmcs::instance: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177404 (https://phabricator.wikimedia.org/T401586) [11:18:56] (03CR) 10Majavah: [C:03+2] P:wmcs::instance: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177404 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah) [11:26:30] (03PS2) 10Urbanecm: [beta] Add wmgUseCommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173925 (https://phabricator.wikimedia.org/T372049) [11:26:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173925 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm) [11:27:40] (03Merged) 10jenkins-bot: [beta] Add 
wmgUseCommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173925 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm) [11:28:31] (03CR) 10Urbanecm: [C:03+2] [beta] cswiki: Enable CommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173926 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm) [11:28:36] (03PS2) 10Urbanecm: [beta] cswiki: Enable CommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173926 (https://phabricator.wikimedia.org/T372049) [11:28:39] (03CR) 10Urbanecm: [beta] cswiki: Enable CommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173926 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm) [11:28:41] (03CR) 10Urbanecm: [C:03+2] [beta] cswiki: Enable CommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173926 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm) [11:29:23] !log installing gnutls security updates on Bookworm [11:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:40] (03Merged) 10jenkins-bot: [beta] cswiki: Enable CommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173926 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm) [11:29:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P81101 and previous config saved to /var/cache/conftool/dbconfig/20250812-112948-fceratto.json [11:29:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P81102 and previous config saved to /var/cache/conftool/dbconfig/20250812-112956-ladsgroup.json [11:44:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T391056)', diff saved to https://phabricator.wikimedia.org/P81103 
and previous config saved to /var/cache/conftool/dbconfig/20250812-114455-fceratto.json [11:45:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2174.codfw.wmnet with reason: Maintenance [11:45:01] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:45:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T400854)', diff saved to https://phabricator.wikimedia.org/P81104 and previous config saved to /var/cache/conftool/dbconfig/20250812-114504-ladsgroup.json [11:45:09] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [11:45:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T391056)', diff saved to https://phabricator.wikimedia.org/P81105 and previous config saved to /var/cache/conftool/dbconfig/20250812-114514-fceratto.json [11:45:15] (03PS3) 10Amire80: Make functionally identical descriptions the same [dumps] - 10https://gerrit.wikimedia.org/r/1149464 [11:45:20] (03CR) 10Ladsgroup: [C:03+2] Make functionally identical descriptions the same [dumps] - 10https://gerrit.wikimedia.org/r/1149464 (owner: 10Amire80) [11:45:20] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2181.codfw.wmnet with reason: Maintenance [11:45:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T400854)', diff saved to https://phabricator.wikimedia.org/P81106 and previous config saved to /var/cache/conftool/dbconfig/20250812-114527-ladsgroup.json [11:45:48] (03Merged) 10jenkins-bot: Make functionally identical descriptions the same [dumps] - 10https://gerrit.wikimedia.org/r/1149464 (owner: 10Amire80) [11:46:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T391056)', diff saved to https://phabricator.wikimedia.org/P81107 and previous config 
saved to /var/cache/conftool/dbconfig/20250812-114623-fceratto.json [11:46:43] (03CR) 10Dreamy Jazz: [C:03+1] Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [11:48:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T400854)', diff saved to https://phabricator.wikimedia.org/P81108 and previous config saved to /var/cache/conftool/dbconfig/20250812-114812-ladsgroup.json [11:50:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [11:55:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [11:59:51] (03PS1) 10Aklapper: Further reduce AVA debug output [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1177976 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1200) [12:00:10] (03CR) 10Aklapper: [V:03+2 C:03+2] Further reduce AVA debug output [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1177976 (owner: 10Aklapper) [12:00:40] (03PS1) 10Muehlenhoff: Record LDAP access for novemlinguae [puppet] - 10https://gerrit.wikimedia.org/r/1177977 (https://phabricator.wikimedia.org/T400176) [12:01:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P81109 and previous config saved to /var/cache/conftool/dbconfig/20250812-120131-fceratto.json [12:03:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P81110 and previous config saved to /var/cache/conftool/dbconfig/20250812-120319-ladsgroup.json [12:03:28] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to nda & logstash for Novem 
Linguae - https://phabricator.wikimedia.org/T400176#11077664 (10MoritzMuehlenhoff) >>! In T400176#11075795, @Novem_Linguae wrote: > Thanks! I just tried to log into a couple of NDA tools such as Superset and Icinga an... [12:04:39] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:05:10] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1044.eqiad.wmnet with OS bookworm [12:05:26] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11077671 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1044.eqiad.wmnet with OS bookworm [12:07:38] (03CR) 10Ladsgroup: [C:03+2] "I wanted to deploy the legacy system but there were many unrelated changes so I avoided it" [dumps] - 10https://gerrit.wikimedia.org/r/1149464 (owner: 10Amire80) [12:07:40] PROBLEM - Disk space on an-worker1120 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 140872 MB (3% inode=99%): /var/lib/hadoop/data/m 167264 MB (4% inode=99%): /var/lib/hadoop/data/d 154477 MB (4% inode=99%): /var/lib/hadoop/data/b 145991 MB (3% inode=99%): /var/lib/hadoop/data/e 155497 MB (4% inode=99%): /var/lib/hadoop/data/g 148175 MB (3% inode=99%): /var/lib/hadoop/data/f 145370 MB (3% inode=99%): /var/lib/hadoop/data [12:07:40] 0 MB (4% inode=99%): /var/lib/hadoop/data/i 160191 MB (4% inode=99%): /var/lib/hadoop/data/j 157319 MB (4% inode=99%): /var/lib/hadoop/data/l 162026 MB (4% inode=99%): /var/lib/hadoop/data/c 169422 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1120&var-datasource=eqiad+prometheus/ops [12:16:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff 
saved to https://phabricator.wikimedia.org/P81111 and previous config saved to /var/cache/conftool/dbconfig/20250812-121638-fceratto.json [12:18:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P81112 and previous config saved to /var/cache/conftool/dbconfig/20250812-121827-ladsgroup.json [12:19:56] zaway [12:25:43] 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11077788 (10Peachey88) [12:27:07] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175152 (owner: 10PipelineBot) [12:29:23] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175152 (owner: 10PipelineBot) [12:30:37] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-worker[1096-1099].eqiad.wmnet [12:30:45] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:31:08] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:31:23] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [12:31:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T391056)', diff saved to https://phabricator.wikimedia.org/P81113 and previous config saved to /var/cache/conftool/dbconfig/20250812-123145-fceratto.json [12:31:50] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [12:31:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2176.codfw.wmnet with reason: Maintenance [12:31:52] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [12:31:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T391056)', diff saved to 
https://phabricator.wikimedia.org/P81114 and previous config saved to /var/cache/conftool/dbconfig/20250812-123157-fceratto.json [12:33:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T391056)', diff saved to https://phabricator.wikimedia.org/P81115 and previous config saved to /var/cache/conftool/dbconfig/20250812-123306-fceratto.json [12:33:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T400854)', diff saved to https://phabricator.wikimedia.org/P81116 and previous config saved to /var/cache/conftool/dbconfig/20250812-123334-ladsgroup.json [12:33:39] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [12:33:50] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2195.codfw.wmnet with reason: Maintenance [12:33:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T400854)', diff saved to https://phabricator.wikimedia.org/P81117 and previous config saved to /var/cache/conftool/dbconfig/20250812-123357-ladsgroup.json [12:34:03] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [12:34:30] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [12:34:41] btullis@cumin1003 decommission (PID 1700582) is awaiting input [12:35:15] (03CR) 10Btullis: [C:03+1] Release new version of the external-services-networkpolicy base template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177947 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol) [12:35:20] (03CR) 10Btullis: [C:03+1] modules/external-services: support custom annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177948 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol) [12:35:25] (03CR) 10Btullis: [C:03+1] datahub: update vendored modules [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1177949 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol) [12:35:31] (03CR) 10Btullis: [C:03+1] datahub: deploy the external services networkpolicies before the setup jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177950 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol) [12:36:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T400854)', diff saved to https://phabricator.wikimedia.org/P81118 and previous config saved to /var/cache/conftool/dbconfig/20250812-123633-ladsgroup.json [12:41:46] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177987 [12:42:24] (03PS3) 10Anzx: madwikisource: set metanamespace, sitename and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) [12:42:28] (03PS2) 10Anzx: zghwiktionary: set sitename, timezone & metanamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) [12:42:30] (03PS2) 10Anzx: tlwikisource: set timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177547 (https://phabricator.wikimedia.org/T388654) [12:42:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177547 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx) [12:42:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx) [12:43:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx) [12:43:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx) [12:44:50] (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177370 (owner: 10PipelineBot) [12:45:26] btullis@cumin1003 decommission (PID 1700582) is awaiting input [12:47:40] PROBLEM - Disk space on an-worker1120 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 150679 MB (4% inode=99%): /var/lib/hadoop/data/m 172247 MB (4% inode=99%): /var/lib/hadoop/data/d 160926 MB (4% inode=99%): /var/lib/hadoop/data/b 156245 MB (4% inode=99%): /var/lib/hadoop/data/e 160367 MB (4% inode=99%): /var/lib/hadoop/data/g 151595 MB (4% inode=99%): /var/lib/hadoop/data/f 150022 MB (3% inode=99%): /var/lib/hadoop/data [12:47:40] 7 MB (4% inode=99%): /var/lib/hadoop/data/i 166116 MB (4% inode=99%): /var/lib/hadoop/data/j 163782 MB (4% inode=99%): /var/lib/hadoop/data/l 171464 MB (4% inode=99%): /var/lib/hadoop/data/c 173358 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1120&var-datasource=eqiad+prometheus/ops [12:47:57] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177987 (owner: 10PipelineBot) [12:48:06] (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177477 (owner: 10PipelineBot) [12:48:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to 
https://phabricator.wikimedia.org/P81119 and previous config saved to /var/cache/conftool/dbconfig/20250812-124814-fceratto.json [12:48:25] (03CR) 10Brouberol: [C:03+2] Release new version of the external-services-networkpolicy base template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177947 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol) [12:48:28] (03CR) 10Brouberol: [C:03+2] modules/external-services: support custom annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177948 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol) [12:48:31] (03CR) 10Brouberol: [C:03+2] datahub: update vendored modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177949 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol) [12:48:33] (03CR) 10Brouberol: [C:03+2] datahub: deploy the external services networkpolicies before the setup jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177950 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol) [12:49:36] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177987 (owner: 10PipelineBot) [12:50:23] (03Merged) 10jenkins-bot: Release new version of the external-services-networkpolicy base template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177947 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol) [12:50:24] (03Merged) 10jenkins-bot: modules/external-services: support custom annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177948 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol) [12:50:49] (03Merged) 10jenkins-bot: datahub: update vendored modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177949 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol) [12:50:51] (03Merged) 10jenkins-bot: datahub: deploy the external services networkpolicies before the setup jobs [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1177950 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol) [12:50:53] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:50:55] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough [12:51:22] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:51:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P81120 and previous config saved to /var/cache/conftool/dbconfig/20250812-125140-ladsgroup.json [12:52:33] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox [12:52:40] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:53:30] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:53:32] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [12:53:35] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:54:00] (03PS2) 10Anzx: minwikibooks: set sitename, metanamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177989 (https://phabricator.wikimedia.org/T395499) [12:54:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177989 (https://phabricator.wikimedia.org/T395499) (owner: 10Anzx) [12:54:20] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:54:22] (03PS3) 10Giuseppe Lavagetto: varnish: also convert abuse networks to use x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621) [12:54:45] (03CR) 10Xcollazo: "Never mind, this patch is already merged, 
and I can browse to the ZIM files from the index page. Seems like this was a duplicate link." [puppet] - 10https://gerrit.wikimedia.org/r/1039246 (owner: 10Milimetric) [12:57:10] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker[1096-1099].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [12:57:34] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker[1096-1099].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [12:57:34] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:57:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-worker[1096-1099].eqiad.wmnet [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1300). [13:00:05] Tran and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:15] 👋 [13:00:15] o/ [13:01:23] I'll do Tran's [13:03:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P81121 and previous config saved to /var/cache/conftool/dbconfig/20250812-130321-fceratto.json [13:03:38] I'll start now [13:03:48] \o [13:04:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [13:05:10] (03Merged) 10jenkins-bot: Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [13:05:10] o/ [13:05:23] (03CR) 10Tiziano Fogli: [C:03+2] Record LDAP access for novemlinguae [puppet] - 10https://gerrit.wikimedia.org/r/1177977 (https://phabricator.wikimedia.org/T400176) (owner: 10Muehlenhoff) [13:05:30] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough and A:wikidough [13:05:36] !log tchanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1175113|Enable temporary accounts for special/non-standard/private wikis (T400672)]] [13:05:39] T400672: Deploy temporary accounts to special/non-standard/private wikis - https://phabricator.wikimedia.org/T400672 [13:06:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P81122 and previous config saved to /var/cache/conftool/dbconfig/20250812-130648-ladsgroup.json [13:07:41] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] tlwikisource: set timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177547 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx) [13:09:55] (03CR) 10Lucas Werkmeister (WMDE): zghwiktionary: set sitename, timezone & metanamespace 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx) [13:10:22] !log tchanders@deploy1003 stran, tchanders: Backport for [[gerrit:1175113|Enable temporary accounts for special/non-standard/private wikis (T400672)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:11:08] (03CR) 10Anzx: zghwiktionary: set sitename, timezone & metanamespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx) [13:11:18] (03CR) 10Lucas Werkmeister (WMDE): madwikisource: set metanamespace, sitename and timezone (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx) [13:13:00] (03CR) 10Tiziano Fogli: [C:03+1] monitoring: Drop use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177401 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah) [13:13:15] (03CR) 10Lucas Werkmeister (WMDE): zghwiktionary: set sitename, timezone & metanamespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx) [13:13:38] (03CR) 10Majavah: [C:03+2] monitoring: Drop use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177401 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah) [13:14:08] (03PS1) 10Stevemunene: Delegate Kubernetes POD IP reverse ranges for dse-k8s-codfw cluster This patch updates the current delegations based on data from the puppet repo hieradata/common/kubernetes.yaml file for the codfw cluster. 
[dns] - 10https://gerrit.wikimedia.org/r/1177990 (https://phabricator.wikimedia.org/T400037) [13:14:17] (03CR) 10Lucas Werkmeister (WMDE): madwikisource: set metanamespace, sitename and timezone (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx) [13:14:52] (03CR) 10Vgutierrez: [C:03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1171994 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [13:14:53] (03PS2) 10Stevemunene: Delegate Kubernetes POD IP reverse ranges for dse-k8s-codfw cluster This patch updates the current delegations based on data from the puppet repo hieradata/common/kubernetes.yaml file for the codfw cluster. [dns] - 10https://gerrit.wikimedia.org/r/1177990 (https://phabricator.wikimedia.org/T400037) [13:15:25] (03PS3) 10Stevemunene: Delegate Kubernetes POD IP reverse ranges for dse-k8s-codfw cluster [dns] - 10https://gerrit.wikimedia.org/r/1177990 (https://phabricator.wikimedia.org/T400037) [13:15:44] ...testing... 
[13:16:32] looks good, continuing [13:16:36] !log tchanders@deploy1003 stran, tchanders: Continuing with sync [13:16:38] (03PS4) 10Stevemunene: Delegate Kubernetes POD IP reverse ranges for dse-k8s-codfw cluster [dns] - 10https://gerrit.wikimedia.org/r/1177990 (https://phabricator.wikimedia.org/T400037) [13:18:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T391056)', diff saved to https://phabricator.wikimedia.org/P81123 and previous config saved to /var/cache/conftool/dbconfig/20250812-131829-fceratto.json [13:18:34] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [13:18:45] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2188.codfw.wmnet with reason: Maintenance [13:18:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T391056)', diff saved to https://phabricator.wikimedia.org/P81124 and previous config saved to /var/cache/conftool/dbconfig/20250812-131851-fceratto.json [13:20:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T391056)', diff saved to https://phabricator.wikimedia.org/P81125 and previous config saved to /var/cache/conftool/dbconfig/20250812-132001-fceratto.json [13:20:22] (03CR) 10Lucas Werkmeister (WMDE): zghwiktionary: set sitename, timezone & metanamespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx) [13:20:25] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] zghwiktionary: set sitename, timezone & metanamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx) [13:21:05] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11078018 (10VRiley-WMF) cloudcephosd1044 seems to time out during the install. 
Checked to make sure it was booting from the first disk. I will be looking into the connections very soon. [13:21:33] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] minwikibooks: set sitename, metanamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177989 (https://phabricator.wikimedia.org/T395499) (owner: 10Anzx) [13:21:51] (03CR) 10Tiziano Fogli: [C:03+1] centrallog: Add sampling rules for debug logging [puppet] - 10https://gerrit.wikimedia.org/r/1173442 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [13:21:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T400854)', diff saved to https://phabricator.wikimedia.org/P81126 and previous config saved to /var/cache/conftool/dbconfig/20250812-132155-ladsgroup.json [13:22:00] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [13:22:11] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2198.codfw.wmnet with reason: Maintenance [13:23:23] (03PS1) 10Hashar: gerrit: prevent crawling authenticated URLs [puppet] - 10https://gerrit.wikimedia.org/r/1177992 (https://phabricator.wikimedia.org/T392669) [13:23:45] !log tchanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175113|Enable temporary accounts for special/non-standard/private wikis (T400672)]] (duration: 18m 09s) [13:23:49] T400672: Deploy temporary accounts to special/non-standard/private wikis - https://phabricator.wikimedia.org/T400672 [13:24:33] (03PS1) 10Btullis: Add the new an-backup-namenode hosts to puppet [puppet] - 10https://gerrit.wikimedia.org/r/1177993 (https://phabricator.wikimedia.org/T397175) [13:25:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.479s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:25:52] I'm finished - next deployer please feel free to go ahead [13:26:03] (03CR) 10Lucas Werkmeister (WMDE): tlwikisource: add author ( Manunulat ) namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx) [13:26:09] ok! [13:26:18] I can deploy some of anzx’ patches :) [13:27:02] (03CR) 10Anzx: tlwikisource: add author ( Manunulat ) namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx) [13:28:06] (03CR) 10Lucas Werkmeister (WMDE): madwikisource: set metanamespace, sitename and timezone (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx) [13:30:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.984s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:30:56] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] tlwikisource: add author ( Manunulat ) namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx) [13:31:17] ok, I think four changes are ready to deploy, one has an open question [13:31:27] and I think they can all go together, should be safe enough [13:32:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap 
backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177547 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx) [13:32:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx) [13:32:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177989 (https://phabricator.wikimedia.org/T395499) (owner: 10Anzx) [13:32:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx) [13:34:10] (03CR) 10Anzx: madwikisource: set metanamespace, sitename and timezone (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx) [13:34:53] (03CR) 10SBassett: [C:03+1] prometheus: add additional metrics from logs [puppet] - 10https://gerrit.wikimedia.org/r/1175569 (owner: 10Cwhite) [13:34:54] (03Merged) 10jenkins-bot: tlwikisource: set timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177547 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx) [13:35:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P81127 and previous config saved to /var/cache/conftool/dbconfig/20250812-133508-fceratto.json [13:35:26] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1044.eqiad.wmnet with OS bookworm [13:35:35] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11078067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 
for host cloudcephosd1044.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104... [13:35:43] (03Merged) 10jenkins-bot: zghwiktionary: set sitename, timezone & metanamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx) [13:35:49] (03Merged) 10jenkins-bot: minwikibooks: set sitename, metanamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177989 (https://phabricator.wikimedia.org/T395499) (owner: 10Anzx) [13:36:05] (03Merged) 10jenkins-bot: tlwikisource: add author ( Manunulat ) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx) [13:36:28] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1177547|tlwikisource: set timezone (T388654)]], [[gerrit:1177932|zghwiktionary: set sitename, timezone & metanamespace (T399785)]], [[gerrit:1177989|minwikibooks: set sitename, metanamespace and timezone (T395499)]], [[gerrit:1176509|tlwikisource: add author ( Manunulat ) namespace (T388654)]] [13:36:36] T388654: Post-creation work for tlwikisource - https://phabricator.wikimedia.org/T388654 [13:36:36] T399785: Post-creation work for zghwiktionary - https://phabricator.wikimedia.org/T399785 [13:36:37] T395499: Post-creation work for minwikibooks - https://phabricator.wikimedia.org/T395499 [13:37:05] (03PS1) 10Majavah: P:docker: Add trixie as a known base image [puppet] - 10https://gerrit.wikimedia.org/r/1177995 [13:38:32] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, anzx: Backport for [[gerrit:1177547|tlwikisource: set timezone (T388654)]], [[gerrit:1177932|zghwiktionary: set sitename, timezone & metanamespace (T399785)]], [[gerrit:1177989|minwikibooks: set sitename, metanamespace and timezone (T395499)]], [[gerrit:1176509|tlwikisource: add author ( Manunulat ) namespace (T388654)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:38:41] anzx: please test :) [13:39:21] checking [13:39:37] hm, I think I see a problem on zgh.wiktionary.org [13:39:55] previously: ⴰⵎⵙⴳⴷⴰⵍ ⵏ Wiktionary, alias Wiktionary talk [13:40:03] now: ⴰⵎⵙⴳⴷⴰⵍ ⵏ ⵡⵉⴽⵉⵎⴰⵡⴰⵍ, alias ⵡⵉⴽⵉⵎⴰⵡⴰⵍ talk [13:40:27] (actually, scratch the alias part, there’s still a “Wiktionary talk” alias in place) [13:40:39] but we might want a new alias for ⴰⵎⵙⴳⴷⴰⵍ ⵏ Wiktionary? [13:40:44] (could be done in a follow-up change) [13:40:55] (03CR) 10Hashar: "> Do we also want to rename that user in the database later?" [puppet] - 10https://gerrit.wikimedia.org/r/1175122 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [13:41:36] (03PS1) 10Majavah: Add python-trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1177998 [13:42:31] (03CR) 10Hashar: "Google search console reports them with:" [puppet] - 10https://gerrit.wikimedia.org/r/1177992 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar) [13:42:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1021:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [13:43:25] (03CR) 10Majavah: Add python-trixie (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1177998 (owner: 10Majavah) [13:44:06] and something similar on min.wikibooks.org as well, if I’m not mistaken [13:44:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1053.eqiad.wmnet [13:45:24] with the change in effect, “Rundiang Wikibooks” and “Pembicaraan Wikibooks” no longer exist as namespaces or aliases, so new aliases are probably a good idea? 
[13:47:45] Lucas_WMDE: i will create follow up for adding new aliases, I think this happens for talkpage namespace where project names comes in end [13:48:23] sounds good [13:48:35] do you want to test anything else or should we go ahead with these changes for now? [13:49:10] Lucas_WMDE: checked others, good to sync [13:49:28] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, anzx: Continuing with sync [13:49:30] alright, thanks! [13:49:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1053.eqiad.wmnet [13:50:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P81128 and previous config saved to /var/cache/conftool/dbconfig/20250812-135016-fceratto.json [13:50:46] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox [13:51:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1054.eqiad.wmnet [13:51:57] !log sukhe@dns1004 START - running authdns-update [13:52:36] (03PS2) 10Btullis: Add the new an-backup-namenode hosts to puppet [puppet] - 10https://gerrit.wikimedia.org/r/1177993 (https://phabricator.wikimedia.org/T397175) [13:52:51] !log sukhe@dns1004 END - running authdns-update [13:53:36] (03PS1) 10Stevemunene: zookeeper: Remove an-druid100[1-2] from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1177999 (https://phabricator.wikimedia.org/T400330) [13:53:51] anzx: also, I’m not seeing a new patch set in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1177936 yet, did you mean to upload one? 
[13:53:57] (or maybe I misunderstood your comment) [13:54:49] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1177547|tlwikisource: set timezone (T388654)]], [[gerrit:1177932|zghwiktionary: set sitename, timezone & metanamespace (T399785)]], [[gerrit:1177989|minwikibooks: set sitename, metanamespace and timezone (T395499)]], [[gerrit:1176509|tlwikisource: add author ( Manunulat ) namespace (T388654)]] (duration: 18m 21s) [13:54:51] (03PS4) 10Anzx: madwikisource: set metanamespace, sitename and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) [13:54:56] T388654: Post-creation work for tlwikisource - https://phabricator.wikimedia.org/T388654 [13:54:56] T399785: Post-creation work for zghwiktionary - https://phabricator.wikimedia.org/T399785 [13:54:57] T395499: Post-creation work for minwikibooks - https://phabricator.wikimedia.org/T395499 [13:54:57] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401651#11078134 (10phaultfinder) [13:55:19] Lucas_WMDE: published edit [13:55:41] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance [13:55:43] ah, now it’s there :) [13:55:59] (03CR) 10Btullis: [C:03+2] Add the new an-backup-namenode hosts to puppet [puppet] - 10https://gerrit.wikimedia.org/r/1177993 (https://phabricator.wikimedia.org/T397175) (owner: 10Btullis) [13:56:13] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2229.codfw.wmnet with reason: Maintenance [13:56:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1054.eqiad.wmnet [13:56:36] jouncebot: nowandnext [13:56:36] For the next 0 hour(s) and 3 minute(s): UTC afternoon backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1300) [13:56:36] In 0 hour(s) and 33 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1430) [13:56:44] (03CR) 10Jelto: [C:03+1] gerrit: prevent crawling authenticated URLs [puppet] - 10https://gerrit.wikimedia.org/r/1177992 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar) [13:56:52] (03CR) 10Andrew Bogott: [C:03+2] "I'm not sure, although certainly we don't get the same complaint about restarting on Bookworm." [puppet] - 10https://gerrit.wikimedia.org/r/1177450 (https://phabricator.wikimedia.org/T401584) (owner: 10Andrew Bogott) [13:57:12] (03CR) 10Jelto: [C:03+2] gerrit: prevent crawling authenticated URLs [puppet] - 10https://gerrit.wikimedia.org/r/1177992 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar) [13:57:56] I don’t think we have enough time for the madwikisource change tbh [13:58:07] but let me see which maintenance scripts should be run on the wikis that got deployments [13:58:23] Lucas_WMDE: i can schedule all others to next window [13:58:25] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1222.eqiad.wmnet with reason: Maintenance [13:58:28] sounds good, thanks! [13:58:31] namespaceDupes I think [13:59:25] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2204.codfw.wmnet with reason: Maintenance [13:59:44] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: namespaceDupes zghwiktionary --fix # T399785 [14:00:09] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[14:00:13] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1021:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [14:01:11] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: namespaceDupes tlwikisource --fix # T388654 [14:01:15] T388654: Post-creation work for tlwikisource - https://phabricator.wikimedia.org/T388654 [14:01:52] !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: namespaceDupes minwikibooks --fix # T395499 [14:01:56] T395499: Post-creation work for minwikibooks - https://phabricator.wikimedia.org/T395499 [14:03:11] (03PS2) 10Anzx: minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) [14:03:19] (03PS1) 10Fabfur: profile,prometheus,haproxykafka: support for rdkafka metrics [puppet] - 10https://gerrit.wikimedia.org/r/1178001 (https://phabricator.wikimedia.org/T400978) [14:03:47] did some cleanupTitles dry-runs as well just in case, and they all say nothing to update either [14:03:50] (03CR) 10CI reject: [V:04-1] minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx) [14:03:54] (03PS2) 10Andrew Bogott: apt: Replace use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah) [14:03:55] !log UTC afternoon backport+config window done [14:03:56] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah) [14:03:58] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:04] (03PS3) 10Anzx: minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) [14:04:17] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1100 to an-backup-namenode1001 [14:04:28] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] madwikisource: set metanamespace, sitename and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx) [14:04:37] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [14:04:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti1053 / ganeti1054 to the production cluster - https://phabricator.wikimedia.org/T401691 (10MoritzMuehlenhoff) 03NEW [14:05:04] Lucas_WMDE: thanks for deploying, created follow up for namespace aliases https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1178000/ will schedule it for next window [14:05:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T391056)', diff saved to https://phabricator.wikimedia.org/P81129 and previous config saved to /var/cache/conftool/dbconfig/20250812-140523-fceratto.json [14:05:28] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [14:05:31] great, thank you! 
[14:05:39] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2202.codfw.wmnet with reason: Maintenance [14:05:56] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2203.codfw.wmnet with reason: Maintenance [14:06:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T391056)', diff saved to https://phabricator.wikimedia.org/P81130 and previous config saved to /var/cache/conftool/dbconfig/20250812-140603-fceratto.json [14:07:01] (03CR) 10Btullis: zookeeper: Remove an-druid100[1-2] from the cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177999 (https://phabricator.wikimedia.org/T400330) (owner: 10Stevemunene) [14:07:14] (03PS3) 10Andrew Bogott: apt: Replace use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah) [14:07:38] (03PS1) 10Muehlenhoff: Assign Ganeti role to ganeti1053/ganeti1054 [puppet] - 10https://gerrit.wikimedia.org/r/1178002 (https://phabricator.wikimedia.org/T401691) [14:07:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:08:01] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1100 to an-backup-namenode1001 - btullis@cumin1003" [14:08:12] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah) [14:08:28] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Renaming an-worker1100 to an-backup-namenode1001 - btullis@cumin1003" [14:08:28] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:08:28] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-namenode1001 on all recursors [14:08:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-namenode1001 on all recursors [14:08:32] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-namenode1001 [14:09:24] !incidents [14:09:26] 6567 (ACKED) kafka-jumbo1009/Kafka Broker Server (paged) [14:09:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-namenode1001 [14:10:12] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1100 to an-backup-namenode1001 [14:10:24] I assume expired? [14:10:47] yeah, we ran into this yesterday as well and it seems the host is still up [14:10:50] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1101 to an-backup-namenode1002 [14:11:09] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [14:11:19] I can't log into the victorops portal though, only getting the spinning cycle, trying from a private browser tab [14:11:33] yeah it's two alerts from ~24h ago that re-upped [14:11:42] acked them via the app [14:12:12] just silenced them via the web as well [14:12:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:12:57] brouberol: what's the time frame for decomming kafka-jumbo100[89]? we've silenced the alert for 24 hours, or is a longer period needed? 
[14:13:07] I was writing the same question :) [14:14:38] (03CR) 10Muehlenhoff: [C:03+2] Assign Ganeti role to ganeti1053/ganeti1054 [puppet] - 10https://gerrit.wikimedia.org/r/1178002 (https://phabricator.wikimedia.org/T401691) (owner: 10Muehlenhoff) [14:14:50] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1101 to an-backup-namenode1002 - btullis@cumin1003" [14:15:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1101 to an-backup-namenode1002 - btullis@cumin1003" [14:15:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:15:19] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-namenode1002 on all recursors [14:15:22] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-namenode1002 on all recursors [14:15:22] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-namenode1002 [14:16:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-namenode1002 [14:17:10] (03PS2) 10Stevemunene: zookeeper: Remove an-druid1001 from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1177999 (https://phabricator.wikimedia.org/T400330) [14:17:10] (03PS1) 10Stevemunene: zookeeper: Remove an-druid1002 from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1178006 (https://phabricator.wikimedia.org/T400330) [14:17:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1101 to an-backup-namenode1002 [14:17:43] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693 (10Andrew) 03NEW [14:22:16] (03CR) 10Gergő Tisza: "Is it worth 
splitting things on API requests vs. web UI requests (something like `prefix: url.keyword: /w/api.php`)?" [puppet] - 10https://gerrit.wikimedia.org/r/1175569 (owner: 10Cwhite) [14:22:48] (03CR) 10JHathaway: apt: Replace use of legacy facts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah) [14:24:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T391056)', diff saved to https://phabricator.wikimedia.org/P81131 and previous config saved to /var/cache/conftool/dbconfig/20250812-142400-fceratto.json [14:24:05] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [14:24:11] (03CR) 10Stevemunene: zookeeper: Remove an-druid1001 from the cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177999 (https://phabricator.wikimedia.org/T400330) (owner: 10Stevemunene) [14:24:12] (03PS4) 10Majavah: apt: Replace use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) [14:24:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T399249)', diff saved to https://phabricator.wikimedia.org/P81132 and previous config saved to /var/cache/conftool/dbconfig/20250812-142413-fceratto.json [14:24:17] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [14:24:32] (03CR) 10Majavah: apt: Replace use of legacy facts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah) [14:25:06] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-namenode1001.eqiad.wmnet with OS bookworm [14:25:47] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6561/console" [puppet] - 
https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: Majavah) [14:28:00] (CR) JHathaway: [C:+1] apt: Replace use of legacy facts [puppet] - https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: Majavah) [14:28:06] (CR) Majavah: [V:+1 C:+2] apt: Replace use of legacy facts [puppet] - https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: Majavah) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1430) [14:34:37] moritzm: they are all done [14:34:39] SRE, SRE-Access-Requests, Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11078332 (tappof) [14:34:41] *gone [14:34:48] may they rest in peace [14:34:53] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1236.eqiad.wmnet with reason: Maintenance [14:35:35] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2220.codfw.wmnet with reason: Maintenance [14:36:38] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [14:37:16] (PS1) Dbrant: Add app_activity_tab event stream. [mediawiki-config] - https://gerrit.wikimedia.org/r/1178007 (https://phabricator.wikimedia.org/T399630) [14:37:29] brouberol: hmmh, ok.
mysteriously they re-alerted half an hour ago [14:37:48] (CR) Brouberol: [C:+1] zookeeper: Remove an-druid1001 from the cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1177999 (https://phabricator.wikimedia.org/T400330) (owner: Stevemunene) [14:38:01] (CR) Brouberol: [C:+1] zookeeper: Remove an-druid1002 from the cluster [puppet] - https://gerrit.wikimedia.org/r/1178006 (https://phabricator.wikimedia.org/T400330) (owner: Stevemunene) [14:38:32] SRE, SRE-Access-Requests, Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11078351 (tappof) [14:38:56] moritzm: that's weird. The hosts were fully decommissioned yesterday mid-afternoon [14:39:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P81134 and previous config saved to /var/cache/conftool/dbconfig/20250812-143920-fceratto.json [14:39:22] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178007 (https://phabricator.wikimedia.org/T399630) (owner: Dbrant) [14:39:28] I just saw [14:39:28] > yeah, we ran into this yesterday as well and it seems the host is still up [14:39:32] is it? [14:40:11] SRE, SRE-Access-Requests, Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11078359 (tappof) [14:40:18] brouberol: no, that was just my initial assumption, given that I saw you running the decom cookbook for 1007, but missed the later ones [14:40:25] oh gotcha.
[14:40:36] but the two incidents haven't recovered and are somehow still showing up at https://portal.victorops.com/ui/wikimedia/incidents [14:40:49] SRE, SRE-Access-Requests, Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11078360 (tappof) [14:42:14] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [14:42:38] SRE, SRE-Access-Requests, Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11078371 (tappof) [14:43:24] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [14:44:31] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [14:45:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:45:55] !incidents [14:45:56] 6570 (UNACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:45:57] FIRING: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:03] !ack 6750 [14:46:04] Attempt to ack incident 6750 failed.
[14:46:09] not surprising:( [14:46:12] !ack 6570 [14:46:13] 6570 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:46:13] !ack 6570 [14:46:14] 6570 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:46:17] !incidents [14:46:17] 6570 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:46:18] 6571 (UNACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin) [14:46:20] !ack 6571 [14:46:21] 6571 (ACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin) [14:46:30] FIRING: [2x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 4 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [14:46:37] ouch [14:47:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [14:47:48] (PS1) Brouberol: datahub-next: deploy the external services networkpolicies before the setup jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1178010 (https://phabricator.wikimedia.org/T395126) [14:48:42] (CR) Btullis: [C:+1] datahub-next: deploy the external services networkpolicies before the setup jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1178010 (https://phabricator.wikimedia.org/T395126) (owner: Brouberol) [14:48:44] SRE, SRE-Access-Requests, Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11078386 (tappof) @HSwan-WMF, could you please review and approve @egardner’s request? Thank you.
[14:50:50] (CR) Brouberol: [C:+2] datahub-next: deploy the external services networkpolicies before the setup jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1178010 (https://phabricator.wikimedia.org/T395126) (owner: Brouberol) [14:50:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:50:57] FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:51:13] (CR) Dzahn: [C:+1] ncmonitor: Ignore cricketinfo sites [puppet] - https://gerrit.wikimedia.org/r/1177557 (owner: BCornwall) [14:51:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 6 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [14:51:32] !incidents [14:51:32] 6570 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:51:33] 6571 (ACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin) [14:51:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [14:52:55] PROBLEM - Ensure traffic_server is running for instance backend on cp2030 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:53:00] huh?
[14:53:10] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P81135 and previous config saved to /var/cache/conftool/dbconfig/20250812-145428-fceratto.json [14:54:55] RECOVERY - Ensure traffic_server is running for instance backend on cp2030 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:56:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [14:56:54] (CR) JHathaway: [C:+1] Convert sshd config for trixie and later to an EPP template [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff) [15:00:04] jelto, arnoldokoth, and mutante: May I have your attention please! SRE Collaboration Services office hours.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1500) [15:00:08] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [15:01:21] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:01:45] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye [15:01:56] ops-eqiad, SRE, cloud-services-team, DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11078448 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye [15:02:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [15:02:40] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-namenode1001.eqiad.wmnet with reason: host reimage [15:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:07:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [15:08:10] FIRING: [3x] JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:24] !log brouberol@deploy1003
helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [15:08:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-namenode1001.eqiad.wmnet with reason: host reimage [15:09:32] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T399249)', diff saved to https://phabricator.wikimedia.org/P81136 and previous config saved to /var/cache/conftool/dbconfig/20250812-150935-fceratto.json [15:09:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2216.codfw.wmnet with reason: Maintenance [15:09:40] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:09:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T391056)', diff saved to https://phabricator.wikimedia.org/P81137 and previous config saved to /var/cache/conftool/dbconfig/20250812-150944-fceratto.json [15:09:49] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:09:51] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2199.codfw.wmnet with reason: Maintenance [15:10:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T391056)', diff saved to https://phabricator.wikimedia.org/P81138 and previous config saved to /var/cache/conftool/dbconfig/20250812-151053-fceratto.json [15:14:02] (CR) Ahmon Dancy: [C:+1] admin: allow systemctl status for MediaWiki train [puppet] - https://gerrit.wikimedia.org/r/1177958 (owner: Hashar) [15:14:42] (PS1) Brouberol:
datahub: ensure we keep the network policies after the helm deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1178014 (https://phabricator.wikimedia.org/T395126) [15:14:50] (CR) CI reject: [V:-1] datahub: ensure we keep the network policies after the helm deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1178014 (https://phabricator.wikimedia.org/T395126) (owner: Brouberol) [15:15:36] (PS2) Brouberol: datahub: ensure we keep the network policies after the helm deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1178014 (https://phabricator.wikimedia.org/T395126) [15:16:35] (CR) Clément Goubert: Add python-trixie (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1177998 (owner: Majavah) [15:19:43] (PS2) Majavah: Add python-trixie [docker-images/production-images] - https://gerrit.wikimedia.org/r/1177998 [15:19:59] (CR) Majavah: Add python-trixie (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1177998 (owner: Majavah) [15:20:57] FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:21:08] (CR) David Caro: [C:+1] "LGTM did not test all, just a couple + some manually installed packages and such in the trixie container" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255) (owner: Majavah) [15:21:08] (CR) Clément Goubert: [C:+1] "LGTM" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1177998 (owner: Majavah) [15:23:10] RESOLVED: [3x] JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:15] andrew@cumin2002 reimage (PID 2504848) is awaiting input [15:24:55] !log restart varnish-frontend on cp5026 [15:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:13] (CR) Majavah: [C:+2] Add Trixie images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255) (owner: Majavah) [15:25:48] (Merged) jenkins-bot: Add Trixie images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255) (owner: Majavah) [15:25:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:25:57] RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:26:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P81139 and previous config saved to /var/cache/conftool/dbconfig/20250812-152601-fceratto.json [15:26:30] RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 3 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled -
https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [15:27:06] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-namenode1001.eqiad.wmnet with OS bookworm [15:30:09] (PS1) Majavah: Replace Docker registry URL [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1178015 (https://phabricator.wikimedia.org/T394902) [15:30:30] (CR) Majavah: [C:+2] Replace Docker registry URL [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1178015 (https://phabricator.wikimedia.org/T394902) (owner: Majavah) [15:30:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:31:01] (Merged) jenkins-bot: Replace Docker registry URL [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1178015 (https://phabricator.wikimedia.org/T394902) (owner: Majavah) [15:31:38] !incidents [15:31:39] 6570 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [15:31:39] 6571 (RESOLVED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin) [15:32:45] (CR) BCornwall: [C:+2] ncmonitor: Ignore cricketinfo sites [puppet] - https://gerrit.wikimedia.org/r/1177557 (owner: BCornwall) [15:36:16] (PS2) BCornwall: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1177469 (owner: Ncmonitor) [15:41:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P81140 and previous config saved to /var/cache/conftool/dbconfig/20250812-154109-fceratto.json [15:48:49] (PS1) Majavah: Add mono612-sssd/base [docker-images/toollabs-images] -
https://gerrit.wikimedia.org/r/1178019 (https://phabricator.wikimedia.org/T400255) [15:49:23] !log dancy@deploy1003 Installing scap version "4.199.0" for 2 host(s) [15:51:11] !log dancy@deploy1003 Installation of scap version "4.199.0" completed for 2 hosts [15:51:27] (CR) Majavah: [C:+2] Add mono612-sssd/base [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1178019 (https://phabricator.wikimedia.org/T400255) (owner: Majavah) [15:52:09] (Merged) jenkins-bot: Add mono612-sssd/base [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1178019 (https://phabricator.wikimedia.org/T400255) (owner: Majavah) [15:54:27] (CR) Btullis: [C:+1] zookeeper: Remove an-druid1001 from the cluster [puppet] - https://gerrit.wikimedia.org/r/1177999 (https://phabricator.wikimedia.org/T400330) (owner: Stevemunene) [15:54:34] (CR) Btullis: [C:+1] zookeeper: Remove an-druid1002 from the cluster [puppet] - https://gerrit.wikimedia.org/r/1178006 (https://phabricator.wikimedia.org/T400330) (owner: Stevemunene) [15:56:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T391056)', diff saved to https://phabricator.wikimedia.org/P81141 and previous config saved to /var/cache/conftool/dbconfig/20250812-155616-fceratto.json [15:56:21] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:56:56] (CR) Btullis: [C:+1] datahub: ensure we keep the network policies after the helm deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1178014 (https://phabricator.wikimedia.org/T395126) (owner: Brouberol) [15:57:05] (CR) Brouberol: [C:+2] datahub: ensure we keep the network policies after the helm deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1178014 (https://phabricator.wikimedia.org/T395126) (owner: Brouberol) [15:58:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START
helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [15:58:47] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-namenode1002.eqiad.wmnet with OS bookworm [15:59:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [16:00:05] jhathaway and moritzm: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [16:00:33] RECOVERY - Disk space on an-worker1127 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops [16:01:02] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [16:01:09] (PS1) Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1178022 (https://phabricator.wikimedia.org/T401713) [16:01:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [16:02:12] (CR) Giuseppe Lavagetto: varnish: also convert abuse networks to use x-provenance (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621) (owner: Giuseppe Lavagetto) [16:04:23] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [16:07:45] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:08:07] (PS4) Giuseppe Lavagetto: varnish: also convert abuse networks to use x-provenance [puppet] - https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621) [16:11:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:13:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T401713 [16:13:39] T401713: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T401713 [16:14:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2161 with weight 0 T401713', diff saved to https://phabricator.wikimedia.org/P81142 and previous config saved to /var/cache/conftool/dbconfig/20250812-161402-fceratto.json [16:16:13] RECOVERY - Disk space on an-worker1122 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1122&var-datasource=eqiad+prometheus/ops [16:16:55] (PS3) Giuseppe Lavagetto: Remove blocked-nets from varnish [puppet] - https://gerrit.wikimedia.org/r/1175991 (https://phabricator.wikimedia.org/T396621) [16:18:30] (CR) Dzahn: admin: stop using groups parsoid-roots and parsoid-admin (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: Dzahn) [16:22:00] !log Starting s8 codfw failover from db2165 to db2161 - T401713 [16:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:05] T401713: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T401713 [16:23:06] !log fceratto@cumin1002 dbctl commit
(dc=all): 'Promote db2161 to s8 primary T401713', diff saved to https://phabricator.wikimedia.org/P81143 and previous config saved to /var/cache/conftool/dbconfig/20250812-162306-fceratto.json [16:23:37] (PS1) Anzx: madwikisource: add logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1178023 (https://phabricator.wikimedia.org/T391767) [16:23:53] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178023 (https://phabricator.wikimedia.org/T391767) (owner: Anzx) [16:24:09] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178027 (https://phabricator.wikimedia.org/T399785) (owner: Anzx) [16:24:22] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: Anzx) [16:24:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) (owner: Anzx) [16:24:42] (CR) Cwhite: "We would have to normalize it out to avoid cardinality issues."
[puppet] - https://gerrit.wikimedia.org/r/1175569 (owner: Cwhite) [16:24:46] (PS1) Dzahn: lists: delete unused apache.conf.erb [puppet] - https://gerrit.wikimedia.org/r/1178029 [16:25:22] (CR) Dzahn: [C:+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178029" [puppet] - https://gerrit.wikimedia.org/r/1177455 (https://phabricator.wikimedia.org/T303725) (owner: Dzahn) [16:25:34] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1178029/6563/" [puppet] - https://gerrit.wikimedia.org/r/1178029 (owner: Dzahn) [16:26:09] (CR) Dzahn: [V:+1] "also there are RSA acmechief certs reference in this config, and we know that would now result in syntax errors since we don't use and hav" [puppet] - https://gerrit.wikimedia.org/r/1178029 (owner: Dzahn) [16:26:41] !log dancy@deploy1003 Installing scap version "4.200.0" for 2 host(s) [16:28:09] (CR) Vgutierrez: "please take into account that acme-chief still issues RSA certs for mail related systems. To be accurate, the following certs get both rsa" [puppet] - https://gerrit.wikimedia.org/r/1178029 (owner: Dzahn) [16:28:29] !log dancy@deploy1003 Installation of scap version "4.200.0" completed for 2 hosts [16:28:32] (CR) Dzahn: [V:+1] "the actual template used seems to be modules/profile/templates/lists/apache.conf.epp" [puppet] - https://gerrit.wikimedia.org/r/1178029 (owner: Dzahn) [16:28:59] (PS4) Giuseppe Lavagetto: Remove blocked-nets from varnish [puppet] - https://gerrit.wikimedia.org/r/1175991 (https://phabricator.wikimedia.org/T396621) [16:29:47] (CR) Dzahn: [V:+1] "Ah, thanks for pointing this out! That's good to have in mind. Regardless this template appears to be unused entirely."
[puppet] - https://gerrit.wikimedia.org/r/1178029 (owner: Dzahn) [16:31:29] (CR) Stevemunene: [C:+2] zookeeper: Remove an-druid1001 from the cluster [puppet] - https://gerrit.wikimedia.org/r/1177999 (https://phabricator.wikimedia.org/T400330) (owner: Stevemunene) [16:32:54] (PS1) Dzahn: lists: add NEL headers to apache.conf.epp template [puppet] - https://gerrit.wikimedia.org/r/1178032 (https://phabricator.wikimedia.org/T303725) [16:34:26] (CR) Dzahn: "The actual header lines are always the same, copied straight from the ticket. Review is just that it doesn't result in a syntax error or s" [puppet] - https://gerrit.wikimedia.org/r/1178032 (https://phabricator.wikimedia.org/T303725) (owner: Dzahn) [16:37:13] (CR) Dzahn: [C:+1] "here is the same thing for gerrit that is already deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1175552" [puppet] - https://gerrit.wikimedia.org/r/1178032 (https://phabricator.wikimedia.org/T303725) (owner: Dzahn) [16:37:24] (CR) Dzahn: [V:+1 C:+1] "https://puppet-compiler.wmflabs.org/output/1178032/6566/lists1004.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1178032 (https://phabricator.wikimedia.org/T303725) (owner: Dzahn) [16:38:02] ops-codfw, SRE, DC-Ops, serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11079185 (wiki_willy) Thanks @MoritzMuehlenhoff! [16:38:08] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-namenode1002.eqiad.wmnet with reason: host reimage [16:38:20] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11079186 (wiki_willy) Awesome, thank you!
[16:41:03] SRE, LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11079201 (Novem_Linguae) In progress→Resolved The NDA tools / idp login worked after logging out and logging back in. Thanks for that advice. Most of this ticket is resolve... [16:42:06] (PS5) Giuseppe Lavagetto: varnish: also convert abuse networks to use x-provenance [puppet] - https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621) [16:42:07] (PS5) Giuseppe Lavagetto: Remove blocked-nets from varnish [puppet] - https://gerrit.wikimedia.org/r/1175991 (https://phabricator.wikimedia.org/T396621) [16:42:31] (PS3) Giuseppe Lavagetto: varnish: stop loading netmaps [puppet] - https://gerrit.wikimedia.org/r/1175992 (https://phabricator.wikimedia.org/T396621) [16:43:02] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-namenode1002.eqiad.wmnet with reason: host reimage [16:49:09] (PS1) KartikMistry: Section Translation: Add Arakan Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1178036 (https://phabricator.wikimedia.org/T392490) [16:52:13] (CR) Dzahn: "the actual header values are always the same, as stated on ticket and already deployed on gerrit here https://gerrit.wikimedia.org/r/c/ope" [puppet] - https://gerrit.wikimedia.org/r/1177456 (https://phabricator.wikimedia.org/T303725) (owner: Dzahn) [16:52:20] (PS1) Andrew Bogott: sssd: vary sssd.conf file mode depending on distro [puppet] - https://gerrit.wikimedia.org/r/1178038 (https://phabricator.wikimedia.org/T401584) [16:56:26] ops-eqiad, SRE, DC-Ops: Investigate whether we can add RAM to dumpsdata100[4-7] from any decommissioned hosts - https://phabricator.wikimedia.org/T401299#11079268 (BTullis) >>! In T401299#11073333, @VRiley-WMF wrote: > @BTullis I was looking into the unit dumpstata1004-5 and it looks like basic suppo...
[16:57:36] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): decommission an-worker109[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T401678#11079272 (10BTullis) a:05BTullis→03None [16:58:06] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): decommission an-worker109[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T401678#11079280 (10BTullis) I still have a few puppet references and secrets to delete, but the hardware is ready to be de-racked wh... [17:00:05] swfrench-wmf and urandom: How many deployers does it take to do MediaWiki infrastructure (UTC late) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1700). [17:00:21] o/ [17:00:22] o/ [17:00:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-namenode1002.eqiad.wmnet with OS bookworm [17:01:05] getting things set up, should be 5m to get everything together [17:01:45] (03CR) 10Gergő Tisza: prometheus: add additional metrics from logs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1175569 (owner: 10Cwhite) [17:06:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171702 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [17:07:19] (03PS1) 10Majavah: base: gen_fingerprints: Update sshd path for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1178043 (https://phabricator.wikimedia.org/T393762) [17:07:42] (03Merged) 10jenkins-bot: image-suggestion: reconfigure for data-gateway listener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1171702 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [17:07:45] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:08:10] !log swfrench@deploy1003 Started scap sync-world: Backport for [[gerrit:1171702|image-suggestion: reconfigure for data-gateway listener (T368096)]] [17:08:15] T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096 [17:08:36] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11079350 (10HSwan-WMF) @tappof Approved- thank you! [17:10:13] !log swfrench@deploy1003 swfrench, eevans: Backport for [[gerrit:1171702|image-suggestion: reconfigure for data-gateway listener (T368096)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:11:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:11:53] * swfrench-wmf tries to think whether there's anything testable in mw-debug here [17:12:09] (03CR) 10Andrew Bogott: [C:03+2] sssd: vary sssd.conf file mode depending on distro [puppet] - 10https://gerrit.wikimedia.org/r/1178038 (https://phabricator.wikimedia.org/T401584) (owner: 10Andrew Bogott) [17:13:14] PROBLEM - MariaDB Replica Lag: s8 #page on db2167 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2936.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:13:22] PROBLEM - MariaDB Replica Lag: s8 #page on db2166 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2945.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:13:38] (03PS3) 10Cwhite: prometheus: add additional metrics from logs [puppet] - 10https://gerrit.wikimedia.org/r/1175569 [17:13:50] 
PROBLEM - MariaDB Replica Lag: s8 #page on db2152 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2972.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:13:51] PROBLEM - MariaDB Replica Lag: s8 #page on db2163 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2972.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:13:54] federico3: is this potentially the s8 switchover ^^ [17:14:06] PROBLEM - MariaDB Replica Lag: s8 #page on db2154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2989.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:14:07] PROBLEM - MariaDB Replica Lag: s8 #page on db2161 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2989.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:14:12] PROBLEM - MariaDB Replica Lag: s8 #page on db2164 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2994.77 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:14:25] (03CR) 10FNegri: [C:03+1] "Tested by manually applying to `bastion-codfw1dev-06.bastioninfra-codfw1dev.codfw1dev.wikimedia.cloud`, it worked as expected." [puppet] - 10https://gerrit.wikimedia.org/r/1178043 (https://phabricator.wikimedia.org/T393762) (owner: 10Majavah) [17:14:52] Here [17:15:01] let me check [17:15:36] (03CR) 10Cwhite: prometheus: add additional metrics from logs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1175569 (owner: 10Cwhite) [17:15:58] heartbeat was not cleaned up [17:15:59] Amir1: thanks! 
let me know if you need more hands with anything [17:16:02] one second [17:16:07] ah, got it [17:17:35] holding off on proceeding past testservers until this is clear [17:18:29] !incidents [17:18:30] 6574 (ACKED) db2167 (paged)/MariaDB Replica Lag: s8 (paged) [17:18:30] 6575 (UNACKED) db2166 (paged)/MariaDB Replica Lag: s8 (paged) [17:18:30] 6576 (UNACKED) db2163 (paged)/MariaDB Replica Lag: s8 (paged) [17:18:30] 6577 (UNACKED) db2152 (paged)/MariaDB Replica Lag: s8 (paged) [17:18:31] 6578 (UNACKED) db2161 (paged)/MariaDB Replica Lag: s8 (paged) [17:18:31] 6579 (UNACKED) db2154 (paged)/MariaDB Replica Lag: s8 (paged) [17:18:31] 6580 (UNACKED) db2164 (paged)/MariaDB Replica Lag: s8 (paged) [17:18:31] 6570 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [17:18:31] 6571 (RESOLVED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin) [17:18:55] !ack 6575 6576 6577 6578 6579 6580 [17:18:55] Could not ack the alert. Please check the parameters. [17:18:58] * Emperor here [17:19:01] !ack 6575 [17:19:02] 6575 (ACKED) db2166 (paged)/MariaDB Replica Lag: s8 (paged) [17:19:03] !ack 6576 [17:19:04] 6576 (ACKED) db2163 (paged)/MariaDB Replica Lag: s8 (paged) [17:19:05] !ack 6577 [17:19:06] 6577 (ACKED) db2152 (paged)/MariaDB Replica Lag: s8 (paged) [17:19:07] !ack 6578 [17:19:08] 6578 (ACKED) db2161 (paged)/MariaDB Replica Lag: s8 (paged) [17:19:09] !ack 6579 [17:19:10] 6579 (ACKED) db2154 (paged)/MariaDB Replica Lag: s8 (paged) [17:19:12] !ack 6580 [17:19:13] 6580 (ACKED) db2164 (paged)/MariaDB Replica Lag: s8 (paged) [17:19:15] oncallers need any help? 
[17:19:22] RECOVERY - MariaDB Replica Lag: s8 #page on db2166 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:19:25] there we go [17:19:35] heartbeat on master of codfw was dead [17:19:38] I restarted it [17:19:50] it fixed everything [17:19:50] RECOVERY - MariaDB Replica Lag: s8 #page on db2152 is OK: OK slave_sql_lag Replication lag: 0.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:19:51] RECOVERY - MariaDB Replica Lag: s8 #page on db2163 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:20:06] RECOVERY - MariaDB Replica Lag: s8 #page on db2161 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:20:07] RECOVERY - MariaDB Replica Lag: s8 #page on db2154 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:20:14] RECOVERY - MariaDB Replica Lag: s8 #page on db2164 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:20:15] RECOVERY - MariaDB Replica Lag: s8 #page on db2167 is OK: OK slave_sql_lag Replication lag: 0.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:20:18] https://www.irccloud.com/pastebin/Jtpi3n8Y/ [17:20:25] Amir1: thank you very much! sounds like this is just something that was missed in the s8 primary switch in codfw? 
[17:20:48] yeah [17:20:50] (i.e., not a novel problem of some sort) [17:20:54] cool, thank you [17:21:05] !incidents [17:21:05] 6580 (RESOLVED) db2164 (paged)/MariaDB Replica Lag: s8 (paged) [17:21:05] 6574 (RESOLVED) db2167 (paged)/MariaDB Replica Lag: s8 (paged) [17:21:06] 6579 (RESOLVED) db2154 (paged)/MariaDB Replica Lag: s8 (paged) [17:21:06] 6578 (RESOLVED) db2161 (paged)/MariaDB Replica Lag: s8 (paged) [17:21:06] 6576 (RESOLVED) db2163 (paged)/MariaDB Replica Lag: s8 (paged) [17:21:06] 6577 (RESOLVED) db2152 (paged)/MariaDB Replica Lag: s8 (paged) [17:21:06] 6575 (RESOLVED) db2166 (paged)/MariaDB Replica Lag: s8 (paged) [17:21:07] 6570 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [17:21:07] 6571 (RESOLVED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin) [17:21:28] Did those get past the on-callers or is VO paging everyone again? [17:21:57] RECOVERY - Disk space on an-worker1145 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1145&var-datasource=eqiad+prometheus/ops [17:22:02] there was a ~ 5m delay between icinga-wm and the page, so I think it might have just snuck past both oncallers [17:22:36] hey ho [17:25:01] * swfrench-wmf is going to get this backport going again [17:25:04] oh wow [17:25:08] !log swfrench@deploy1003 swfrench, eevans: Continuing with sync [17:26:16] 07sre-alert-triage, 06Data-Persistence: Alert in need of triage: ProbeDown (instance data-gateway-staging:30443) - https://phabricator.wikimedia.org/T399159#11079450 (10BTullis) [17:30:39] !log swfrench@deploy1003 Finished scap sync-world: Backport for [[gerrit:1171702|image-suggestion: reconfigure for data-gateway listener (T368096)]] (duration: 22m 28s) [17:30:44] T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096 [17:37:33] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" 
[puppet] - 10https://gerrit.wikimedia.org/r/1178043 (https://phabricator.wikimedia.org/T393762) (owner: 10Majavah) [17:38:03] (03CR) 10Majavah: [C:03+2] base: gen_fingerprints: Update sshd path for Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1178043 (https://phabricator.wikimedia.org/T393762) (owner: 10Majavah) [17:38:06] PROBLEM - MariaDB Replica Lag: s8 #page on db2161 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:38:07] PROBLEM - MariaDB Replica Lag: s8 #page on db2154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:38:12] PROBLEM - MariaDB Replica Lag: s8 #page on db2164 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 618.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:38:16] PROBLEM - MariaDB Replica Lag: s8 #page on db2167 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:38:22] PROBLEM - MariaDB Replica Lag: s8 #page on db2166 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 628.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:38:50] PROBLEM - MariaDB Replica Lag: s8 #page on db2152 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:38:51] PROBLEM - MariaDB Replica Lag: s8 #page on db2163 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.61 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:38:53] again? [17:39:00] seriously [17:39:53] acked all [17:39:59] thanks, rzl! 
[17:40:37] (03CR) 10Ahmon Dancy: [C:03+1] "Looks reasonable to me" [puppet] - 10https://gerrit.wikimedia.org/r/1177456 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn) [17:41:11] Let me check [17:41:22] pt-heartbeat-wikimedia.service is dead again on 2161? [17:41:31] from journalctl -fu pt-heartbeat-wikimedia on db2161 I see it started at 17:19:19 by Amir1 and then stopped at 17:27:52 [17:41:32] yeah [17:42:01] hm, concurrent with a puppet run at 17:27:36 [17:42:08] was about to say [17:42:10] or at least suspiciously close [17:42:12] I was just about to ask yeah [17:42:14] this is missing puppet patch [17:42:18] one second [17:42:19] ^ that [17:42:46] the switchover was not done properly [17:42:52] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178022 [17:43:10] (03PS2) 10Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1178022 (https://phabricator.wikimedia.org/T401713) [17:43:14] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1178022 (https://phabricator.wikimedia.org/T401713) (owner: 10Gerrit maintenance bot) [17:45:05] running puppet agent [17:45:14] PROBLEM - MariaDB Replica Lag: s8 #page on db2181 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1039.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:45:35] !ack 6590 [17:45:35] 6590 (ACKED) db2181 (paged)/MariaDB Replica Lag: s8 (paged) [17:46:14] RECOVERY - MariaDB Replica Lag: s8 #page on db2164 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:46:15] RECOVERY - MariaDB Replica Lag: s8 #page on db2167 is OK: OK slave_sql_lag Replication lag: 0.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:46:16] RECOVERY - MariaDB Replica Lag: s8 #page on db2181 is OK: OK slave_sql_lag 
Replication lag: 0.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:46:21] (03CR) 10Muehlenhoff: admin: stop using groups parsoid-roots and parsoid-admin (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn) [17:46:23] RECOVERY - MariaDB Replica Lag: s8 #page on db2166 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:46:48] lovely [17:46:49] RECOVERY - MariaDB Replica Lag: s8 #page on db2152 is OK: OK slave_sql_lag Replication lag: 0.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:46:50] RECOVERY - MariaDB Replica Lag: s8 #page on db2163 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:47:07] RECOVERY - MariaDB Replica Lag: s8 #page on db2154 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:47:08] RECOVERY - MariaDB Replica Lag: s8 #page on db2161 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:47:08] the old primary is now not happy, probably because the row for heartbeat is now messed up [17:47:22] but only one host [17:47:26] https://orchestrator.wikimedia.org/web/cluster/alias/s8 [17:48:23] fixed [17:48:31] Amir1: thank you very much once again! [17:48:35] Amir1: heroic, thank you [17:48:43] <3 [17:49:39] Thanks Amir!! 
[17:54:52] for future reference this step was missed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178022 [17:54:52] > Merge gerrit puppet change to promote NEW primary: FIXME [17:54:52] https://phabricator.wikimedia.org/T401713 [18:00:04] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1800) [18:05:31] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178054 (https://phabricator.wikimedia.org/T396375) [18:05:33] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178054 (https://phabricator.wikimedia.org/T396375) (owner: 10TrainBranchBot) [18:06:24] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178054 (https://phabricator.wikimedia.org/T396375) (owner: 10TrainBranchBot) [18:06:35] (03PS1) 10Andrew Bogott: sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055 [18:07:05] (03CR) 10CI reject: [V:04-1] sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott) [18:07:41] RECOVERY - Disk space on an-worker1120 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1120&var-datasource=eqiad+prometheus/ops [18:10:35] (03PS2) 10Andrew Bogott: sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055 [18:11:01] (03CR) 10CI reject: [V:04-1] sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott) [18:11:13] (03PS3) 10Andrew Bogott: sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055 [18:11:22] 
(03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott) [18:13:43] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.14 refs T396375 [18:13:47] T396375: 1.45.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T396375 [18:13:56] (03PS4) 10Andrew Bogott: sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055 [18:14:08] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott) [18:17:10] (03PS5) 10Andrew Bogott: sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055 [18:17:14] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott) [18:19:25] 06SRE, 10LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11079625 (10KFrancis) Thanks for checking in. No further action is needed from you, @QChris. We're waiting on legal counsel. I just pinged him. 
[18:20:31] RECOVERY - Disk space on an-worker1121 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops [18:21:36] (03PS6) 10Andrew Bogott: sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055 [18:21:41] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott) [18:23:33] (03CR) 10Majavah: [C:03+1] sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott) [18:23:57] (03CR) 10Dzahn: [C:03+2] "thanks, Ahmon" [puppet] - 10https://gerrit.wikimedia.org/r/1177456 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn) [18:24:09] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bullseye [18:24:17] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11079633 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors:... 
[18:25:32] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11079651 (10Dzahn) [18:26:24] (03CR) 10Andrew Bogott: [C:03+2] sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott) [18:30:16] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11079683 (10Dzahn) deployed on Icinga and integration [18:33:21] RECOVERY - Disk space on an-worker1118 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops [18:34:11] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye [18:34:21] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11079702 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye [18:36:38] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [18:37:59] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:41:59] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:47:24] (03PS1) 10Andrew Bogott: neutron metadata agent: remove service restart [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742) [18:47:33] (03Abandoned) 10Andrew Bogott: neutron metadata agent: restart service daily 
[puppet] - 10https://gerrit.wikimedia.org/r/1173388 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott) [18:48:01] (03PS2) 10Andrew Bogott: neutron metadata agent: remove service restart [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742) [18:55:02] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11079827 (10ssingh) Hi folks: Thanks for confirming the extent of the changes from fr-tech's side. We discussed this a b... [19:01:21] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:09:32] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:11:25] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [19:14:28] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for nokia switches codfw - cmooney@cumin1003" [19:14:41] (03PS1) 10Cathal Mooney: Add INCLUDEs for Netbox-generated files for new codfw subnets [dns] - 10https://gerrit.wikimedia.org/r/1178065 (https://phabricator.wikimedia.org/T380240) [19:14:44] !log cmooney@cumin1003 
END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for nokia switches codfw - cmooney@cumin1003" [19:14:44] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:25:12] (03PS4) 10Dzahn: admin: stop using groups parsoid-roots and parsoid-admin [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) [19:25:16] (03CR) 10Dzahn: admin: stop using groups parsoid-roots and parsoid-admin (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn) [19:26:13] RECOVERY - Disk space on an-worker1129 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1129&var-datasource=eqiad+prometheus/ops [19:27:20] (03CR) 10Ssingh: [C:03+1] "Looks good, compared each v4 and v6 against Netbox." [dns] - 10https://gerrit.wikimedia.org/r/1178065 (https://phabricator.wikimedia.org/T380240) (owner: 10Cathal Mooney) [19:27:41] (03CR) 10Cathal Mooney: [C:03+2] Add INCLUDEs for Netbox-generated files for new codfw subnets [dns] - 10https://gerrit.wikimedia.org/r/1178065 (https://phabricator.wikimedia.org/T380240) (owner: 10Cathal Mooney) [19:28:07] !log cmooney@dns2005 START - running authdns-update [19:28:54] andrew@cumin2002 reimage (PID 2607923) is awaiting input [19:28:54] !log cmooney@dns2005 END - running authdns-update [19:29:40] 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11079936 (10egardner) [19:32:03] RECOVERY - Disk space on an-worker1140 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1140&var-datasource=eqiad+prometheus/ops [19:34:21] (03PS1) 10Hashar: Add 
option to use the public hostname of a registry [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1178068 (https://phabricator.wikimedia.org/T401733) [19:41:09] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bullseye [19:41:22] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11079993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors:... [19:49:38] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1210.eqiad.wmnet with reason: Maintenance [19:50:23] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2213.codfw.wmnet with reason: Maintenance [19:50:53] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott) [19:51:07] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye [19:51:35] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11080059 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye [19:53:56] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1223.eqiad.wmnet with reason: Maintenance [19:57:00] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2205.codfw.wmnet with reason: Maintenance [19:57:27] (03PS1) 10Dzahn: zuul: create empty dir /var/lib/zuul on new zuul main hosts [puppet] - 10https://gerrit.wikimedia.org/r/1178069 
(https://phabricator.wikimedia.org/T395938) [19:57:50] (03CR) 10Dzahn: [C:03+2] zuul: create empty dir /var/lib/zuul on new zuul main hosts [puppet] - 10https://gerrit.wikimedia.org/r/1178069 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [19:59:22] (03CR) 10Gergő Tisza: [C:03+1] prometheus: add additional metrics from logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175569 (owner: 10Cwhite) [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T2000). [20:00:05] dbrant and anzx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:19] o/ [20:00:41] o/ I can self-deploy mine [20:01:28] (03PS3) 10Andrew Bogott: neutron metadata agent: remove service restart [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742) [20:01:30] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott) [20:02:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dbrant@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178007 (https://phabricator.wikimedia.org/T399630) (owner: 10Dbrant) [20:03:37] (03Merged) 10jenkins-bot: Add app_activity_tab event stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178007 (https://phabricator.wikimedia.org/T399630) (owner: 10Dbrant) [20:04:01] !log dbrant@deploy1003 Started scap sync-world: Backport for [[gerrit:1178007|Add app_activity_tab event stream. (T399630)]] [20:04:06] T399630: Activity Tab: instrumentation - https://phabricator.wikimedia.org/T399630 [20:05:39] anzx: do you need a deployer? 
[20:05:52] cjming: yes [20:06:08] !log dbrant@deploy1003 dbrant: Backport for [[gerrit:1178007|Add app_activity_tab event stream. (T399630)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:06:12] dbrant: ping me when you're done and i can do the rest in the queue [20:06:19] thx! [20:06:37] np! [20:07:12] !log dbrant@deploy1003 dbrant: Continuing with sync [20:11:58] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2165.codfw.wmnet with reason: Maintenance [20:12:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T400854)', diff saved to https://phabricator.wikimedia.org/P81144 and previous config saved to /var/cache/conftool/dbconfig/20250812-201205-ladsgroup.json [20:12:10] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [20:12:42] !log dbrant@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178007|Add app_activity_tab event stream. (T399630)]] (duration: 08m 41s) [20:12:47] T399630: Activity Tab: instrumentation - https://phabricator.wikimedia.org/T399630 [20:12:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:13:20] cjming done! [20:13:31] great - thanks! 
[20:13:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T400854)', diff saved to https://phabricator.wikimedia.org/P81146 and previous config saved to /var/cache/conftool/dbconfig/20250812-201358-ladsgroup.json [20:14:23] (03PS2) 10Anzx: madwikisource: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178023 (https://phabricator.wikimedia.org/T391767) [20:15:15] !log remove thanos-query.discovery.wmnet old puppet cert - T401671 [20:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:19] T401671: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://phabricator.wikimedia.org/T401671 [20:15:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178023 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx) [20:16:33] (03Merged) 10jenkins-bot: madwikisource: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178023 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx) [20:16:55] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1178023|madwikisource: add logo (T391767)]] [20:17:00] T391767: Post-creation work for madwikisource - https://phabricator.wikimedia.org/T391767 [20:17:47] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [20:17:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T400854)', diff saved to https://phabricator.wikimedia.org/P81147 and previous config saved to /var/cache/conftool/dbconfig/20250812-201754-ladsgroup.json [20:17:59] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [20:19:05] !log cjming@deploy1003 cjming, anzx: Backport for [[gerrit:1178023|madwikisource: add logo (T391767)]] synced to the testservers 
(see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:19:29] (03CR) 10Cwhite: [C:03+2] prometheus: add additional metrics from logs [puppet] - 10https://gerrit.wikimedia.org/r/1175569 (owner: 10Cwhite) [20:19:44] cjming: look good [20:19:54] cool - syncing [20:19:57] !log cjming@deploy1003 cjming, anzx: Continuing with sync [20:20:27] (03PS2) 10Anzx: zghwiktionary: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178027 (https://phabricator.wikimedia.org/T399785) [20:22:45] RECOVERY - Disk space on an-worker1133 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1133&var-datasource=eqiad+prometheus/ops [20:22:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:23:10] RESOLVED: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:24:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T400854)', diff saved to https://phabricator.wikimedia.org/P81148 and previous config saved to /var/cache/conftool/dbconfig/20250812-202437-ladsgroup.json [20:24:43] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [20:25:16] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178023|madwikisource: add logo (T391767)]] (duration: 08m 21s) [20:25:20] T391767: Post-creation work for madwikisource - https://phabricator.wikimedia.org/T391767 [20:26:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap 
backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178027 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx) [20:26:53] (03Merged) 10jenkins-bot: zghwiktionary: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178027 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx) [20:27:16] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1178027|zghwiktionary: add logos (T399785)]] [20:27:20] T399785: Post-creation work for zghwiktionary - https://phabricator.wikimedia.org/T399785 [20:29:25] !log cjming@deploy1003 cjming, anzx: Backport for [[gerrit:1178027|zghwiktionary: add logos (T399785)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:29:37] (03PS1) 10Dzahn: zuul::main: create /var/lib/zuul/.ssh/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1178073 (https://phabricator.wikimedia.org/T395938) [20:29:57] (03PS5) 10Anzx: madwikisource: set metanamespace, sitename and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) [20:30:11] (03CR) 10CI reject: [V:04-1] zuul::main: create /var/lib/zuul/.ssh/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1178073 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:30:23] cjming: looks good [20:30:35] nice [20:30:43] !log cjming@deploy1003 cjming, anzx: Continuing with sync [20:31:16] (03PS2) 10Dzahn: zuul::main: create /var/lib/zuul/.ssh/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1178073 (https://phabricator.wikimedia.org/T395938) [20:31:41] (03CR) 10CI reject: [V:04-1] zuul::main: create /var/lib/zuul/.ssh/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1178073 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:35:58] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178027|zghwiktionary: add logos (T399785)]] (duration: 08m 42s) [20:36:03] T399785: 
Post-creation work for zghwiktionary - https://phabricator.wikimedia.org/T399785 [20:36:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx) [20:37:39] (03Merged) 10jenkins-bot: madwikisource: set metanamespace, sitename and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx) [20:38:00] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1177936|madwikisource: set metanamespace, sitename and timezone (T391767)]] [20:38:04] T391767: Post-creation work for madwikisource - https://phabricator.wikimedia.org/T391767 [20:39:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P81149 and previous config saved to /var/cache/conftool/dbconfig/20250812-203945-ladsgroup.json [20:40:05] !log cjming@deploy1003 cjming, anzx: Backport for [[gerrit:1177936|madwikisource: set metanamespace, sitename and timezone (T391767)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:41:47] anzx: ^^? 
[20:42:31] cjming: ok to proceed [20:42:42] great [20:42:45] !log cjming@deploy1003 cjming, anzx: Continuing with sync [20:45:59] andrew@cumin2002 reimage (PID 2644504) is awaiting input [20:47:06] (03PS4) 10Andrew Bogott: neutron metadata agent: remove service restart [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742) [20:48:02] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1177936|madwikisource: set metanamespace, sitename and timezone (T391767)]] (duration: 10m 02s) [20:48:07] T391767: Post-creation work for madwikisource - https://phabricator.wikimedia.org/T391767 [20:48:38] (03PS4) 10Anzx: minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) [20:48:46] (03CR) 10CI reject: [V:04-1] minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx) [20:49:22] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott) [20:50:28] anzx: can you fix last patch? 
[20:50:58] (03CR) 10Andrew Bogott: [C:03+2] neutron metadata agent: remove service restart [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott) [20:54:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P81150 and previous config saved to /var/cache/conftool/dbconfig/20250812-205453-ladsgroup.json [20:55:12] (03PS5) 10Anzx: minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) [20:55:54] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bullseye [20:57:33] !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=search,name=eqiad [20:58:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:58:31] cjming: fixed [20:58:50] (03PS6) 10Anzx: minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) [20:59:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 18.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:59:22] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host 
cloudcephosd1042.eqiad.wmnet with OS bullseye [20:59:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx) [20:59:41] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11080335 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T2100) [21:00:25] (03Merged) 10jenkins-bot: minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx) [21:00:48] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1178000|minwikibooks , zghwiktionary : add project talk namespace aliases (T399785 T395499)]] [21:00:54] T399785: Post-creation work for zghwiktionary - https://phabricator.wikimedia.org/T399785 [21:00:54] T395499: Post-creation work for minwikibooks - https://phabricator.wikimedia.org/T395499 [21:02:54] !log cjming@deploy1003 cjming, anzx: Backport for [[gerrit:1178000|minwikibooks , zghwiktionary : add project talk namespace aliases (T399785 T395499)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:03:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:03:33] anzx: lmk ^ [21:04:38] cjming: works fine [21:05:44] !log cjming@deploy1003 cjming, anzx: Continuing with sync [21:10:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T400854)', diff saved to https://phabricator.wikimedia.org/P81151 and previous config saved to /var/cache/conftool/dbconfig/20250812-211001-ladsgroup.json [21:10:06] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [21:10:16] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [21:10:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T400854)', diff saved to https://phabricator.wikimedia.org/P81152 and previous config saved to /var/cache/conftool/dbconfig/20250812-211023-ladsgroup.json [21:10:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:11:19] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178000|minwikibooks , zghwiktionary : add project talk namespace aliases (T399785 T395499)]] (duration: 10m 31s) [21:11:24] T399785: Post-creation work for zghwiktionary - https://phabricator.wikimedia.org/T399785 [21:11:25] T395499: Post-creation work for minwikibooks - https://phabricator.wikimedia.org/T395499 [21:11:25] cjming: please run namespace dupes 
https://www.irccloud.com/pastebin/0s0zBow1/ [21:11:32] (03CR) 10Cwhite: [C:03+2] mediawiki-global: set sre as receiver of MediaWikiElevatedUnknownLogins [alerts] - 10https://gerrit.wikimedia.org/r/1175578 (https://phabricator.wikimedia.org/T395117) (owner: 10Cwhite) [21:12:16] anzx: also for minwikibooks right? [21:12:28] cjming: hi, looks like the high 5xx rate might be associated with one of those patches, can you check? [21:12:29] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1120 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:29] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1076 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:29] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1082 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:32] `mwscript-k8s --comment=T395499 --follow -- namespaceDupes minwikibooks --fix --add-prefix=T399785 | tee ~/T395499` for minwikibooks [21:12:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1087 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1070 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:35] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1098 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:37] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1116 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:41] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1109 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:43] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1083 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:43] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1112 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:47] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1113 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:49] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1073 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1110 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:53] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1089 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:55] rzl: oh - shoot [21:12:58] ^ expected, but there should be a maintenance window set. fixing [21:12:59] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1069 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:01] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1097 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:01] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1124 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T400854)', diff saved to https://phabricator.wikimedia.org/P81153 and previous config saved to /var/cache/conftool/dbconfig/20250812-211303-ladsgroup.json [21:13:07] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1080 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:07] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1122 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:07] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1100 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:07] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1092 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:07] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1114 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:07] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1118 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:07] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1077 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:08] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1086 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:08] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1095 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:09] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1115 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:09] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1074 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:10] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1093 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:10] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1071 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:11] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1117 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:11] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:12] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1085 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:12] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1099 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:13] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1111 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:13] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1108 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:14] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:14] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1094 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:15] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1096 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:15] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1090 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:16] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1081 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:16] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1107 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:17] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1079 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:17] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:18] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1121 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:18] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1119 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:19] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1091 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:19] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1125 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:20] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:21] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1102 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:21] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1075 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:22] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1088 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:22] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1084 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:23] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1101 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:23] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1078 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:25] (03Merged) 10jenkins-bot: mediawiki-global: set sre as receiver of MediaWikiElevatedUnknownLogins [alerts] - 10https://gerrit.wikimedia.org/r/1175578 (https://phabricator.wikimedia.org/T395117) (owner: 10Cwhite) [21:13:42] sorry for the spam, downtime going up now :) [21:13:46] for clarity, the search spam is expected and is unrelated, the MW 5xx alert is genuine [21:13:50] rzl: https://logstash.wikimedia.org/goto/43235ff51b952d7b80590b895fd761ff - seems to all be s8 / wikidatawiki [21:14:07] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1042.eqiad.wmnet with OS bullseye [21:14:21] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11080438 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors:... [21:14:28] rzl: not sure what to do for the MW 5xx alerts [21:14:30] swfrench-wmf: hm, nice [21:14:50] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 55 hosts with reason: investigate cluster quorum failure [21:14:51] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye [21:14:55] am i ok to run one more maintenance script? [21:15:09] cjming: did you roll anything out around 20:50 to 20:55, and if so, can you roll it back please? 
:) [21:16:30] rzl: i started one deployment at 20:36 ending 20:48, and last one 20:59 ending 21:11 [21:16:46] first one is the suspect [21:16:53] just based on timing [21:17:16] looking at the patch I don't see anything likely to cause this, especially with the s8 connection swfrench-wmf points out, but as long as it's easy to rollback and rule it out, let's do that now please [21:17:23] ok - so revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1177936? [21:17:31] fyi anzx ^^ [21:18:33] does that mean i should also rollback #415? 1177936 correlates to #414 [21:19:25] rzl: do you happen to know if there's anything special around wikidata using search? the reason I ask is this maybe correlates with when search-eqiad was depooled https://sal.toolforge.org/log/tNASoJgBffdvpiTr87jw [21:19:33] cjming: as the deployer I need you to make that decision, or to escalate to releng if you need help :) [21:20:05] swfrench-wmf: hm, the errors started rising before that specific log line, but it's plausible [21:20:11] ok - so then i will revert the last 2 deployments in this order: #415, then #414 [21:20:22] ryankemper: can you weigh in on swfrench-wmf's question? [21:20:27] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1042.eqiad.wmnet with OS bullseye [21:20:34] inflatador_ too ^ [21:20:44] should i proceed? 
[21:20:55] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye [21:20:59] cjming: yes [21:21:08] swfrench-wmf rzl :eyes [21:21:32] trying to see where the 5xx are coming from [21:21:33] (03PS1) 10Clare Ming: Revert "minwikibooks , zghwiktionary : add project talk namespace aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178080 [21:21:59] (03CR) 10Anzx: [C:03+1] Revert "minwikibooks , zghwiktionary : add project talk namespace aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178080 (owner: 10Clare Ming) [21:22:35] when we depooled the cluster there shouldn't be immediate impact. at the moment the eqiad cluster was restarted there was very little remaining threadpool activity on opensearch [21:22:40] can i revert both at same time or should i do one at a time? [21:22:49] so I wouldn't expect many mw errors as a result [21:23:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178080 (owner: 10Clare Ming) [21:23:25] I'm a bit confused about the wikidata part of the question though. 
I might be missing some context from the backscroll [21:23:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2206.codfw.wmnet with reason: Maintenance [21:23:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T399249)', diff saved to https://phabricator.wikimedia.org/P81154 and previous config saved to /var/cache/conftool/dbconfig/20250812-212344-fceratto.json [21:23:48] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [21:24:17] ryankemper: so, the errors we're seeing are all in wikidata, https://logstash.wikimedia.org/goto/43235ff51b952d7b80590b895fd761ff [21:24:18] ryankemper: inflatador_: onset of errors seems to be closer to 20:52, so unless anything went sideways around then on the search side of things (i.e., a couple minutes before the depool), then it's probably nothing [21:24:32] (03Merged) 10jenkins-bot: Revert "minwikibooks , zghwiktionary : add project talk namespace aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178080 (owner: 10Clare Ming) [21:24:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [21:24:54] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1178080|Revert "minwikibooks , zghwiktionary : add project talk namespace aliases"]] [21:25:13] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1088 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, 
unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, [21:25:13] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10161, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:25:13] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1084 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, [21:25:13] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10166, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:25:13] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1091 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, [21:25:14] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10384, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:25:14] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1102 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57, [21:25:15] of_in_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 10417, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:25:15] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1075 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 58, [21:25:16] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10589, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:25:17] RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_t [21:25:18] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10676, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:25:18] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1078 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 58, [21:25:19] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10660, active_shards_percent_as_number: NaN 
https://wikitech.wikimedia.org/wiki/Search%23Administration [21:25:19] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1101 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 58, [21:25:20] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10692, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:25:20] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1125 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 60, [21:25:21] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 11739, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:26:16] huh, I wonder why that would cause this spike [21:26:35] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1083 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1469, active_shards: 3763, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 625, delayed_unassigned_shards: 0, number_of_pending [21:26:35] 51, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 68148, active_shards_percent_as_number: 85.13574660633483 
https://wikitech.wikimedia.org/wiki/Search%23Administration [21:26:35] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1112 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1469, active_shards: 3826, relocating_shards: 0, initializing_shards: 43, unassigned_shards: 551, delayed_unassigned_shards: 0, number_of_pending [21:26:35] 54, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 69195, active_shards_percent_as_number: 86.56108597285068 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:26:39] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1113 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1469, active_shards: 3986, relocating_shards: 0, initializing_shards: 39, unassigned_shards: 395, delayed_unassigned_shards: 0, number_of_pending [21:26:39] 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 72671, active_shards_percent_as_number: 90.18099547511312 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:26:41] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1073 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1469, active_shards: 4014, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 354, delayed_unassigned_shards: 0, number_of_pending [21:26:41] 21, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 73899, active_shards_percent_as_number: 90.81447963800905 
https://wikitech.wikimedia.org/wiki/Search%23Administration [21:26:45] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1089 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1469, active_shards: 4097, relocating_shards: 0, initializing_shards: 50, unassigned_shards: 273, delayed_unassigned_shards: 0, number_of_pending [21:26:45] 21, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 79167, active_shards_percent_as_number: 92.6923076923077 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:26:45] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1110 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1469, active_shards: 4097, relocating_shards: 0, initializing_shards: 50, unassigned_shards: 273, delayed_unassigned_shards: 0, number_of_pending [21:26:45] 21, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 79187, active_shards_percent_as_number: 92.6923076923077 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:26:46] rzl: swfrench-wmf: ah let me inspect that logstash a bit [21:26:51] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1069 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1469, active_shards: 4125, relocating_shards: 0, initializing_shards: 31, unassigned_shards: 264, delayed_unassigned_shards: 0, number_of_pending [21:26:51] 7, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 85068, active_shards_percent_as_number: 
93.32579185520362 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:26:53] I wonder if this is a coincidence and this is traffic [21:26:53] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1124 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1469, active_shards: 4138, relocating_shards: 0, initializing_shards: 18, unassigned_shards: 264, delayed_unassigned_shards: 0, number_of_pending [21:26:53] 8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 86269, active_shards_percent_as_number: 93.61990950226244 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:26:53] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1097 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1469, active_shards: 4138, relocating_shards: 0, initializing_shards: 18, unassigned_shards: 264, delayed_unassigned_shards: 0, number_of_pending [21:26:53] 8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 86273, active_shards_percent_as_number: 93.61990950226244 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:26:57] I'm more and more convinced the errors are unrelated to both the backport and the search errors, but I'd still like to rule them out conclusively [21:26:57] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1122 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1471, active_shards: 4179, relocating_shards: 0, initializing_shards: 12, 
unassigned_shards: 229, delayed_unassigned_shards: 0, number_of_pending [21:26:57] 29, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 91366, active_shards_percent_as_number: 94.5475113122172 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:26:59] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1111 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1471, active_shards: 4209, relocating_shards: 0, initializing_shards: 22, unassigned_shards: 189, delayed_unassigned_shards: 0, number_of_pending [21:26:59] 13, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 92161, active_shards_percent_as_number: 95.2262443438914 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:26:59] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1117 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1471, active_shards: 4209, relocating_shards: 0, initializing_shards: 22, unassigned_shards: 189, delayed_unassigned_shards: 0, number_of_pending [21:26:59] 13, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 92181, active_shards_percent_as_number: 95.2262443438914 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:26:59] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1093 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1471, active_shards: 4209, relocating_shards: 0, initializing_shards: 22, unassigned_shards: 189, 
delayed_unassigned_shards: 0, number_of_pending [21:27:00] !log cjming@deploy1003 cjming: Backport for [[gerrit:1178080|Revert "minwikibooks , zghwiktionary : add project talk namespace aliases"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:27:23] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1082 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pending_ [21:27:23] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:27:23] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1076 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pending_ [21:27:23] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:27:23] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1120 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 
36, delayed_unassigned_shards: 0, number_of_pending_ [21:27:24] !log cjming@deploy1003 cjming: Continuing with sync [21:27:24] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:27:25] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1087 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pending_ [21:27:25] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:27:27] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1070 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pending_ [21:27:27] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:27:27] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1098 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 
36, delayed_unassigned_shards: 0, number_of_pending_ [21:27:27] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:27:29] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1116 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pending_ [21:27:30] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:27:33] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1109 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pending_ [21:27:33] , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:27:39] rzl: it's shellbox-constraints: https://grafana.wikimedia.org/goto/u-CMub_Ng?orgId=1 [21:27:44] it's wildly overloaded [21:28:07] okay that tracks with the /wiki/Special:ConstraintReport/Qnnnnnn urls [21:28:10] something started around 16:00 and only now has it hit the fan [21:28:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P81155 and previous config saved to 
/var/cache/conftool/dbconfig/20250812-212811-ladsgroup.json [21:28:53] you're right about that delay though, interesting [21:29:13] presumably then the deployments that just happened are not the culprit? [21:29:23] should i continue reverting? [21:29:24] I think the spike of 500s is related to this as well, FWIW. [21:29:53] cjming: let's keep reverting, just to ensure it's not something non-obvious, I'd say. [21:30:06] alrighty [21:30:10] rzl: shall I try throwing pods at the problem as a short-term mitigation? [21:30:23] (03PS1) 10Clare Ming: Revert "madwikisource: set metanamespace, sitename and timezone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178081 [21:30:24] (while we sort out the source of traffic) [21:30:39] (03PS1) 10Dzahn: create zuul.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1178082 (https://phabricator.wikimedia.org/T395938) [21:30:45] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [21:30:50] CirrusSearch consumer-search@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [21:30:53] swfrench-wmf: yeah, go for it [21:30:53] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... 
[21:30:59] CirrusSearch consumer-search@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [21:31:03] belatedly I think I'm the IC :) I'll start a doc shortly [21:31:06] (03PS2) 10Dzahn: create zuul.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1178082 (https://phabricator.wikimedia.org/T395938) [21:31:16] rzl: <3 [21:31:28] rzl et al - so I should hold off on the security deployment I wanted to do real quick... [21:31:35] sbassett: yes please [21:31:36] (03CR) 10Anzx: [C:03+1] Revert "madwikisource: set metanamespace, sitename and timezone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178081 (owner: 10Clare Ming) [21:31:47] understood [21:32:05] once the incident's over, I think cjming still has the floor but I'll let you both know [21:32:17] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [21:32:28] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [21:32:30] https://trace.wikimedia.org/trace/221540befbfac471f5856d221f9b6675 [21:32:42] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178080|Revert "minwikibooks , zghwiktionary : add project talk namespace aliases"]] (duration: 07m 48s) [21:32:42] ^ jaeger trace for a slow Special:ConstraintReport request, we're living in the future [21:33:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178081 (owner: 10Clare Ming) [21:33:30] (03PS1) 10Andrew Bogott: neutron metadata agent: remove service restart timer [puppet] - 
https://gerrit.wikimedia.org/r/1178083
[21:34:03] (Merged) jenkins-bot: Revert "madwikisource: set metanamespace, sitename and timezone" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178081 (owner: Clare Ming)
[21:34:12] rzl: doubled shellbox-constraints in codfw to 20 replicas. this is a dirty edit for now, as I'm not quite sure what the right size might be
[21:34:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 9.967% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:34:26] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1178081|Revert "madwikisource: set metanamespace, sitename and timezone"]]
[21:34:38] swfrench-wmf: you might be interested in that jaeger trace, a bunch of fast shellbox calls and a handful of very slow ones
[21:34:44] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[21:34:59] we don't have a bunch of healthy pods and one very sad one, do we?
[21:35:44] hm, doesn't seem like it
[21:35:45] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ...
[21:35:45] CirrusSearch consumer-search@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow
[21:35:48] (PS2) Andrew Bogott: neutron metadata agent: remove service restart timer [puppet] - https://gerrit.wikimedia.org/r/1178083
[21:36:02] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ...
[21:36:08] CirrusSearch consumer-search@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow
[21:36:13] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[21:36:16] rzl: I'm going to boop you in another channel
[21:36:20] 👍
[21:36:31] !log cjming@deploy1003 cjming: Backport for [[gerrit:1178081|Revert "madwikisource: set metanamespace, sitename and timezone"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:37:00] (PS1) Dzahn: trafficserver: create a map for zuul.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1178084
[21:37:02] !log cjming@deploy1003 cjming: Continuing with sync
[21:38:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:38:52] (CR) Andrew Bogott: [C:+2] neutron metadata agent: remove service restart timer [puppet] - https://gerrit.wikimedia.org/r/1178083 (owner: Andrew Bogott)
[21:42:17] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178081|Revert "madwikisource: set metanamespace, sitename and timezone"]] (duration: 07m 51s)
[21:42:41] reverts of deployments #415 and #414 are done - i guess maybe that was it?
[21:43:02] cjming: looks like we're satisfied this was traffic-related; thanks for rolling back even though it turned out not to be needed <3 please don't roll anything forward yet, let you know soon
[21:43:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P81156 and previous config saved to /var/cache/conftool/dbconfig/20250812-214318-ladsgroup.json
[21:43:30] rzl: sounds good
[21:43:45] anzx: sorry about that
[21:44:29] cjming: no worries, will probably schedule it for tomorrow, thanks for deploying
[21:45:28] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[21:52:33] FIRING: CalicoHighMemoryUsage: Calico container calico-kube-controllers-5d889dc5fc-kfbzh:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[21:54:42] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=search,name=eqiad
[21:58:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T400854)', diff saved to https://phabricator.wikimedia.org/P81157 and previous config saved to /var/cache/conftool/dbconfig/20250812-215826-ladsgroup.json
[21:58:31] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[21:58:42] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[21:58:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T400854)', diff saved to https://phabricator.wikimedia.org/P81158 and previous config saved to /var/cache/conftool/dbconfig/20250812-215849-ladsgroup.json
[22:00:05] cjming, sbassett: okay, we're just cleaning up a little from the incident but SRE's comfortable with resuming deployments now, thanks a lot for your patience
[22:01:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T400854)', diff saved to https://phabricator.wikimedia.org/P81159 and previous config saved to /var/cache/conftool/dbconfig/20250812-220132-ladsgroup.json
[22:08:30] cjming, sbassett: I'll leave the coordination between you -- cjming if you want to roll forward anzx's changes after all, no objections from me
[22:09:27] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:09:31] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:10:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:10:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:16:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P81160 and previous config saved to /var/cache/conftool/dbconfig/20250812-221639-ladsgroup.json
[22:17:33] RESOLVED: CalicoHighMemoryUsage: Calico container calico-kube-controllers-5d889dc5fc-kfbzh:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[22:24:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:30:22] SRE, collaboration-services, Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#11080632 (Dzahn)
[22:31:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P81161 and previous config saved to /var/cache/conftool/dbconfig/20250812-223147-ladsgroup.json
[22:36:38] FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[22:37:27] (PS1) Dzahn: zuul::main: allow caching layer to connect to http backend [puppet] - https://gerrit.wikimedia.org/r/1178093
[22:38:05] (CR) CI reject: [V:-1] zuul::main: allow caching layer to connect to http backend [puppet] - https://gerrit.wikimedia.org/r/1178093 (owner: Dzahn)
[22:38:18] (CR) Dzahn: [C:+2] add zoekt from upstream and blubber builder config to build it [container/codesearch] - https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[22:39:50] (Merged) jenkins-bot: add zoekt from upstream and blubber builder config to build it [container/codesearch] - https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[22:40:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.121) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[22:40:46] (PS1) RLazarus: shellbox-constraints: Bump replicas from 10 to 20 for traffic increase [deployment-charts] - https://gerrit.wikimedia.org/r/1178094
[22:46:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T400854)', diff saved to https://phabricator.wikimedia.org/P81162 and previous config saved to /var/cache/conftool/dbconfig/20250812-224655-ladsgroup.json
[22:47:05] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[22:47:10] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1195.eqiad.wmnet with reason: Maintenance
[22:47:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T400854)', diff saved to https://phabricator.wikimedia.org/P81163 and previous config saved to /var/cache/conftool/dbconfig/20250812-224717-ladsgroup.json
[22:47:44] (CR) Scott French: [C:+1] "Thanks, Reuven!" [deployment-charts] - https://gerrit.wikimedia.org/r/1178094 (owner: RLazarus)
[22:48:10] (CR) RLazarus: [C:+2] shellbox-constraints: Bump replicas from 10 to 20 for traffic increase [deployment-charts] - https://gerrit.wikimedia.org/r/1178094 (owner: RLazarus)
[22:48:17] (CR) Dzahn: gerrit: add daemons ssh host key to known_hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: Hashar)
[22:49:44] (Merged) jenkins-bot: shellbox-constraints: Bump replicas from 10 to 20 for traffic increase [deployment-charts] - https://gerrit.wikimedia.org/r/1178094 (owner: RLazarus)
[22:50:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T400854)', diff saved to https://phabricator.wikimedia.org/P81164 and previous config saved to /var/cache/conftool/dbconfig/20250812-225001-ladsgroup.json
[22:51:39] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[22:51:57] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[22:52:30] SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11080722 (bd808)
[23:01:21] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[23:03:00] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1178073/6569/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1178073 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[23:04:32] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[23:05:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P81165 and previous config saved to /var/cache/conftool/dbconfig/20250812-230508-ladsgroup.json
[23:09:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:20:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P81166 and previous config saved to /var/cache/conftool/dbconfig/20250812-232016-ladsgroup.json
[23:21:00] (PS3) Dzahn: zuul::main: create /var/lib/zuul/.ssh/known_hosts [puppet] - https://gerrit.wikimedia.org/r/1178073 (https://phabricator.wikimedia.org/T395938)
[23:25:33] (CR) Dzahn: [C:+2] zuul::main: create /var/lib/zuul/.ssh/known_hosts [puppet] - https://gerrit.wikimedia.org/r/1178073 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[23:35:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T400854)', diff saved to https://phabricator.wikimedia.org/P81167 and previous config saved to /var/cache/conftool/dbconfig/20250812-233524-ladsgroup.json
[23:35:29] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[23:35:40] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[23:35:58] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[23:36:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T400854)', diff saved to https://phabricator.wikimedia.org/P81168 and previous config saved to /var/cache/conftool/dbconfig/20250812-233605-ladsgroup.json
[23:38:21] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1178103
[23:38:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1178103 (owner: TrainBranchBot)
[23:38:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T400854)', diff saved to https://phabricator.wikimedia.org/P81169 and previous config saved to /var/cache/conftool/dbconfig/20250812-233843-ladsgroup.json
[23:45:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.121) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[23:52:58] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1178103 (owner: TrainBranchBot)
[23:53:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P81170 and previous config saved to /var/cache/conftool/dbconfig/20250812-235351-ladsgroup.json