[00:02:14] <wikibugs>	 (CR) Dzahn: [C:+2] lists: add NEL headers to apache [puppet] - https://gerrit.wikimedia.org/r/1177455 (https://phabricator.wikimedia.org/T303725) (owner: Dzahn)
[00:08:02] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1177526
[00:08:02] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1177526 (owner: TrainBranchBot)
[00:09:08] <wikibugs>	 (CR) BCornwall: [C:+1] create th.wikimedia.org for Wikimedia Thailand [dns] - https://gerrit.wikimedia.org/r/1177522 (https://phabricator.wikimedia.org/T400001) (owner: Dzahn)
[00:10:00] <wikibugs>	 (CR) Dzahn: [C:+2] "this file is not actually used.. wtf" [puppet] - https://gerrit.wikimedia.org/r/1177455 (https://phabricator.wikimedia.org/T303725) (owner: Dzahn)
[00:10:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P81002 and previous config saved to /var/cache/conftool/dbconfig/20250812-001022-ladsgroup.json
[00:10:30] <wikibugs>	 (PS1) Dzahn: Revert "lists: add NEL headers to apache" [puppet] - https://gerrit.wikimedia.org/r/1177527
[00:10:44] <wikibugs>	 (CR) Dzahn: [C:+2] create th.wikimedia.org for Wikimedia Thailand [dns] - https://gerrit.wikimedia.org/r/1177522 (https://phabricator.wikimedia.org/T400001) (owner: Dzahn)
[00:11:02] <logmsgbot>	 !log dzahn@dns1004 START - running authdns-update
[00:11:43] <wikibugs>	 (CR) Dzahn: [C:+2] Revert "lists: add NEL headers to apache" [puppet] - https://gerrit.wikimedia.org/r/1177527 (owner: Dzahn)
[00:12:03] <logmsgbot>	 !log dzahn@dns1004 END - running authdns-update
[00:15:48] <wikibugs>	 (CR) Dzahn: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[00:17:24] <wikibugs>	 (PS5) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[00:17:29] <wikibugs>	 (CR) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[00:25:31] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T400854)', diff saved to https://phabricator.wikimedia.org/P81003 and previous config saved to /var/cache/conftool/dbconfig/20250812-002530-ladsgroup.json
[00:25:34] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[00:25:46] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[00:25:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T400854)', diff saved to https://phabricator.wikimedia.org/P81004 and previous config saved to /var/cache/conftool/dbconfig/20250812-002553-ladsgroup.json
[00:28:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T400854)', diff saved to https://phabricator.wikimedia.org/P81005 and previous config saved to /var/cache/conftool/dbconfig/20250812-002841-ladsgroup.json
[00:31:07] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1177526 (owner: TrainBranchBot)
[00:43:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P81006 and previous config saved to /var/cache/conftool/dbconfig/20250812-004349-ladsgroup.json
[00:45:10] <wikibugs>	 SRE, Traffic-Icebox: Consider using vmod_var instead of temporary headers in VCL - https://phabricator.wikimedia.org/T198620#11076327 (BCornwall) Open→Invalid This work has actually been ongoing and is tracked in T373550. Closing as a duplicate.
[00:45:23] <wikibugs>	 SRE, Traffic-Icebox: Consider using vmod_var instead of temporary headers in VCL - https://phabricator.wikimedia.org/T198620#11076332 (BCornwall) →Duplicate dup:T373550
[00:50:44] <wikibugs>	 (CR) RLazarus: [C:+1] mediawiki: Completely remove the purge parser cache cron [puppet] - https://gerrit.wikimedia.org/r/1175165 (https://phabricator.wikimedia.org/T398806) (owner: Ladsgroup)
[00:58:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P81007 and previous config saved to /var/cache/conftool/dbconfig/20250812-005856-ladsgroup.json
[01:04:23] <wikibugs>	 (CR) Dzahn: [C:+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[01:07:51] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.45.0-wmf.14 [core] (wmf/1.45.0-wmf.14) - https://gerrit.wikimedia.org/r/1177545 (https://phabricator.wikimedia.org/T396375)
[01:07:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.45.0-wmf.14 [core] (wmf/1.45.0-wmf.14) - https://gerrit.wikimedia.org/r/1177545 (https://phabricator.wikimedia.org/T396375) (owner: TrainBranchBot)
[01:10:22] <wikibugs>	 (CR) Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (7 comments) [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[01:10:58] <wikibugs>	 (CR) Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[01:11:08] <icinga-wm>	 RECOVERY - Disk space on deploy1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops
[01:14:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T400854)', diff saved to https://phabricator.wikimedia.org/P81008 and previous config saved to /var/cache/conftool/dbconfig/20250812-011403-ladsgroup.json
[01:14:08] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[01:14:20] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[01:14:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T400854)', diff saved to https://phabricator.wikimedia.org/P81009 and previous config saved to /var/cache/conftool/dbconfig/20250812-011427-ladsgroup.json
[01:18:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T400854)', diff saved to https://phabricator.wikimedia.org/P81010 and previous config saved to /var/cache/conftool/dbconfig/20250812-011817-ladsgroup.json
[01:19:55] <wikibugs>	 (CR) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (7 comments) [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[01:20:07] <wikibugs>	 (CR) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[01:21:36] <wikibugs>	 (PS6) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[01:21:44] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.45.0-wmf.14 [core] (wmf/1.45.0-wmf.14) - https://gerrit.wikimedia.org/r/1177545 (https://phabricator.wikimedia.org/T396375) (owner: TrainBranchBot)
[01:22:12] <wikibugs>	 (CR) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[01:22:13] <wikibugs>	 (PS1) Samwilson: InitialiseSettings-labs.php: Fix typo in wgWikisourceEnableBulkOCR [mediawiki-config] - https://gerrit.wikimedia.org/r/1177548 (https://phabricator.wikimedia.org/T400281)
[01:24:44] <wikibugs>	 (Abandoned) Samwilson: InitialiseSettings-labs.php: Fix typo in wgWikisourceEnableBulkOCR [mediawiki-config] - https://gerrit.wikimedia.org/r/1177548 (https://phabricator.wikimedia.org/T400281) (owner: Samwilson)
[01:26:34] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1044.eqiad.wmnet with OS bookworm
[01:26:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11076380 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1044.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[01:33:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P81011 and previous config saved to /var/cache/conftool/dbconfig/20250812-013325-ladsgroup.json
[01:48:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P81012 and previous config saved to /var/cache/conftool/dbconfig/20250812-014833-ladsgroup.json
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T0200)
[02:03:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T400854)', diff saved to https://phabricator.wikimedia.org/P81013 and previous config saved to /var/cache/conftool/dbconfig/20250812-020341-ladsgroup.json
[02:03:45] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[02:03:56] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[02:04:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T400854)', diff saved to https://phabricator.wikimedia.org/P81014 and previous config saved to /var/cache/conftool/dbconfig/20250812-020403-ladsgroup.json
[02:05:27] <wikibugs>	 (CR) Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[02:06:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T400854)', diff saved to https://phabricator.wikimedia.org/P81015 and previous config saved to /var/cache/conftool/dbconfig/20250812-020653-ladsgroup.json
[02:09:12] <wikibugs>	 (PS7) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[02:12:33] <wikibugs>	 (PS1) BCornwall: ncmonitor: Ignore cricketinfo sites [puppet] - https://gerrit.wikimedia.org/r/1177557
[02:12:58] <wikibugs>	 (CR) CI reject: [V:-1] ncmonitor: Ignore cricketinfo sites [puppet] - https://gerrit.wikimedia.org/r/1177557 (owner: BCornwall)
[02:13:46] <wikibugs>	 (PS2) BCornwall: ncmonitor: Ignore cricketinfo sites [puppet] - https://gerrit.wikimedia.org/r/1177557
[02:15:19] <wikibugs>	 (CR) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1177470 (owner: Ncmonitor)
[02:22:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P81016 and previous config saved to /var/cache/conftool/dbconfig/20250812-022201-ladsgroup.json
[02:27:05] <wikibugs>	 (CR) Pppery: [C:+1] ncmonitor: Ignore cricketinfo sites [puppet] - https://gerrit.wikimedia.org/r/1177557 (owner: BCornwall)
[02:36:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[02:37:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P81017 and previous config saved to /var/cache/conftool/dbconfig/20250812-023709-ladsgroup.json
[02:52:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T400854)', diff saved to https://phabricator.wikimedia.org/P81018 and previous config saved to /var/cache/conftool/dbconfig/20250812-025216-ladsgroup.json
[02:52:21] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[02:52:32] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[02:52:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T400854)', diff saved to https://phabricator.wikimedia.org/P81019 and previous config saved to /var/cache/conftool/dbconfig/20250812-025239-ladsgroup.json
[02:54:32] <wikibugs>	 SRE, Traffic, User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11076470 (Midleading) Please be more clear about the UA policy enforced here. I am always setting the `Api-User-Agent` header in my code with my Wikipedia username inside, but...
[02:55:22] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T400854)', diff saved to https://phabricator.wikimedia.org/P81020 and previous config saved to /var/cache/conftool/dbconfig/20250812-025522-ladsgroup.json
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T0300)
[03:01:21] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:04:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:09:32] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:10:30] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P81021 and previous config saved to /var/cache/conftool/dbconfig/20250812-031029-ladsgroup.json
[03:17:26] <wikibugs>	 SRE-SLO, SRE Observability, Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11076476 (RLazarus) Thanks @ecarg! I should be able to help with this. A couple of questions, each of them hopeful...
[03:25:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P81022 and previous config saved to /var/cache/conftool/dbconfig/20250812-032537-ladsgroup.json
[03:40:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T400854)', diff saved to https://phabricator.wikimedia.org/P81023 and previous config saved to /var/cache/conftool/dbconfig/20250812-034045-ladsgroup.json
[03:40:49] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[03:41:00] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[03:41:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T400854)', diff saved to https://phabricator.wikimedia.org/P81024 and previous config saved to /var/cache/conftool/dbconfig/20250812-034107-ladsgroup.json
[03:43:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T400854)', diff saved to https://phabricator.wikimedia.org/P81025 and previous config saved to /var/cache/conftool/dbconfig/20250812-034353-ladsgroup.json
[03:59:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P81026 and previous config saved to /var/cache/conftool/dbconfig/20250812-035900-ladsgroup.json
[04:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T0400)
[04:04:24] <logmsgbot>	 !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.11 (duration: 04m 19s)
[04:09:51] <wikibugs>	 ops-codfw, DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401651 (phaultfinder) NEW
[04:14:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P81027 and previous config saved to /var/cache/conftool/dbconfig/20250812-041408-ladsgroup.json
[04:29:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T400854)', diff saved to https://phabricator.wikimedia.org/P81028 and previous config saved to /var/cache/conftool/dbconfig/20250812-042915-ladsgroup.json
[04:29:20] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[04:29:31] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[04:29:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T400854)', diff saved to https://phabricator.wikimedia.org/P81029 and previous config saved to /var/cache/conftool/dbconfig/20250812-042937-ladsgroup.json
[04:32:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T400854)', diff saved to https://phabricator.wikimedia.org/P81030 and previous config saved to /var/cache/conftool/dbconfig/20250812-043212-ladsgroup.json
[04:47:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P81031 and previous config saved to /var/cache/conftool/dbconfig/20250812-044719-ladsgroup.json
[05:02:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P81032 and previous config saved to /var/cache/conftool/dbconfig/20250812-050227-ladsgroup.json
[05:08:10] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:17:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T400854)', diff saved to https://phabricator.wikimedia.org/P81033 and previous config saved to /var/cache/conftool/dbconfig/20250812-051735-ladsgroup.json
[05:17:39] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[05:17:50] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[05:17:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T400854)', diff saved to https://phabricator.wikimedia.org/P81034 and previous config saved to /var/cache/conftool/dbconfig/20250812-051757-ladsgroup.json
[05:20:37] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T400854)', diff saved to https://phabricator.wikimedia.org/P81035 and previous config saved to /var/cache/conftool/dbconfig/20250812-052037-ladsgroup.json
[05:35:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P81036 and previous config saved to /var/cache/conftool/dbconfig/20250812-053544-ladsgroup.json
[05:41:58] <icinga-wm>	 PROBLEM - Disk space on an-worker1145 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 155528 MB (4% inode=99%): /var/lib/hadoop/data/f 155831 MB (4% inode=99%): /var/lib/hadoop/data/m 153241 MB (4% inode=99%): /var/lib/hadoop/data/e 159793 MB (4% inode=99%): /var/lib/hadoop/data/c 151986 MB (4% inode=99%): /var/lib/hadoop/data/b 159424 MB (4% inode=99%): /var/lib/hadoop/data/l 154051 MB (4% inode=99%): /var/lib/hadoop/data
[05:41:58] <icinga-wm>	 9 MB (4% inode=99%): /var/lib/hadoop/data/g 150136 MB (3% inode=99%): /var/lib/hadoop/data/j 160361 MB (4% inode=99%): /var/lib/hadoop/data/d 159898 MB (4% inode=99%): /var/lib/hadoop/data/h 157962 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1145&var-datasource=eqiad+prometheus/ops
[05:50:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P81037 and previous config saved to /var/cache/conftool/dbconfig/20250812-055052-ladsgroup.json
[05:56:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T399249)', diff saved to https://phabricator.wikimedia.org/P81038 and previous config saved to /var/cache/conftool/dbconfig/20250812-055623-fceratto.json
[05:56:28] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T0600)
[06:06:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T400854)', diff saved to https://phabricator.wikimedia.org/P81039 and previous config saved to /var/cache/conftool/dbconfig/20250812-060559-ladsgroup.json
[06:06:04] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[06:06:16] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[06:06:58] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2152.codfw.wmnet with reason: Maintenance
[06:07:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T400854)', diff saved to https://phabricator.wikimedia.org/P81040 and previous config saved to /var/cache/conftool/dbconfig/20250812-060705-ladsgroup.json
[06:07:17] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1044.eqiad.wmnet with OS bookworm
[06:07:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11076699 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1044.eqiad.wmnet with OS bookworm
[06:08:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T400854)', diff saved to https://phabricator.wikimedia.org/P81041 and previous config saved to /var/cache/conftool/dbconfig/20250812-060857-ladsgroup.json
[06:11:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P81042 and previous config saved to /var/cache/conftool/dbconfig/20250812-061130-fceratto.json
[06:13:10] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:18:32] <wikibugs>	 (CR) Tchanders: [C:+1] Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: STran)
[06:22:01] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: STran)
[06:24:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P81043 and previous config saved to /var/cache/conftool/dbconfig/20250812-062405-ladsgroup.json
[06:24:56] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove ldap-admins from cloudweb hosts [puppet] - https://gerrit.wikimedia.org/r/1169633 (owner: Muehlenhoff)
[06:25:33] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] Bump the version numbers for Java images based on latest security releases [docker-images/production-images] - https://gerrit.wikimedia.org/r/1177424 (owner: Muehlenhoff)
[06:26:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P81044 and previous config saved to /var/cache/conftool/dbconfig/20250812-062638-fceratto.json
[06:32:19] <wikibugs>	 (CR) Muehlenhoff: [C:+2] No longer apply the eventlogging-admins access group to perf and deploy hosts [puppet] - https://gerrit.wikimedia.org/r/1170111 (https://phabricator.wikimedia.org/T238230) (owner: Muehlenhoff)
[06:32:42] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[06:32:49] <wikibugs>	 (PS3) Muehlenhoff: Stop applying maps-admins to maps Bookworm roles [puppet] - https://gerrit.wikimedia.org/r/1170124 (https://phabricator.wikimedia.org/T381565)
[06:33:45] <wikibugs>	 (PS14) Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762)
[06:36:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[06:37:40] <icinga-wm>	 RECOVERY - Disk space on an-worker1139 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1139&var-datasource=eqiad+prometheus/ops
[06:38:09] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff)
[06:38:29] <wikibugs>	 (PS15) Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762)
[06:39:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P81045 and previous config saved to /var/cache/conftool/dbconfig/20250812-063913-ladsgroup.json
[06:39:44] <logmsgbot>	 vriley@cumin1002 provision (PID 1498264) is awaiting input
[06:41:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T399249)', diff saved to https://phabricator.wikimedia.org/P81046 and previous config saved to /var/cache/conftool/dbconfig/20250812-064146-fceratto.json
[06:41:51] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[06:42:02] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2172.codfw.wmnet with reason: Maintenance
[06:42:07] <wikibugs>	 (CR) Slyngshede: [C:+1] Record LDAP access for thiemowmde [puppet] - https://gerrit.wikimedia.org/r/1177414 (https://phabricator.wikimedia.org/T400374) (owner: Muehlenhoff)
[06:42:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T399249)', diff saved to https://phabricator.wikimedia.org/P81047 and previous config saved to /var/cache/conftool/dbconfig/20250812-064209-fceratto.json
[06:42:39] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Record LDAP access for thiemowmde [puppet] - https://gerrit.wikimedia.org/r/1177414 (https://phabricator.wikimedia.org/T400374) (owner: Muehlenhoff)
[06:44:33] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: Muehlenhoff)
[06:46:45] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11076744 (QChris) Seeing this task getting moved on some boards ... I've signed the NDA some days ago.  @KFrancis Is there anything missing from me?
[06:46:56] <wikibugs>	 (PS1) Slyngshede: data.yaml: Add tracking entry for caro [puppet] - https://gerrit.wikimedia.org/r/1177757
[06:51:43] <logmsgbot>	 vriley@cumin1002 provision (PID 1498264) is awaiting input
[06:54:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T400854)', diff saved to https://phabricator.wikimedia.org/P81048 and previous config saved to /var/cache/conftool/dbconfig/20250812-065420-ladsgroup.json
[06:54:25] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[06:54:36] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2154.codfw.wmnet with reason: Maintenance
[06:54:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T400854)', diff saved to https://phabricator.wikimedia.org/P81049 and previous config saved to /var/cache/conftool/dbconfig/20250812-065443-ladsgroup.json
[06:56:52] <wikibugs>	 (03PS16) 10Muehlenhoff: Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762)
[06:57:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T400854)', diff saved to https://phabricator.wikimedia.org/P81050 and previous config saved to /var/cache/conftool/dbconfig/20250812-065726-ladsgroup.json
[06:57:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1177757 (owner: 10Slyngshede)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T0700). Please do the needful.
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:01:21] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:03:20] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff)
[07:04:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:06:16] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] data.yaml: Add tracking entry for caro [puppet] - 10https://gerrit.wikimedia.org/r/1177757 (owner: 10Slyngshede)
[07:09:32] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:12:34] <wikibugs>	 (03PS3) 10Slyngshede: data.yaml: add users as ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1175839 (https://phabricator.wikimedia.org/T400374)
[07:12:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P81051 and previous config saved to /var/cache/conftool/dbconfig/20250812-071234-ladsgroup.json
[07:13:06] <wikibugs>	 (03CR) 10Majavah: [C:03+1] "are you sure this is trixie-specific and not a general fix for T394304? either way, +1 for having this at least on trixie" [puppet] - 10https://gerrit.wikimedia.org/r/1177450 (https://phabricator.wikimedia.org/T401584) (owner: 10Andrew Bogott)
[07:14:47] <wikibugs>	 (03CR) 10Majavah: Add Trixie images (032 comments) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255) (owner: 10Majavah)
[07:16:07] <wikibugs>	 (03CR) 10Muehlenhoff: "ready for review, the only difference in PCC compared to the ERB is that the SPDX header no longer appears in the user-visible sshd config" [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff)
[07:17:08] <logmsgbot>	 !log hashar@deploy1003 Started deploy [integration/docroot@77c4765]: build: Updating mediawiki/mediawiki-phan-config to 0.17.0
[07:17:21] <logmsgbot>	 !log hashar@deploy1003 Finished deploy [integration/docroot@77c4765]: build: Updating mediawiki/mediawiki-phan-config to 0.17.0 (duration: 00m 13s)
[07:27:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P81052 and previous config saved to /var/cache/conftool/dbconfig/20250812-072742-ladsgroup.json
[07:29:20] <wikibugs>	 06SRE, 06Traffic, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11076853 (10TheDJ) @Midleading you are always supposed to have a user-agent. Api-user-agent is just for situations where you are unable to MODIFY that agent to provide additiona...
[07:30:32] <icinga-wm>	 RECOVERY - Disk space on an-worker1128 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1128&var-datasource=eqiad+prometheus/ops
[07:42:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T400854)', diff saved to https://phabricator.wikimedia.org/P81053 and previous config saved to /var/cache/conftool/dbconfig/20250812-074249-ladsgroup.json
[07:42:54] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[07:43:05] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2161.codfw.wmnet with reason: Maintenance
[07:43:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T400854)', diff saved to https://phabricator.wikimedia.org/P81054 and previous config saved to /var/cache/conftool/dbconfig/20250812-074312-ladsgroup.json
[07:45:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T400854)', diff saved to https://phabricator.wikimedia.org/P81055 and previous config saved to /var/cache/conftool/dbconfig/20250812-074556-ladsgroup.json
[07:49:09] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2141.codfw.wmnet with reason: Maintenance
[07:49:25] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2145.codfw.wmnet with reason: Maintenance
[07:49:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T391056)', diff saved to https://phabricator.wikimedia.org/P81056 and previous config saved to /var/cache/conftool/dbconfig/20250812-074932-fceratto.json
[07:49:36] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[07:50:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T391056)', diff saved to https://phabricator.wikimedia.org/P81057 and previous config saved to /var/cache/conftool/dbconfig/20250812-075041-fceratto.json
[07:58:20] <icinga-wm>	 RECOVERY - Host ms-fe2017 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms
[08:01:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P81058 and previous config saved to /var/cache/conftool/dbconfig/20250812-080104-ladsgroup.json
[08:05:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P81059 and previous config saved to /var/cache/conftool/dbconfig/20250812-080549-fceratto.json
[08:07:25] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Remove the 52 decommissioning hosts from the analytics_hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/1177353 (https://phabricator.wikimedia.org/T397172) (owner: 10Btullis)
[08:16:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P81060 and previous config saved to /var/cache/conftool/dbconfig/20250812-081611-ladsgroup.json
[08:17:40] <wikibugs>	 (03PS2) 10Ladsgroup: mediawiki: Completely remove the purge parser cache cron [puppet] - 10https://gerrit.wikimedia.org/r/1175165 (https://phabricator.wikimedia.org/T398806)
[08:17:46] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] mediawiki: Completely remove the purge parser cache cron [puppet] - 10https://gerrit.wikimedia.org/r/1175165 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup)
[08:20:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P81061 and previous config saved to /var/cache/conftool/dbconfig/20250812-082056-fceratto.json
[08:27:00] <suzannewoodWMDE2>	 Hi! We want to run a maintenance script to add wikidata support for a new language wikipedia. Let us know if this is a bad time, otherwise we will proceed (#wikidata-for-wikimedia-projects at WMDE) https://phabricator.wikimedia.org/T399789
[08:29:11] <wikibugs>	 (03PS1) 10MVernon: Prepare ms-fe20[17-20] for production use [puppet] - 10https://gerrit.wikimedia.org/r/1177940 (https://phabricator.wikimedia.org/T401225)
[08:31:15] <wikibugs>	 (03PS1) 10Vgutierrez: haproxy: Store client headers before normalization [puppet] - 10https://gerrit.wikimedia.org/r/1177941 (https://phabricator.wikimedia.org/T400270)
[08:31:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T400854)', diff saved to https://phabricator.wikimedia.org/P81062 and previous config saved to /var/cache/conftool/dbconfig/20250812-083119-ladsgroup.json
[08:31:23] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[08:31:34] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2163.codfw.wmnet with reason: Maintenance
[08:31:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T400854)', diff saved to https://phabricator.wikimedia.org/P81063 and previous config saved to /var/cache/conftool/dbconfig/20250812-083141-ladsgroup.json
[08:31:52] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11077061 (10MatthewVernon) Thanks!
[08:34:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T400854)', diff saved to https://phabricator.wikimedia.org/P81064 and previous config saved to /var/cache/conftool/dbconfig/20250812-083426-ladsgroup.json
[08:35:04] <logmsgbot>	 !log ladsgroup@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[08:35:26] <logmsgbot>	 !log ladsgroup@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[08:36:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T391056)', diff saved to https://phabricator.wikimedia.org/P81065 and previous config saved to /var/cache/conftool/dbconfig/20250812-083603-fceratto.json
[08:36:08] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[08:36:31] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2146.codfw.wmnet with reason: Maintenance
[08:36:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T391056)', diff saved to https://phabricator.wikimedia.org/P81066 and previous config saved to /var/cache/conftool/dbconfig/20250812-083637-fceratto.json
[08:37:47] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T391056)', diff saved to https://phabricator.wikimedia.org/P81067 and previous config saved to /var/cache/conftool/dbconfig/20250812-083746-fceratto.json
[08:38:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet
[08:41:08] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1177941 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez)
[08:41:54] <wikibugs>	 06SRE, 06serviceops-radar: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11077095 (10Zabe) p:05Unbreak!→03High Ok it went down a bit, but in [[ https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=deploy1003&var-datasource=000000026&var-cl...
[08:44:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet
[08:46:13] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] "I checked that the hostnames match the description." [puppet] - 10https://gerrit.wikimedia.org/r/1177940 (https://phabricator.wikimedia.org/T401225) (owner: 10MVernon)
[08:46:26] <suzannewoodWMDE2>	 !log suzannewood@deploy1003:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https
[08:46:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:34] <wikibugs>	 (03Abandoned) 10Federico Ceratto: zarcillo: allow egress to gerrit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170574 (https://phabricator.wikimedia.org/T389663) (owner: 10Federico Ceratto)
[08:48:46] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis rkiwiki, minwikibooks, zghwiktionary, madwikisource, tlwikisource in section s5
[08:49:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P81068 and previous config saved to /var/cache/conftool/dbconfig/20250812-084933-ladsgroup.json
[08:52:11] <wikibugs>	 (03PS2) 10Vgutierrez: haproxy: Store client headers before normalization [puppet] - 10https://gerrit.wikimedia.org/r/1177941 (https://phabricator.wikimedia.org/T400270)
[08:52:24] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Checking sanitization for wikis rkiwiki, minwikibooks, zghwiktionary, madwikisource, tlwikisource in section s5
[08:52:55] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P81069 and previous config saved to /var/cache/conftool/dbconfig/20250812-085254-fceratto.json
[08:53:57] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1177941 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez)
[08:54:32] <wikibugs>	 (03CR) 10Urbanecm: [C:03+1] "there is no objection for sufficient time, let's do this. LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174035 (https://phabricator.wikimedia.org/T400755) (owner: 10Zabe)
[08:54:43] <wikibugs>	 (03CR) 10Urbanecm: [C:03+1] "Agreed, added my LGTM. Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174035 (https://phabricator.wikimedia.org/T400755) (owner: 10Zabe)
[08:54:53] <urbanecm>	 jouncebot: nowandnext
[08:54:53] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 5 minute(s)
[08:54:53] <jouncebot>	 In 1 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1000)
[08:55:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174035 (https://phabricator.wikimedia.org/T400755) (owner: 10Zabe)
[08:55:23] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] haproxy: Store client headers before normalization [puppet] - 10https://gerrit.wikimedia.org/r/1177941 (https://phabricator.wikimedia.org/T400270) (owner: 10Vgutierrez)
[08:55:31] <wikibugs>	 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11077149 (10JAllemandou)
[08:56:11] <wikibugs>	 (03Merged) 10jenkins-bot: Remove centralauth-unmerge from stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1174035 (https://phabricator.wikimedia.org/T400755) (owner: 10Zabe)
[08:56:41] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1174035|Remove centralauth-unmerge from stewards (T400755)]]
[08:56:45] <stashbot>	 T400755: Remove the stewards ability to delete/unmerge global accounts - https://phabricator.wikimedia.org/T400755
[08:58:43] <claime>	 suzannewoodWMDE2: Hi, next time, could you run it using mwscript-k8s? The tool supports running on multiple wikis and runs on kubernetes. https://wikitech.wikimedia.org/wiki/Maintenance_scripts#Running_on_multiple_wikis_(the_safe_way) The "old" foreachwiki/mwscript wrappers WILL be deprecated completely soon.
[08:58:53] <logmsgbot>	 !log urbanecm@deploy1003 zabe, urbanecm: Backport for [[gerrit:1174035|Remove centralauth-unmerge from stewards (T400755)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:59:26] <claime>	 Please spread the word and/or update docs as well, as I've seen one of your colleagues from WMDE run a script with foreachwiki yesterday as well, but wasn't fast enough to tell them before they left the channel
[09:00:09] <logmsgbot>	 !log urbanecm@deploy1003 zabe, urbanecm: Continuing with sync
[09:01:27] <wikibugs>	 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#11077173 (10JAllemandou)
[09:02:32] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Add CommunityConfigurationExample to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173924 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm)
[09:03:29] <wikibugs>	 (03Merged) 10jenkins-bot: Add CommunityConfigurationExample to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173924 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm)
[09:04:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P81070 and previous config saved to /var/cache/conftool/dbconfig/20250812-090441-ladsgroup.json
[09:05:47] <logmsgbot>	 !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1174035|Remove centralauth-unmerge from stewards (T400755)]] (duration: 09m 05s)
[09:05:50] <stashbot>	 T400755: Remove the stewards ability to delete/unmerge global accounts - https://phabricator.wikimedia.org/T400755
[09:06:13] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1173924|Add CommunityConfigurationExample to extension-list (T372049)]]
[09:06:17] <stashbot>	 T372049: Enable CommunityConfiguration Example in one beta wiki - https://phabricator.wikimedia.org/T372049
[09:06:40] <joelyrookewmde>	 claime: thanks for flagging, I'll check out the docs and let my teammates know
[09:07:09] <claime>	 joelyrookewmde: thanks a bunch!
[09:08:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P81071 and previous config saved to /var/cache/conftool/dbconfig/20250812-090802-fceratto.json
[09:11:21] <wikibugs>	 (03PS1) 10Vgutierrez: Match all headers in HAProxy using a variable [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1177945
[09:16:43] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+2 C:03+2] Match all headers in HAProxy using a variable [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1177945 (owner: 10Vgutierrez)
[09:17:21] <suzannewoodWMDE2>	 @claime yes, thanks, we will update our documentation
[09:17:36] <zabe>	 !log manually insert 'SecurePoll' into zhwiki.content_models # T401641
[09:17:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:40] <stashbot>	 T401641: MediaWiki\Storage\NameTableAccessException: No insert possible but primary DB didn't give us a record for 'SecurePoll' in 'content_models' - https://phabricator.wikimedia.org/T401641
[09:17:40] <claime>	 tyvm!
[09:17:47] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "use a var to match all headers on haproxy - vgutierrez@cumin1002"
[09:17:48] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: use a var to match all headers on haproxy - vgutierrez@cumin1002
[09:18:21] <suzannewoodWMDE2>	 !log Finished populateSitesTable for 'zghwiktionary'  https://phabricator.wikimedia.org/T399789
[09:18:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:38] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: use a var to match all headers on haproxy - vgutierrez@cumin1002
[09:18:39] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "use a var to match all headers on haproxy - vgutierrez@cumin1002"
[09:19:34] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:04-1] "Per internal channels, we probably need to hide the onboarding dialog on wikis where temporary accounts would never be present." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[09:19:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T400854)', diff saved to https://phabricator.wikimedia.org/P81072 and previous config saved to /var/cache/conftool/dbconfig/20250812-091948-ladsgroup.json
[09:19:53] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[09:20:04] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2164.codfw.wmnet with reason: Maintenance
[09:20:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T400854)', diff saved to https://phabricator.wikimedia.org/P81073 and previous config saved to /var/cache/conftool/dbconfig/20250812-092011-ladsgroup.json
[09:22:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T400854)', diff saved to https://phabricator.wikimedia.org/P81074 and previous config saved to /var/cache/conftool/dbconfig/20250812-092255-ladsgroup.json
[09:23:11] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T391056)', diff saved to https://phabricator.wikimedia.org/P81075 and previous config saved to /var/cache/conftool/dbconfig/20250812-092310-fceratto.json
[09:23:15] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[09:23:27] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2153.codfw.wmnet with reason: Maintenance
[09:23:30] <suzannewoodWMDE2>	 We have now finished running scripts (#wikidata-for-wikimedia-projects at WMDE)
[09:23:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T391056)', diff saved to https://phabricator.wikimedia.org/P81076 and previous config saved to /var/cache/conftool/dbconfig/20250812-092334-fceratto.json
[09:24:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T391056)', diff saved to https://phabricator.wikimedia.org/P81077 and previous config saved to /var/cache/conftool/dbconfig/20250812-092443-fceratto.json
[09:25:02] <wikibugs>	 (03CR) 10MVernon: [C:03+2] Prepare ms-fe20[17-20] for production use [puppet] - 10https://gerrit.wikimedia.org/r/1177940 (https://phabricator.wikimedia.org/T401225) (owner: 10MVernon)
[09:25:48] <wikibugs>	 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11077261 (10Clement_Goubert) a:03Clement_Goubert
[09:28:21] <logmsgbot>	 !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-fe[2017-2020].codfw.wmnet with reason: reboot
[09:29:43] <wikibugs>	 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11077281 (10Clement_Goubert) The `/srv/docker/overlay2` is taking up 177GB because we keep 7 days of images, which is probably way overkill. I'll run a prune keeping the last 3 days and will update the relevan...
[09:31:43] <wikibugs>	 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11077286 (10Clement_Goubert) ` # /usr/bin/docker image prune --all --force --filter until=72h [...] Total reclaimed space: 102.2GB `
[09:33:41] <wikibugs>	 (03PS1) 10Brouberol: Release new version of the external-services-networkpolicy base template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177947 (https://phabricator.wikimedia.org/T395126)
[09:33:43] <wikibugs>	 (03PS1) 10Brouberol: modules/external-services: support custom annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177948 (https://phabricator.wikimedia.org/T395126)
[09:33:44] <wikibugs>	 (03PS1) 10Brouberol: datahub: update vendored modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177949 (https://phabricator.wikimedia.org/T395126)
[09:33:47] <wikibugs>	 (03PS1) 10Brouberol: datahub: deploy the external services networkpolicies before the setup jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177950 (https://phabricator.wikimedia.org/T395126)
[09:34:08] <wikibugs>	 (03PS1) 10Clément Goubert: deployment_server: Prune old images every 3 days [puppet] - 10https://gerrit.wikimedia.org/r/1177951 (https://phabricator.wikimedia.org/T401647)
[09:34:28] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 7.661 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:34:28] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 8.167 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:34:28] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 8.659 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:35:04] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.302 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:36:10] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:04-1] "I think we can solve this by setting `$wgDefaultUserOptions['checkuser-temporary-accounts-onboarding-dialog-seen'] = true;` for the wikis " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[09:38:03] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P81078 and previous config saved to /var/cache/conftool/dbconfig/20250812-093803-ladsgroup.json
[09:38:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[09:39:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P81079 and previous config saved to /var/cache/conftool/dbconfig/20250812-093951-fceratto.json
[09:42:08] <wikibugs>	 (03PS1) 10Btullis: Re-add the decommisioned hadoop worker hosts to excluded_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1177955 (https://phabricator.wikimedia.org/T397160)
[09:43:27] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Re-add the decommisioned hadoop worker hosts to excluded_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1177955 (https://phabricator.wikimedia.org/T397160) (owner: 10Btullis)
[09:43:36] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Re-add the decommisioned hadoop worker hosts to excluded_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1177955 (https://phabricator.wikimedia.org/T397160) (owner: 10Btullis)
[09:44:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[09:47:59] <urbanecm>	 uhh... `The connection to the server kubemaster.svc.codfw.wmnet:6443 was refused - did you specify the right host or port`, that doesn't seem right
[09:49:03] <claime>	 Huh
[09:49:16] <claime>	 urbanecm: scap saying that?
[09:49:23] <urbanecm>	 correct
[09:49:32] <urbanecm>	 weirdly enough, it did not abort the deployment
[09:49:35] <logmsgbot>	 !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1173924|Add CommunityConfigurationExample to extension-list (T372049)]] (duration: 43m 21s)
[09:49:39] <stashbot>	 T372049: Enable CommunityConfiguration Example in one beta wiki - https://phabricator.wikimedia.org/T372049
[09:49:53] <wikibugs>	 (03PS1) 10Hashar: admin: allow systemctl status for MediaWiki train [puppet] - 10https://gerrit.wikimedia.org/r/1177958
[09:50:39] <urbanecm>	 claime: https://phabricator.wikimedia.org/P81080 are the logs
[09:51:30] <urbanecm>	 let me know if i should re-deploy my patch or do something else
[09:51:40] <wikibugs>	 (03PS6) 10STran: Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672)
[09:52:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:53:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P81081 and previous config saved to /var/cache/conftool/dbconfig/20250812-095310-ladsgroup.json
[09:54:15] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11077334 (10MoritzMuehlenhoff)
[09:54:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P81082 and previous config saved to /var/cache/conftool/dbconfig/20250812-095458-fceratto.json
[09:55:30] <hashar>	 !log systemctl start pretrain # T396375
[09:55:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:34] <stashbot>	 T396375: 1.45.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T396375
[09:55:40] <hashar>	 Aborting: git is not clean: /srv/patches
[09:55:41] <hashar>	 of course
[09:55:52] <wikibugs>	 (03CR) 10STran: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[09:56:04] <claime>	 urbanecm: looks like it may just be the progress lookup failing
[09:56:05] <urbanecm>	 hashar: do note my scap got very confused midway
[09:56:10] <urbanecm>	 so i have no idea if my patch is cleanly applied
[09:56:22] <hashar>	 yeah I am not entirely sure what is going on
[09:56:23] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-fe[2017-2020].codfw.wmnet
[09:56:23] <claime>	 It should be, checking deployments
[09:56:26] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-fe[2017-2020].codfw.wmnet
[09:56:34] <hashar>	 cause `/srv/patches` is clean as far as I can tell
[09:57:09] <hashar>	 I miss the time when we did an scp to an NFS share and had instant deployment/outage :b
[09:57:11] <claime>	 urbanecm: no diff in codfw deployments
[09:57:16] <claime>	 So I think it's good
[09:57:21] <wikibugs>	 (03PS1) 10Muehlenhoff: Add maps201[1-4] to site.pp and preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1177959 (https://phabricator.wikimedia.org/T400637)
[09:57:23] <urbanecm>	 okay, sounds good then. ty for checking!
[09:57:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:58:01] <claime>	 urbanecm: I think what may have happened is the codfw kube apiserver restarted during scap
[09:58:14] <claime>	 I'll check the wikikube-ctrl nodes
[09:59:02] <claime>	 yeah that's exactly that
[09:59:03] <logmsgbot>	 !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1000)
[10:01:20] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
[10:01:33] <hashar>	 !log systemctl start pretrain # T396375
[10:01:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:38] <stashbot>	 T396375: 1.45.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T396375
[10:01:46] <hashar>	 that one usually happens overnight
[10:02:11] <hashar>	 that is to rebuild all images from scratch which takes a while
[10:02:28] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:04:43] <claime>	 urbanecm: certificates for the apiservers got updated, so they restarted
[10:05:32] <claime>	 I think the failed calls all failed at the same time because they were hitting the same apiserver, and then it actually hit one that was ok
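Note: the failure mode claime describes (calls pinned to one restarting apiserver fail together, then succeed once a healthy one is reached) can be illustrated with a small failover loop. This is only a sketch under stated assumptions, not how scap or the Kubernetes client actually behaves: `call_with_failover`, `fake_lookup`, and the endpoint names `apiserver-a` and `apiserver-b` are hypothetical and exist purely for illustration.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(
    endpoints: Sequence[str],
    call: Callable[[str], T],
    max_attempts: int = 6,
) -> T:
    """Try `call` against the endpoints in round-robin order until one succeeds."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        # Rotate through the endpoints instead of retrying the same one.
        endpoint = endpoints[attempt % len(endpoints)]
        try:
            return call(endpoint)
        except ConnectionError as exc:
            # Remember the failure and move on to the next endpoint.
            last_error = exc
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error


def fake_lookup(endpoint: str) -> str:
    # Hypothetical stand-in for a progress lookup: one server is mid-restart.
    if endpoint == "apiserver-a":
        raise ConnectionError(f"{endpoint} is restarting")
    return f"ok from {endpoint}"


if __name__ == "__main__":
    print(call_with_failover(["apiserver-a", "apiserver-b"], fake_lookup))
```

A caller that keeps retrying only `apiserver-a` would fail on every attempt, whereas rotating reaches `apiserver-b` on the second try.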
[10:06:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add maps201[1-4] to site.pp and preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1177959 (https://phabricator.wikimedia.org/T400637) (owner: 10Muehlenhoff)
[10:07:29] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
[10:08:03] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1017.eqiad.wmnet
[10:08:04] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1017.eqiad.wmnet
[10:08:05] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1017.eqiad.wmnet
[10:08:05] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1017.eqiad.wmnet
[10:08:06] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1018.eqiad.wmnet
[10:08:07] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1018.eqiad.wmnet
[10:08:07] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1018.eqiad.wmnet
[10:08:08] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1018.eqiad.wmnet
[10:08:09] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1019.eqiad.wmnet
[10:08:09] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1019.eqiad.wmnet
[10:08:10] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1019.eqiad.wmnet
[10:08:11] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1019.eqiad.wmnet
[10:08:12] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe1020.eqiad.wmnet
[10:08:12] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1020.eqiad.wmnet
[10:08:13] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe1020.eqiad.wmnet
[10:08:14] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1020.eqiad.wmnet
[10:08:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T400854)', diff saved to https://phabricator.wikimedia.org/P81083 and previous config saved to /var/cache/conftool/dbconfig/20250812-100817-ladsgroup.json
[10:08:31] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[10:08:33] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2166.codfw.wmnet with reason: Maintenance
[10:08:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T400854)', diff saved to https://phabricator.wikimedia.org/P81084 and previous config saved to /var/cache/conftool/dbconfig/20250812-100840-ladsgroup.json
[10:10:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T391056)', diff saved to https://phabricator.wikimedia.org/P81085 and previous config saved to /var/cache/conftool/dbconfig/20250812-101006-fceratto.json
[10:10:11] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[10:10:22] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2170.codfw.wmnet with reason: Maintenance
[10:10:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T391056)', diff saved to https://phabricator.wikimedia.org/P81086 and previous config saved to /var/cache/conftool/dbconfig/20250812-101029-fceratto.json
[10:11:15] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 12m 12s)
[10:11:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T400854)', diff saved to https://phabricator.wikimedia.org/P81087 and previous config saved to /var/cache/conftool/dbconfig/20250812-101123-ladsgroup.json
[10:12:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T391056)', diff saved to https://phabricator.wikimedia.org/P81088 and previous config saved to /var/cache/conftool/dbconfig/20250812-101238-fceratto.json
[10:15:31] <wikibugs>	 (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1177365 (owner: 10L10n-bot)
[10:18:54] <hashar>	 !log systemctl start train-presync # T396375
[10:18:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:58] <stashbot>	 T396375: 1.45.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T396375
[10:19:40] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177965 (https://phabricator.wikimedia.org/T396375)
[10:19:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177965 (https://phabricator.wikimedia.org/T396375) (owner: 10TrainBranchBot)
[10:20:39] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177965 (https://phabricator.wikimedia.org/T396375) (owner: 10TrainBranchBot)
[10:21:15] <logmsgbot>	 !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.14  refs T396375
[10:26:31] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P81089 and previous config saved to /var/cache/conftool/dbconfig/20250812-102631-ladsgroup.json
[10:27:42] <hashar>	 it is train-presync that rebuilds all images from scratch :b
[10:27:46] <hashar>	 I got confused earlier
[10:27:47] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P81090 and previous config saved to /var/cache/conftool/dbconfig/20250812-102746-fceratto.json
[10:27:54] <hashar>	 I am going to have lunch while it is going on
[10:28:51] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11077422 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None >>! In T400637#11075140, @wiki_willy wrote: > Hi @MoritzMuehlenhoff - are you able to help confirm...
[10:29:10] <wikibugs>	 (03PS1) 10Hnowlan: rest-gateway: use simplified list of rest.php APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177966 (https://phabricator.wikimedia.org/T400132)
[10:36:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11077435 (10MoritzMuehlenhoff)
[10:36:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[10:36:39] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] "lgtm, one nit/hmm" [puppet] - 10https://gerrit.wikimedia.org/r/1177951 (https://phabricator.wikimedia.org/T401647) (owner: 10Clément Goubert)
[10:37:33] <wikibugs>	 (03PS2) 10Clément Goubert: deployment_server: Prune old images every 3 days [puppet] - 10https://gerrit.wikimedia.org/r/1177951 (https://phabricator.wikimedia.org/T401647)
[10:37:35] <wikibugs>	 (03CR) 10Clément Goubert: deployment_server: Prune old images every 3 days (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177951 (https://phabricator.wikimedia.org/T401647) (owner: 10Clément Goubert)
[10:40:32] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] deployment_server: Prune old images every 3 days [puppet] - 10https://gerrit.wikimedia.org/r/1177951 (https://phabricator.wikimedia.org/T401647) (owner: 10Clément Goubert)
[10:41:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P81091 and previous config saved to /var/cache/conftool/dbconfig/20250812-104138-ladsgroup.json
[10:42:17] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11077453 (10Clement_Goubert) 05Open→03Resolved Old images will now be pruned every 3 days, and disk space is at manageable levels
[10:42:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P81092 and previous config saved to /var/cache/conftool/dbconfig/20250812-104254-fceratto.json
[10:46:43] <wikibugs>	 (03PS1) 10Muehlenhoff: Add maps101[1-4] to site.pp and preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1177972 (https://phabricator.wikimedia.org/T400638)
[10:46:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add maps101[1-4] to site.pp and preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1177972 (https://phabricator.wikimedia.org/T400638) (owner: 10Muehlenhoff)
[10:49:17] <wikibugs>	 (03PS2) 10Muehlenhoff: Add maps101[1-4] to site.pp and preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1177972 (https://phabricator.wikimedia.org/T400638)
[10:52:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add maps101[1-4] to site.pp and preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1177972 (https://phabricator.wikimedia.org/T400638) (owner: 10Muehlenhoff)
[10:55:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11077507 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None >>! In T400638#11075133, @wiki_willy wrote: > Hi @MoritzMuehlenhoff - are you able to confirm the...
[10:56:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T400854)', diff saved to https://phabricator.wikimedia.org/P81093 and previous config saved to /var/cache/conftool/dbconfig/20250812-105646-ladsgroup.json
[10:56:50] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2167.codfw.wmnet with reason: Maintenance
[10:56:52] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[10:56:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T400854)', diff saved to https://phabricator.wikimedia.org/P81094 and previous config saved to /var/cache/conftool/dbconfig/20250812-105657-ladsgroup.json
[10:58:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T391056)', diff saved to https://phabricator.wikimedia.org/P81095 and previous config saved to /var/cache/conftool/dbconfig/20250812-105801-fceratto.json
[10:58:06] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[10:58:17] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2173.codfw.wmnet with reason: Maintenance
[10:58:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T391056)', diff saved to https://phabricator.wikimedia.org/P81096 and previous config saved to /var/cache/conftool/dbconfig/20250812-105824-fceratto.json
[10:59:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T391056)', diff saved to https://phabricator.wikimedia.org/P81097 and previous config saved to /var/cache/conftool/dbconfig/20250812-105933-fceratto.json
[10:59:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T400854)', diff saved to https://phabricator.wikimedia.org/P81098 and previous config saved to /var/cache/conftool/dbconfig/20250812-105941-ladsgroup.json
[11:01:21] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[11:04:20] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.14  refs T396375 (duration: 43m 06s)
[11:04:25] <stashbot>	 T396375: 1.45.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T396375
[11:04:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[11:05:52] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: haproxy: create the x-provenance map even when not using etcd [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621)
[11:06:02] <wikibugs>	 (03PS2) 10Clément Goubert: rest-gateway: use simplified list of rest.php APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177966 (https://phabricator.wikimedia.org/T400132) (owner: 10Hnowlan)
[11:06:47] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6557/console" [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto)
[11:08:03] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6558/console" [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto)
[11:09:32] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:10:58] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] haproxy: create the x-provenance map even when not using etcd [puppet] - 10https://gerrit.wikimedia.org/r/1175989 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto)
[11:14:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P81099 and previous config saved to /var/cache/conftool/dbconfig/20250812-111440-fceratto.json
[11:14:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P81100 and previous config saved to /var/cache/conftool/dbconfig/20250812-111449-ladsgroup.json
[11:15:21] <wikibugs>	 06SRE, 06SRE Observability: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://phabricator.wikimedia.org/T401671 (10MoritzMuehlenhoff) 03NEW
[11:15:24] <wikibugs>	 (03PS2) 10Majavah: P:wmcs::instance: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177404 (https://phabricator.wikimedia.org/T401586)
[11:18:56] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:wmcs::instance: Replace legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177404 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah)
[11:26:30] <wikibugs>	 (03PS2) 10Urbanecm: [beta] Add wmgUseCommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173925 (https://phabricator.wikimedia.org/T372049)
[11:26:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173925 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm)
[11:27:40] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] Add wmgUseCommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173925 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm)
[11:28:31] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] [beta] cswiki: Enable CommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173926 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm)
[11:28:36] <wikibugs>	 (03PS2) 10Urbanecm: [beta] cswiki: Enable CommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173926 (https://phabricator.wikimedia.org/T372049)
[11:28:39] <wikibugs>	 (03CR) 10Urbanecm: [beta] cswiki: Enable CommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173926 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm)
[11:28:41] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] [beta] cswiki: Enable CommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173926 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm)
[11:29:23] <moritzm>	 !log installing gnutls security updates on Bookworm
[11:29:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:40] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] cswiki: Enable CommunityConfigurationExample [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173926 (https://phabricator.wikimedia.org/T372049) (owner: 10Urbanecm)
[11:29:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P81101 and previous config saved to /var/cache/conftool/dbconfig/20250812-112948-fceratto.json
[11:29:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P81102 and previous config saved to /var/cache/conftool/dbconfig/20250812-112956-ladsgroup.json
[11:44:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T391056)', diff saved to https://phabricator.wikimedia.org/P81103 and previous config saved to /var/cache/conftool/dbconfig/20250812-114455-fceratto.json
[11:45:00] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2174.codfw.wmnet with reason: Maintenance
[11:45:01] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[11:45:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T400854)', diff saved to https://phabricator.wikimedia.org/P81104 and previous config saved to /var/cache/conftool/dbconfig/20250812-114504-ladsgroup.json
[11:45:09] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[11:45:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T391056)', diff saved to https://phabricator.wikimedia.org/P81105 and previous config saved to /var/cache/conftool/dbconfig/20250812-114514-fceratto.json
[11:45:15] <wikibugs>	 (03PS3) 10Amire80: Make functionally identical descriptions the same [dumps] - 10https://gerrit.wikimedia.org/r/1149464
[11:45:20] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Make functionally identical descriptions the same [dumps] - 10https://gerrit.wikimedia.org/r/1149464 (owner: 10Amire80)
[11:45:20] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2181.codfw.wmnet with reason: Maintenance
[11:45:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T400854)', diff saved to https://phabricator.wikimedia.org/P81106 and previous config saved to /var/cache/conftool/dbconfig/20250812-114527-ladsgroup.json
[11:45:48] <wikibugs>	 (03Merged) 10jenkins-bot: Make functionally identical descriptions the same [dumps] - 10https://gerrit.wikimedia.org/r/1149464 (owner: 10Amire80)
[11:46:24] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T391056)', diff saved to https://phabricator.wikimedia.org/P81107 and previous config saved to /var/cache/conftool/dbconfig/20250812-114623-fceratto.json
[11:46:43] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[11:48:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T400854)', diff saved to https://phabricator.wikimedia.org/P81108 and previous config saved to /var/cache/conftool/dbconfig/20250812-114812-ladsgroup.json
[11:50:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet
[11:55:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet
[11:59:51] <wikibugs>	 (03PS1) 10Aklapper: Further reduce AVA debug output [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1177976
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1200)
[12:00:10] <wikibugs>	 (03CR) 10Aklapper: [V:03+2 C:03+2] Further reduce AVA debug output [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1177976 (owner: 10Aklapper)
[12:00:40] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP access for novemlinguae [puppet] - 10https://gerrit.wikimedia.org/r/1177977 (https://phabricator.wikimedia.org/T400176)
[12:01:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P81109 and previous config saved to /var/cache/conftool/dbconfig/20250812-120131-fceratto.json
[12:03:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P81110 and previous config saved to /var/cache/conftool/dbconfig/20250812-120319-ladsgroup.json
[12:03:28] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11077664 (10MoritzMuehlenhoff) >>! In T400176#11075795, @Novem_Linguae wrote: > Thanks! I just tried to log into a couple of NDA tools such as Superset and Icinga an...
[12:04:39] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1044.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:05:10] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1044.eqiad.wmnet with OS bookworm
[12:05:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11077671 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1044.eqiad.wmnet with OS bookworm
[12:07:38] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] "I wanted to deploy the legacy system but there were many unrelated changes so I avoided it" [dumps] - 10https://gerrit.wikimedia.org/r/1149464 (owner: 10Amire80)
[12:07:40] <icinga-wm>	 PROBLEM - Disk space on an-worker1120 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 140872 MB (3% inode=99%): /var/lib/hadoop/data/m 167264 MB (4% inode=99%): /var/lib/hadoop/data/d 154477 MB (4% inode=99%): /var/lib/hadoop/data/b 145991 MB (3% inode=99%): /var/lib/hadoop/data/e 155497 MB (4% inode=99%): /var/lib/hadoop/data/g 148175 MB (3% inode=99%): /var/lib/hadoop/data/f 145370 MB (3% inode=99%): /var/lib/hadoop/data
[12:07:40] <icinga-wm>	 0 MB (4% inode=99%): /var/lib/hadoop/data/i 160191 MB (4% inode=99%): /var/lib/hadoop/data/j 157319 MB (4% inode=99%): /var/lib/hadoop/data/l 162026 MB (4% inode=99%): /var/lib/hadoop/data/c 169422 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1120&var-datasource=eqiad+prometheus/ops
[12:16:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P81111 and previous config saved to /var/cache/conftool/dbconfig/20250812-121638-fceratto.json
[12:18:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P81112 and previous config saved to /var/cache/conftool/dbconfig/20250812-121827-ladsgroup.json
[12:19:56] <tappof>	 zaway
[12:25:43] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: updating reporting thresholds of PDUs in codfw - https://phabricator.wikimedia.org/T401634#11077788 (10Peachey88)
[12:27:07] <wikibugs>	 (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175152 (owner: 10PipelineBot)
[12:29:23] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1175152 (owner: 10PipelineBot)
[12:30:37] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-worker[1096-1099].eqiad.wmnet
[12:30:45] <logmsgbot>	 !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[12:31:08] <logmsgbot>	 !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[12:31:23] <logmsgbot>	 !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[12:31:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T391056)', diff saved to https://phabricator.wikimedia.org/P81113 and previous config saved to /var/cache/conftool/dbconfig/20250812-123145-fceratto.json
[12:31:50] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[12:31:50] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2176.codfw.wmnet with reason: Maintenance
[12:31:52] <logmsgbot>	 !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[12:31:58] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T391056)', diff saved to https://phabricator.wikimedia.org/P81114 and previous config saved to /var/cache/conftool/dbconfig/20250812-123157-fceratto.json
[12:33:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T391056)', diff saved to https://phabricator.wikimedia.org/P81115 and previous config saved to /var/cache/conftool/dbconfig/20250812-123306-fceratto.json
[12:33:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T400854)', diff saved to https://phabricator.wikimedia.org/P81116 and previous config saved to /var/cache/conftool/dbconfig/20250812-123334-ladsgroup.json
[12:33:39] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[12:33:50] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2195.codfw.wmnet with reason: Maintenance
[12:33:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T400854)', diff saved to https://phabricator.wikimedia.org/P81117 and previous config saved to /var/cache/conftool/dbconfig/20250812-123357-ladsgroup.json
[12:34:03] <logmsgbot>	 !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[12:34:30] <logmsgbot>	 !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[12:34:41] <logmsgbot>	 btullis@cumin1003 decommission (PID 1700582) is awaiting input
[12:35:15] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Release new version of the external-services-networkpolicy base template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177947 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[12:35:20] <wikibugs>	 (03CR) 10Btullis: [C:03+1] modules/external-services: support custom annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177948 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[12:35:25] <wikibugs>	 (03CR) 10Btullis: [C:03+1] datahub: update vendored modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177949 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[12:35:31] <wikibugs>	 (03CR) 10Btullis: [C:03+1] datahub: deploy the external services networkpolicies before the setup jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177950 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[12:36:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T400854)', diff saved to https://phabricator.wikimedia.org/P81118 and previous config saved to /var/cache/conftool/dbconfig/20250812-123633-ladsgroup.json
[12:41:46] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177987
[12:42:24] <wikibugs>	 (03PS3) 10Anzx: madwikisource: set metanamespace, sitename and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767)
[12:42:28] <wikibugs>	 (03PS2) 10Anzx: zghwiktionary: set sitename, timezone & metanamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785)
[12:42:30] <wikibugs>	 (03PS2) 10Anzx: tlwikisource: set timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177547 (https://phabricator.wikimedia.org/T388654)
[12:42:45] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177547 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx)
[12:42:52] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx)
[12:43:09] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx)
[12:43:23] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx)
[12:44:50] <wikibugs>	 (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177370 (owner: 10PipelineBot)
[12:45:26] <logmsgbot>	 btullis@cumin1003 decommission (PID 1700582) is awaiting input
[12:47:40] <icinga-wm>	 PROBLEM - Disk space on an-worker1120 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 150679 MB (4% inode=99%): /var/lib/hadoop/data/m 172247 MB (4% inode=99%): /var/lib/hadoop/data/d 160926 MB (4% inode=99%): /var/lib/hadoop/data/b 156245 MB (4% inode=99%): /var/lib/hadoop/data/e 160367 MB (4% inode=99%): /var/lib/hadoop/data/g 151595 MB (4% inode=99%): /var/lib/hadoop/data/f 150022 MB (3% inode=99%): /var/lib/hadoop/data
[12:47:40] <icinga-wm>	 7 MB (4% inode=99%): /var/lib/hadoop/data/i 166116 MB (4% inode=99%): /var/lib/hadoop/data/j 163782 MB (4% inode=99%): /var/lib/hadoop/data/l 171464 MB (4% inode=99%): /var/lib/hadoop/data/c 173358 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1120&var-datasource=eqiad+prometheus/ops
[12:47:57] <wikibugs>	 (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177987 (owner: 10PipelineBot)
[12:48:06] <wikibugs>	 (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177477 (owner: 10PipelineBot)
[12:48:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P81119 and previous config saved to /var/cache/conftool/dbconfig/20250812-124814-fceratto.json
[12:48:25] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Release new version of the external-services-networkpolicy base template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177947 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[12:48:28] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] modules/external-services: support custom annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177948 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[12:48:31] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] datahub: update vendored modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177949 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[12:48:33] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] datahub: deploy the external services networkpolicies before the setup jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177950 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[12:49:36] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177987 (owner: 10PipelineBot)
[12:50:23] <wikibugs>	 (03Merged) 10jenkins-bot: Release new version of the external-services-networkpolicy base template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177947 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[12:50:24] <wikibugs>	 (03Merged) 10jenkins-bot: modules/external-services: support custom annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177948 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[12:50:49] <wikibugs>	 (03Merged) 10jenkins-bot: datahub: update vendored modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177949 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[12:50:51] <wikibugs>	 (03Merged) 10jenkins-bot: datahub: deploy the external services networkpolicies before the setup jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177950 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[12:50:53] <logmsgbot>	 !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[12:50:55] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough
[12:51:22] <logmsgbot>	 !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[12:51:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P81120 and previous config saved to /var/cache/conftool/dbconfig/20250812-125140-ladsgroup.json
[12:52:33] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox
[12:52:40] <logmsgbot>	 !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[12:53:30] <logmsgbot>	 !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[12:53:32] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.netbox
[12:53:35] <logmsgbot>	 !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[12:54:00] <wikibugs>	 (03PS2) 10Anzx: minwikibooks: set sitename, metanamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177989 (https://phabricator.wikimedia.org/T395499)
[12:54:14] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177989 (https://phabricator.wikimedia.org/T395499) (owner: 10Anzx)
[12:54:20] <logmsgbot>	 !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[12:54:22] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: varnish: also convert abuse networks to use x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621)
[12:54:45] <wikibugs>	 (03CR) 10Xcollazo: "Never mind, this patch is already merged, and I can browse to the ZIM files from the index page. Seems like this was a duplicate link." [puppet] - 10https://gerrit.wikimedia.org/r/1039246 (owner: 10Milimetric)
[12:57:10] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker[1096-1099].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003"
[12:57:34] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker[1096-1099].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003"
[12:57:34] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:57:35] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-worker[1096-1099].eqiad.wmnet
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1300).
[13:00:05] <jouncebot>	 Tran and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:15] <Tran>	 👋
[13:00:15] <anzx>	 o/
[13:01:23] <Tchanders>	 I'll do Tran's
[13:03:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P81121 and previous config saved to /var/cache/conftool/dbconfig/20250812-130321-fceratto.json
[13:03:38] <Tchanders>	 I'll start now
[13:03:48] <Dreamy_Jazz>	 \o
[13:04:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[13:05:10] <wikibugs>	 (03Merged) 10jenkins-bot: Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran)
[13:05:10] <Lucas_WMDE>	 o/
[13:05:23] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] Record LDAP access for novemlinguae [puppet] - 10https://gerrit.wikimedia.org/r/1177977 (https://phabricator.wikimedia.org/T400176) (owner: 10Muehlenhoff)
[13:05:30] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough and A:wikidough
[13:05:36] <logmsgbot>	 !log tchanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1175113|Enable temporary accounts for special/non-standard/private wikis (T400672)]]
[13:05:39] <stashbot>	 T400672: Deploy temporary accounts to special/non-standard/private wikis - https://phabricator.wikimedia.org/T400672
[13:06:48] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P81122 and previous config saved to /var/cache/conftool/dbconfig/20250812-130648-ladsgroup.json
[13:07:41] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] tlwikisource: set timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177547 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx)
[13:09:55] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): zghwiktionary: set sitename, timezone & metanamespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx)
[13:10:22] <logmsgbot>	 !log tchanders@deploy1003 stran, tchanders: Backport for [[gerrit:1175113|Enable temporary accounts for special/non-standard/private wikis (T400672)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:11:08] <wikibugs>	 (03CR) 10Anzx: zghwiktionary: set sitename, timezone & metanamespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx)
[13:11:18] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): madwikisource: set metanamespace, sitename and timezone (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx)
[13:13:00] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] monitoring: Drop use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177401 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah)
[13:13:15] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): zghwiktionary: set sitename, timezone & metanamespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx)
[13:13:38] <wikibugs>	 (03CR) 10Majavah: [C:03+2] monitoring: Drop use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177401 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah)
[13:14:08] <wikibugs>	 (03PS1) 10Stevemunene: Delegate Kubernetes POD IP reverse ranges for dse-k8s-codfw cluster This patch updates the current delegations based on  data from the puppet repo hieradata/common/kubernetes.yaml file for the codfw cluster. [dns] - 10https://gerrit.wikimedia.org/r/1177990 (https://phabricator.wikimedia.org/T400037)
[13:14:17] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): madwikisource: set metanamespace, sitename and timezone (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx)
[13:14:52] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1171994 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert)
[13:14:53] <wikibugs>	 (03PS2) 10Stevemunene: Delegate Kubernetes POD IP reverse ranges for dse-k8s-codfw cluster This patch updates the current delegations based on  data from the puppet repo hieradata/common/kubernetes.yaml file for the codfw cluster. [dns] - 10https://gerrit.wikimedia.org/r/1177990 (https://phabricator.wikimedia.org/T400037)
[13:15:25] <wikibugs>	 (03PS3) 10Stevemunene: Delegate Kubernetes POD IP reverse ranges for dse-k8s-codfw cluster [dns] - 10https://gerrit.wikimedia.org/r/1177990 (https://phabricator.wikimedia.org/T400037)
[13:15:44] <Tchanders>	 ...testing...
[13:16:32] <Tchanders>	 looks good, continuing
[13:16:36] <logmsgbot>	 !log tchanders@deploy1003 stran, tchanders: Continuing with sync
[13:16:38] <wikibugs>	 (03PS4) 10Stevemunene: Delegate Kubernetes POD IP reverse ranges for dse-k8s-codfw cluster [dns] - 10https://gerrit.wikimedia.org/r/1177990 (https://phabricator.wikimedia.org/T400037)
[13:18:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T391056)', diff saved to https://phabricator.wikimedia.org/P81123 and previous config saved to /var/cache/conftool/dbconfig/20250812-131829-fceratto.json
[13:18:34] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[13:18:45] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2188.codfw.wmnet with reason: Maintenance
[13:18:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T391056)', diff saved to https://phabricator.wikimedia.org/P81124 and previous config saved to /var/cache/conftool/dbconfig/20250812-131851-fceratto.json
[13:20:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T391056)', diff saved to https://phabricator.wikimedia.org/P81125 and previous config saved to /var/cache/conftool/dbconfig/20250812-132001-fceratto.json
[13:20:22] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): zghwiktionary: set sitename, timezone & metanamespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx)
[13:20:25] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] zghwiktionary: set sitename, timezone & metanamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx)
[13:21:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11078018 (10VRiley-WMF) cloudcephosd1044 seems to time out during the install. Checked to make sure it was booting from the first disk. I will be looking into the connections very soon.
[13:21:33] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] minwikibooks: set sitename, metanamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177989 (https://phabricator.wikimedia.org/T395499) (owner: 10Anzx)
[13:21:51] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] centrallog: Add sampling rules for debug logging [puppet] - 10https://gerrit.wikimedia.org/r/1173442 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse)
[13:21:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T400854)', diff saved to https://phabricator.wikimedia.org/P81126 and previous config saved to /var/cache/conftool/dbconfig/20250812-132155-ladsgroup.json
[13:22:00] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[13:22:11] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2198.codfw.wmnet with reason: Maintenance
[13:23:23] <wikibugs>	 (03PS1) 10Hashar: gerrit: prevent crawling authenticated URLs [puppet] - 10https://gerrit.wikimedia.org/r/1177992 (https://phabricator.wikimedia.org/T392669)
[13:23:45] <logmsgbot>	 !log tchanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1175113|Enable temporary accounts for special/non-standard/private wikis (T400672)]] (duration: 18m 09s)
[13:23:49] <stashbot>	 T400672: Deploy temporary accounts to special/non-standard/private wikis - https://phabricator.wikimedia.org/T400672
[13:24:33] <wikibugs>	 (03PS1) 10Btullis: Add the new an-backup-namenode hosts to puppet [puppet] - 10https://gerrit.wikimedia.org/r/1177993 (https://phabricator.wikimedia.org/T397175)
[13:25:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.479s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:25:52] <Tchanders>	 I'm finished - next deployer please feel free to go ahead
[13:26:03] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): tlwikisource: add author ( Manunulat ) namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx)
[13:26:09] <Lucas_WMDE>	 ok!
[13:26:18] <Lucas_WMDE>	 I can deploy some of anzx’ patches :)
[13:27:02] <wikibugs>	 (03CR) 10Anzx: tlwikisource: add author ( Manunulat ) namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx)
[13:28:06] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): madwikisource: set metanamespace, sitename and timezone (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx)
[13:30:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.984s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:30:56] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] tlwikisource: add author ( Manunulat ) namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx)
[13:31:17] <Lucas_WMDE>	 ok, I think four changes are ready to deploy, one has an open question
[13:31:27] <Lucas_WMDE>	 and I think they can all go together, should be safe enough
[13:32:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177547 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx)
[13:32:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx)
[13:32:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177989 (https://phabricator.wikimedia.org/T395499) (owner: 10Anzx)
[13:32:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx)
[13:34:10] <wikibugs>	 (03CR) 10Anzx: madwikisource: set metanamespace, sitename and timezone (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx)
[13:34:53] <wikibugs>	 (03CR) 10SBassett: [C:03+1] prometheus: add additional metrics from logs [puppet] - 10https://gerrit.wikimedia.org/r/1175569 (owner: 10Cwhite)
[13:34:54] <wikibugs>	 (03Merged) 10jenkins-bot: tlwikisource: set timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177547 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx)
[13:35:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P81127 and previous config saved to /var/cache/conftool/dbconfig/20250812-133508-fceratto.json
[13:35:26] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1044.eqiad.wmnet with OS bookworm
[13:35:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11078067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1044.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[13:35:43] <wikibugs>	 (03Merged) 10jenkins-bot: zghwiktionary: set sitename, timezone & metanamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177932 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx)
[13:35:49] <wikibugs>	 (03Merged) 10jenkins-bot: minwikibooks: set sitename, metanamespace and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177989 (https://phabricator.wikimedia.org/T395499) (owner: 10Anzx)
[13:36:05] <wikibugs>	 (03Merged) 10jenkins-bot: tlwikisource: add author ( Manunulat ) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) (owner: 10Anzx)
[13:36:28] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1177547|tlwikisource: set timezone (T388654)]], [[gerrit:1177932|zghwiktionary: set sitename, timezone & metanamespace (T399785)]], [[gerrit:1177989|minwikibooks: set sitename, metanamespace and timezone (T395499)]], [[gerrit:1176509|tlwikisource: add author ( Manunulat ) namespace (T388654)]]
[13:36:36] <stashbot>	 T388654: Post-creation work for tlwikisource - https://phabricator.wikimedia.org/T388654
[13:36:36] <stashbot>	 T399785: Post-creation work for zghwiktionary - https://phabricator.wikimedia.org/T399785
[13:36:37] <stashbot>	 T395499: Post-creation work for minwikibooks - https://phabricator.wikimedia.org/T395499
[13:37:05] <wikibugs>	 (03PS1) 10Majavah: P:docker: Add trixie as a known base image [puppet] - 10https://gerrit.wikimedia.org/r/1177995
[13:38:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, anzx: Backport for [[gerrit:1177547|tlwikisource: set timezone (T388654)]], [[gerrit:1177932|zghwiktionary: set sitename, timezone & metanamespace (T399785)]], [[gerrit:1177989|minwikibooks: set sitename, metanamespace and timezone (T395499)]], [[gerrit:1176509|tlwikisource: add author ( Manunulat ) namespace (T388654)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:38:41] <Lucas_WMDE>	 anzx: please test :)
[13:39:21] <anzx>	 checking 
[13:39:37] <Lucas_WMDE>	 hm, I think I see a problem on zgh.wiktionary.org
[13:39:55] <Lucas_WMDE>	 previously: ⴰⵎⵙⴳⴷⴰⵍ ⵏ Wiktionary, alias Wiktionary talk
[13:40:03] <Lucas_WMDE>	 now: ⴰⵎⵙⴳⴷⴰⵍ ⵏ ⵡⵉⴽⵉⵎⴰⵡⴰⵍ, alias ⵡⵉⴽⵉⵎⴰⵡⴰⵍ talk
[13:40:27] <Lucas_WMDE>	 (actually, scratch the alias part, there’s still a “Wiktionary talk” alias in place)
[13:40:39] <Lucas_WMDE>	 but we might want a new alias for ⴰⵎⵙⴳⴷⴰⵍ ⵏ Wiktionary?
[13:40:44] <Lucas_WMDE>	 (could be done in a follow-up change)
[13:40:55] <wikibugs>	 (03CR) 10Hashar: "> Do we also want to rename that user in the database later?" [puppet] - 10https://gerrit.wikimedia.org/r/1175122 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar)
[13:41:36] <wikibugs>	 (03PS1) 10Majavah: Add python-trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1177998
[13:42:31] <wikibugs>	 (03CR) 10Hashar: "Google search console reports them with:" [puppet] - 10https://gerrit.wikimedia.org/r/1177992 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar)
[13:42:43] <jinxer-wm>	 FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1021:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[13:43:25] <wikibugs>	 (03CR) 10Majavah: Add python-trixie (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1177998 (owner: 10Majavah)
[13:44:06] <Lucas_WMDE>	 and something similar on min.wikibooks.org as well, if I’m not mistaken
[13:44:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1053.eqiad.wmnet
[13:45:24] <Lucas_WMDE>	 with the change in effect, “Rundiang Wikibooks” and “Pembicaraan Wikibooks” no longer exist as namespaces or aliases, so new aliases are probably a good idea?
[13:47:45] <anzx>	 Lucas_WMDE: I will create a follow-up for adding new aliases; I think this happens for talk page namespaces where the project name comes at the end
[13:48:23] <Lucas_WMDE>	 sounds good
[13:48:35] <Lucas_WMDE>	 do you want to test anything else or should we go ahead with these changes for now?
[13:49:10] <anzx>	 Lucas_WMDE: checked others, good to sync
[13:49:28] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, anzx: Continuing with sync
[13:49:30] <Lucas_WMDE>	 alright, thanks!
[13:49:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1053.eqiad.wmnet
[13:50:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P81128 and previous config saved to /var/cache/conftool/dbconfig/20250812-135016-fceratto.json
[13:50:46] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox
[13:51:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1054.eqiad.wmnet
[13:51:57] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[13:52:36] <wikibugs>	 (03PS2) 10Btullis: Add the new an-backup-namenode hosts to puppet [puppet] - 10https://gerrit.wikimedia.org/r/1177993 (https://phabricator.wikimedia.org/T397175)
[13:52:51] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[13:53:36] <wikibugs>	 (03PS1) 10Stevemunene: zookeeper: Remove an-druid100[1-2] from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1177999 (https://phabricator.wikimedia.org/T400330)
[13:53:51] <Lucas_WMDE>	 anzx: also, I’m not seeing a new patch set in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1177936 yet, did you mean to upload one?
[13:53:57] <Lucas_WMDE>	 (or maybe I misunderstood your comment)
[13:54:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1177547|tlwikisource: set timezone (T388654)]], [[gerrit:1177932|zghwiktionary: set sitename, timezone & metanamespace (T399785)]], [[gerrit:1177989|minwikibooks: set sitename, metanamespace and timezone (T395499)]], [[gerrit:1176509|tlwikisource: add author ( Manunulat ) namespace (T388654)]] (duration: 18m 21s)
[13:54:51] <wikibugs>	 (03PS4) 10Anzx: madwikisource: set metanamespace, sitename and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767)
[13:54:56] <stashbot>	 T388654: Post-creation work for tlwikisource - https://phabricator.wikimedia.org/T388654
[13:54:56] <stashbot>	 T399785: Post-creation work for zghwiktionary - https://phabricator.wikimedia.org/T399785
[13:54:57] <stashbot>	 T395499: Post-creation work for minwikibooks - https://phabricator.wikimedia.org/T395499
[13:54:57] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T401651#11078134 (10phaultfinder)
[13:55:19] <anzx>	 Lucas_WMDE: published edit
[13:55:41] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[13:55:43] <Lucas_WMDE>	 ah, now it’s there :)
[13:55:59] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add the new an-backup-namenode hosts to puppet [puppet] - 10https://gerrit.wikimedia.org/r/1177993 (https://phabricator.wikimedia.org/T397175) (owner: 10Btullis)
[13:56:13] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2229.codfw.wmnet with reason: Maintenance
[13:56:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1054.eqiad.wmnet
[13:56:36] <Lucas_WMDE>	 jouncebot: nowandnext
[13:56:36] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1300)
[13:56:36] <jouncebot>	 In 0 hour(s) and 33 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1430)
[13:56:44] <wikibugs>	 (03CR) 10Jelto: [C:03+1] gerrit: prevent crawling authenticated URLs [puppet] - 10https://gerrit.wikimedia.org/r/1177992 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar)
[13:56:52] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] "I'm not sure, although certainly we don't get the same complaint about restarting on Bookworm." [puppet] - 10https://gerrit.wikimedia.org/r/1177450 (https://phabricator.wikimedia.org/T401584) (owner: 10Andrew Bogott)
[13:57:12] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gerrit: prevent crawling authenticated URLs [puppet] - 10https://gerrit.wikimedia.org/r/1177992 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar)
[13:57:56] <Lucas_WMDE>	 I don’t think we have enough time for the madwikisource change tbh
[13:58:07] <Lucas_WMDE>	 but let me see which maintenance scripts should be run on the wikis that got deployments
[13:58:23] <anzx>	 Lucas_WMDE: i can schedule all others to next window
[13:58:25] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[13:58:28] <Lucas_WMDE>	 sounds good, thanks!
[13:58:31] <Lucas_WMDE>	 namespaceDupes I think
[13:59:25] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2204.codfw.wmnet with reason: Maintenance
[13:59:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: namespaceDupes zghwiktionary --fix  # T399785
[14:00:09] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[14:00:13] <jinxer-wm>	 RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1021:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[14:01:11] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: namespaceDupes tlwikisource --fix  # T388654
[14:01:15] <stashbot>	 T388654: Post-creation work for tlwikisource - https://phabricator.wikimedia.org/T388654
[14:01:52] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 mwscript-k8s job started: namespaceDupes minwikibooks --fix  # T395499
[14:01:56] <stashbot>	 T395499: Post-creation work for minwikibooks - https://phabricator.wikimedia.org/T395499
[14:03:11] <wikibugs>	 (03PS2) 10Anzx: minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785)
[14:03:19] <wikibugs>	 (03PS1) 10Fabfur: profile,prometheus,haproxykafka: support for rdkafka metrics [puppet] - 10https://gerrit.wikimedia.org/r/1178001 (https://phabricator.wikimedia.org/T400978)
[14:03:47] <Lucas_WMDE>	 did some cleanupTitles dry-runs as well just in case, and they all say nothing to update either
[14:03:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx)
[14:03:54] <wikibugs>	 (03PS2) 10Andrew Bogott: apt: Replace use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah)
[14:03:55] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:03:56] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah)
[14:03:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:04] <wikibugs>	 (03PS3) 10Anzx: minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785)
[14:04:17] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1100 to an-backup-namenode1001
[14:04:28] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] madwikisource: set metanamespace, sitename and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx)
[14:04:37] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.netbox
[14:04:56] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti1053 / ganeti1054 to the production cluster - https://phabricator.wikimedia.org/T401691 (10MoritzMuehlenhoff) 03NEW
[14:05:04] <anzx>	 Lucas_WMDE: thanks for deploying, created a follow-up for namespace aliases: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1178000/ (will schedule it for the next window)
[14:05:24] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T391056)', diff saved to https://phabricator.wikimedia.org/P81129 and previous config saved to /var/cache/conftool/dbconfig/20250812-140523-fceratto.json
[14:05:28] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[14:05:31] <Lucas_WMDE>	 great, thank you!
[14:05:39] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2202.codfw.wmnet with reason: Maintenance
[14:05:56] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2203.codfw.wmnet with reason: Maintenance
[14:06:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T391056)', diff saved to https://phabricator.wikimedia.org/P81130 and previous config saved to /var/cache/conftool/dbconfig/20250812-140603-fceratto.json
[14:07:01] <wikibugs>	 (03CR) 10Btullis: zookeeper: Remove an-druid100[1-2] from the cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177999 (https://phabricator.wikimedia.org/T400330) (owner: 10Stevemunene)
[14:07:14] <wikibugs>	 (03PS3) 10Andrew Bogott: apt: Replace use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah)
[14:07:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Assign Ganeti role to ganeti1053/ganeti1054 [puppet] - 10https://gerrit.wikimedia.org/r/1178002 (https://phabricator.wikimedia.org/T401691)
[14:07:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:08:01] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1100 to an-backup-namenode1001 - btullis@cumin1003"
[14:08:12] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah)
[14:08:28] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1100 to an-backup-namenode1001 - btullis@cumin1003"
[14:08:28] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:08:28] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-namenode1001 on all recursors
[14:08:31] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-namenode1001 on all recursors
[14:08:32] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-namenode1001
[14:09:24] <moritzm>	 !incidents
[14:09:26] <sirenbot>	 6567 (ACKED)  kafka-jumbo1009/Kafka Broker Server (paged)
[14:09:33] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-namenode1001
[14:10:12] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1100 to an-backup-namenode1001
[14:10:24] <bblack>	 I assume expired?
[14:10:47] <moritzm>	 yeah, we ran into this yesterday as well and it seems the host is still up
[14:10:50] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1101 to an-backup-namenode1002
[14:11:09] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.netbox
[14:11:19] <moritzm>	 I can't log into the victorops portal though, only getting the spinning circle, trying from a private browser tab
[14:11:33] <bblack>	 yeah it's two alerts from ~24h ago that re-upped
[14:11:42] <vgutierrez>	 acked them via the app
[14:12:12] <moritzm>	 just silenced them via the web as well
[14:12:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[14:12:57] <moritzm>	 brouberol: what's the time frame for decomming kafka-jumbo100[89]? we've silenced the alert for 24 hours, or is a longer period needed?
[14:13:07] <vgutierrez>	 I was writing the same question :)
[14:14:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Assign Ganeti role to ganeti1053/ganeti1054 [puppet] - 10https://gerrit.wikimedia.org/r/1178002 (https://phabricator.wikimedia.org/T401691) (owner: 10Muehlenhoff)
[14:14:50] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1101 to an-backup-namenode1002 - btullis@cumin1003"
[14:15:18] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1101 to an-backup-namenode1002 - btullis@cumin1003"
[14:15:18] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:15:19] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-namenode1002 on all recursors
[14:15:22] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-namenode1002 on all recursors
[14:15:22] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-namenode1002
[14:16:46] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-namenode1002
[14:17:10] <wikibugs>	 (03PS2) 10Stevemunene: zookeeper: Remove an-druid1001 from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1177999 (https://phabricator.wikimedia.org/T400330)
[14:17:10] <wikibugs>	 (03PS1) 10Stevemunene: zookeeper: Remove an-druid1002 from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1178006 (https://phabricator.wikimedia.org/T400330)
[14:17:25] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1101 to an-backup-namenode1002
[14:17:43] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693 (10Andrew) 03NEW
[14:22:16] <wikibugs>	 (03CR) 10Gergő Tisza: "Is it worth splitting things on API requests vs. web UI requests (something like `prefix: url.keyword: /w/api.php`)?" [puppet] - 10https://gerrit.wikimedia.org/r/1175569 (owner: 10Cwhite)
[14:22:48] <wikibugs>	 (03CR) 10JHathaway: apt: Replace use of legacy facts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah)
[14:24:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T391056)', diff saved to https://phabricator.wikimedia.org/P81131 and previous config saved to /var/cache/conftool/dbconfig/20250812-142400-fceratto.json
[14:24:05] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[14:24:11] <wikibugs>	 (03CR) 10Stevemunene: zookeeper: Remove an-druid1001 from the cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177999 (https://phabricator.wikimedia.org/T400330) (owner: 10Stevemunene)
[14:24:12] <wikibugs>	 (03PS4) 10Majavah: apt: Replace use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586)
[14:24:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T399249)', diff saved to https://phabricator.wikimedia.org/P81132 and previous config saved to /var/cache/conftool/dbconfig/20250812-142413-fceratto.json
[14:24:17] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[14:24:32] <wikibugs>	 (03CR) 10Majavah: apt: Replace use of legacy facts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah)
[14:25:06] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-namenode1001.eqiad.wmnet with OS bookworm
[14:25:47] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6561/console" [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah)
[14:28:00] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] apt: Replace use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah)
[14:28:06] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] apt: Replace use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1177403 (https://phabricator.wikimedia.org/T401586) (owner: 10Majavah)
[14:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1430)
[14:34:37] <brouberol>	 moritzm: they are all done
[14:34:39] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11078332 (10tappof)
[14:34:41] <brouberol>	 *gone
[14:34:48] <brouberol>	 may they rest in peace
[14:34:53] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[14:35:35] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2220.codfw.wmnet with reason: Maintenance
[14:36:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[14:37:16] <wikibugs>	 (03PS1) 10Dbrant: Add app_activity_tab event stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178007 (https://phabricator.wikimedia.org/T399630)
[14:37:29] <moritzm>	 brouberol: hmmh, ok. mysteriously they re-alerted half an hour ago
[14:37:48] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] zookeeper: Remove an-druid1001 from the cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1177999 (https://phabricator.wikimedia.org/T400330) (owner: 10Stevemunene)
[14:38:01] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] zookeeper: Remove an-druid1002 from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1178006 (https://phabricator.wikimedia.org/T400330) (owner: 10Stevemunene)
[14:38:32] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11078351 (10tappof)
[14:38:56] <brouberol>	 moritzm: that's weird. The hosts were fully decommissioned yesterday mid-afternoon
[14:39:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P81134 and previous config saved to /var/cache/conftool/dbconfig/20250812-143920-fceratto.json
[14:39:22] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178007 (https://phabricator.wikimedia.org/T399630) (owner: 10Dbrant)
[14:39:28] <brouberol>	 I just saw
[14:39:28] <brouberol>	 > yeah, we ran into this yesterday as well and it seems the host is still up
[14:39:32] <brouberol>	 is it?
[14:40:11] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11078359 (10tappof)
[14:40:18] <moritzm>	 brouberol: no, that was just my initial assumption, given that I saw you running the decom cookbook for 1007, but missed the later ones
[14:40:25] <brouberol>	 oh gotcha.
[14:40:36] <moritzm>	 but the two incidents haven't recovered and are somehow still showing up at https://portal.victorops.com/ui/wikimedia/incidents
[14:40:49] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11078360 (10tappof)
[14:42:14] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[14:42:38] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11078371 (10tappof)
[14:43:24] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[14:44:31] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[14:45:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[14:45:55] <vgutierrez>	 !incidents
[14:45:56] <sirenbot>	 6570 (UNACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[14:45:57] <jinxer-wm>	 FIRING: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:46:03] <vgutierrez>	 !ack 6750
[14:46:04] <sirenbot>	 Attempt to ack incident 6750 failed.
[14:46:09] <sukhe>	 not surprising :(
[14:46:12] <vgutierrez>	 !ack 6570
[14:46:13] <sirenbot>	 6570 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[14:46:13] <sukhe>	 !ack 6570
[14:46:14] <sirenbot>	 6570 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[14:46:17] <vgutierrez>	 !incidents
[14:46:17] <sirenbot>	 6570 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[14:46:18] <sirenbot>	 6571 (UNACKED)  ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[14:46:20] <vgutierrez>	 !ack 6571
[14:46:21] <sirenbot>	 6571 (ACKED)  ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[14:46:30] <jinxer-wm>	 FIRING: [2x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 4 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[14:46:37] <sukhe>	 ouch
[14:47:13] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[14:47:48] <wikibugs>	 (03PS1) 10Brouberol: datahub-next: deploy the external services networkpolicies before the setup jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178010 (https://phabricator.wikimedia.org/T395126)
[14:48:42] <wikibugs>	 (03CR) 10Btullis: [C:03+1] datahub-next: deploy the external services networkpolicies before the setup jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178010 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[14:48:44] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11078386 (10tappof) @HSwan-WMF, could you please review and approve @egardner’s request? Thank you.
[14:50:50] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] datahub-next: deploy the external services networkpolicies before the setup jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178010 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[14:50:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:50:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:51:13] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] ncmonitor: Ignore cricketinfo sites [puppet] - 10https://gerrit.wikimedia.org/r/1177557 (owner: 10BCornwall)
[14:51:30] <jinxer-wm>	 FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 6 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[14:51:32] <brett>	 !incidents
[14:51:32] <sirenbot>	 6570 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[14:51:33] <sirenbot>	 6571 (ACKED)  ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[14:51:49] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[14:52:55] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance backend on cp2030 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:53:00] <sukhe>	 huh?
[14:53:10] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P81135 and previous config saved to /var/cache/conftool/dbconfig/20250812-145428-fceratto.json
[14:54:55] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance backend on cp2030 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:56:09] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[14:56:54] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Convert sshd config for trixie and later to an EPP template [puppet] - 10https://gerrit.wikimedia.org/r/1166388 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff)
[15:00:04] <jouncebot>	 jelto, arnoldokoth, and mutante: May I have your attention please! SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1500)
[15:00:08] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[15:01:21] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[15:01:45] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[15:01:56] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11078448 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[15:02:10] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[15:02:40] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-namenode1001.eqiad.wmnet with reason: host reimage
[15:04:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[15:07:07] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[15:08:10] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:24] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[15:08:50] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-namenode1001.eqiad.wmnet with reason: host reimage
[15:09:32] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:09:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T399249)', diff saved to https://phabricator.wikimedia.org/P81136 and previous config saved to /var/cache/conftool/dbconfig/20250812-150935-fceratto.json
[15:09:37] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2216.codfw.wmnet with reason: Maintenance
[15:09:40] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[15:09:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T391056)', diff saved to https://phabricator.wikimedia.org/P81137 and previous config saved to /var/cache/conftool/dbconfig/20250812-150944-fceratto.json
[15:09:49] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[15:09:51] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2199.codfw.wmnet with reason: Maintenance
[15:10:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T391056)', diff saved to https://phabricator.wikimedia.org/P81138 and previous config saved to /var/cache/conftool/dbconfig/20250812-151053-fceratto.json
[15:14:02] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] admin: allow systemctl status for MediaWiki train [puppet] - 10https://gerrit.wikimedia.org/r/1177958 (owner: 10Hashar)
[15:14:42] <wikibugs>	 (03PS1) 10Brouberol: datahub: ensure we keep the network policies after the helm deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178014 (https://phabricator.wikimedia.org/T395126)
[15:14:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] datahub: ensure we keep the network policies after the helm deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178014 (https://phabricator.wikimedia.org/T395126) (owner: 10Brouberol)
[15:15:36] <wikibugs>	 (03PS2) 10Brouberol: datahub: ensure we keep the network policies after the helm deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178014 (https://phabricator.wikimedia.org/T395126)
[15:16:35] <wikibugs>	 (03CR) 10Clément Goubert: Add python-trixie (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1177998 (owner: 10Majavah)
[15:19:43] <wikibugs>	 (03PS2) 10Majavah: Add python-trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1177998
[15:19:59] <wikibugs>	 (03CR) 10Majavah: Add python-trixie (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1177998 (owner: 10Majavah)
[15:20:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:21:08] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM did not test all, just a couple + some manually installed packages and such in the trixie container" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255) (owner: 10Majavah)
[15:21:08] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1177998 (owner: 10Majavah)
[15:23:10] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:24:15] <logmsgbot>	 andrew@cumin2002 reimage (PID 2504848) is awaiting input
[15:24:55] <sukhe>	 !log restart varnish-frontend on cp5026
[15:24:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:13] <wikibugs>	 (CR) Majavah: [C:+2] Add Trixie images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255) (owner: Majavah)
[15:25:48] <wikibugs>	 (Merged) jenkins-bot: Add Trixie images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255) (owner: Majavah)
[15:25:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[15:25:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:26:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P81139 and previous config saved to /var/cache/conftool/dbconfig/20250812-152601-fceratto.json
[15:26:30] <jinxer-wm>	 RESOLVED: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 3 unhealthy realservers pooled on lvs5005:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[15:27:06] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-namenode1001.eqiad.wmnet with OS bookworm
[15:30:09] <wikibugs>	 (PS1) Majavah: Replace Docker registry URL [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1178015 (https://phabricator.wikimedia.org/T394902)
[15:30:30] <wikibugs>	 (CR) Majavah: [C:+2] Replace Docker registry URL [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1178015 (https://phabricator.wikimedia.org/T394902) (owner: Majavah)
[15:30:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[15:31:01] <wikibugs>	 (Merged) jenkins-bot: Replace Docker registry URL [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1178015 (https://phabricator.wikimedia.org/T394902) (owner: Majavah)
[15:31:38] <sukhe>	 !incidents
[15:31:39] <sirenbot>	 6570 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[15:31:39] <sirenbot>	 6571 (RESOLVED)  ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[15:32:45] <wikibugs>	 (CR) BCornwall: [C:+2] ncmonitor: Ignore cricketinfo sites [puppet] - https://gerrit.wikimedia.org/r/1177557 (owner: BCornwall)
[15:36:16] <wikibugs>	 (PS2) BCornwall: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1177469 (owner: Ncmonitor)
[15:41:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P81140 and previous config saved to /var/cache/conftool/dbconfig/20250812-154109-fceratto.json
[15:48:49] <wikibugs>	 (PS1) Majavah: Add mono612-sssd/base [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1178019 (https://phabricator.wikimedia.org/T400255)
[15:49:23] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.199.0" for 2 host(s)
[15:51:11] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.199.0" completed for 2 hosts
[15:51:27] <wikibugs>	 (CR) Majavah: [C:+2] Add mono612-sssd/base [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1178019 (https://phabricator.wikimedia.org/T400255) (owner: Majavah)
[15:52:09] <wikibugs>	 (Merged) jenkins-bot: Add mono612-sssd/base [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1178019 (https://phabricator.wikimedia.org/T400255) (owner: Majavah)
[15:54:27] <wikibugs>	 (CR) Btullis: [C:+1] zookeeper: Remove an-druid1001 from the cluster [puppet] - https://gerrit.wikimedia.org/r/1177999 (https://phabricator.wikimedia.org/T400330) (owner: Stevemunene)
[15:54:34] <wikibugs>	 (CR) Btullis: [C:+1] zookeeper: Remove an-druid1002 from the cluster [puppet] - https://gerrit.wikimedia.org/r/1178006 (https://phabricator.wikimedia.org/T400330) (owner: Stevemunene)
[15:56:17] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T391056)', diff saved to https://phabricator.wikimedia.org/P81141 and previous config saved to /var/cache/conftool/dbconfig/20250812-155616-fceratto.json
[15:56:21] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[15:56:56] <wikibugs>	 (CR) Btullis: [C:+1] datahub: ensure we keep the network policies after the helm deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1178014 (https://phabricator.wikimedia.org/T395126) (owner: Brouberol)
[15:57:05] <wikibugs>	 (CR) Brouberol: [C:+2] datahub: ensure we keep the network policies after the helm deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1178014 (https://phabricator.wikimedia.org/T395126) (owner: Brouberol)
[15:58:13] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[15:58:47] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-namenode1002.eqiad.wmnet with OS bookworm
[15:59:30] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[16:00:05] <jouncebot>	 jhathaway and moritzm: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:16] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[16:00:33] <icinga-wm>	 RECOVERY - Disk space on an-worker1127 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops
[16:01:02] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[16:01:09] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1178022 (https://phabricator.wikimedia.org/T401713)
[16:01:38] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[16:02:12] <wikibugs>	 (CR) Giuseppe Lavagetto: varnish: also convert abuse networks to use x-provenance (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621) (owner: Giuseppe Lavagetto)
[16:04:23] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[16:07:45] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:08:07] <wikibugs>	 (PS4) Giuseppe Lavagetto: varnish: also convert abuse networks to use x-provenance [puppet] - https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621)
[16:11:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:13:35] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T401713
[16:13:39] <stashbot>	 T401713: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T401713
[16:14:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2161 with weight 0 T401713', diff saved to https://phabricator.wikimedia.org/P81142 and previous config saved to /var/cache/conftool/dbconfig/20250812-161402-fceratto.json
[16:16:13] <icinga-wm>	 RECOVERY - Disk space on an-worker1122 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1122&var-datasource=eqiad+prometheus/ops
[16:16:55] <wikibugs>	 (PS3) Giuseppe Lavagetto: Remove blocked-nets from varnish [puppet] - https://gerrit.wikimedia.org/r/1175991 (https://phabricator.wikimedia.org/T396621)
[16:18:30] <wikibugs>	 (CR) Dzahn: admin: stop using groups parsoid-roots and parsoid-admin (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: Dzahn)
[16:22:00] <federico3>	 !log Starting s8 codfw failover from db2165 to db2161 - T401713
[16:22:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:05] <stashbot>	 T401713: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T401713
[16:23:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2161 to s8 primary T401713', diff saved to https://phabricator.wikimedia.org/P81143 and previous config saved to /var/cache/conftool/dbconfig/20250812-162306-fceratto.json
[16:23:37] <wikibugs>	 (PS1) Anzx: madwikisource: add logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1178023 (https://phabricator.wikimedia.org/T391767)
[16:23:53] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178023 (https://phabricator.wikimedia.org/T391767) (owner: Anzx)
[16:24:09] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178027 (https://phabricator.wikimedia.org/T399785) (owner: Anzx)
[16:24:22] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: Anzx)
[16:24:38] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 12 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) (owner: Anzx)
[16:24:42] <wikibugs>	 (CR) Cwhite: "We would have to normalize it out to avoid cardinality issues." [puppet] - https://gerrit.wikimedia.org/r/1175569 (owner: Cwhite)
[16:24:46] <wikibugs>	 (PS1) Dzahn: lists: delete unused apache.conf.erb [puppet] - https://gerrit.wikimedia.org/r/1178029
[16:25:22] <wikibugs>	 (CR) Dzahn: [C:+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178029" [puppet] - https://gerrit.wikimedia.org/r/1177455 (https://phabricator.wikimedia.org/T303725) (owner: Dzahn)
[16:25:34] <wikibugs>	 (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1178029/6563/" [puppet] - https://gerrit.wikimedia.org/r/1178029 (owner: Dzahn)
[16:26:09] <wikibugs>	 (CR) Dzahn: [V:+1] "also there are RSA acmechief certs reference in this config, and we know that would now result in syntax errors since we don't use and hav" [puppet] - https://gerrit.wikimedia.org/r/1178029 (owner: Dzahn)
[16:26:41] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.200.0" for 2 host(s)
[16:28:09] <wikibugs>	 (CR) Vgutierrez: "please take into account that acme-chief still issues RSA certs for mail related systems. To be accurate, the following certs get both rsa" [puppet] - https://gerrit.wikimedia.org/r/1178029 (owner: Dzahn)
[16:28:29] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.200.0" completed for 2 hosts
[16:28:32] <wikibugs>	 (CR) Dzahn: [V:+1] "the actual template used seems to be modules/profile/templates/lists/apache.conf.epp" [puppet] - https://gerrit.wikimedia.org/r/1178029 (owner: Dzahn)
[16:28:59] <wikibugs>	 (PS4) Giuseppe Lavagetto: Remove blocked-nets from varnish [puppet] - https://gerrit.wikimedia.org/r/1175991 (https://phabricator.wikimedia.org/T396621)
[16:29:47] <wikibugs>	 (CR) Dzahn: [V:+1] "Ah, thanks for pointing this out! That's good to have in mind. Regardless this template appears to be unused entirely." [puppet] - https://gerrit.wikimedia.org/r/1178029 (owner: Dzahn)
[16:31:29] <wikibugs>	 (CR) Stevemunene: [C:+2] zookeeper: Remove an-druid1001 from the cluster [puppet] - https://gerrit.wikimedia.org/r/1177999 (https://phabricator.wikimedia.org/T400330) (owner: Stevemunene)
[16:32:54] <wikibugs>	 (PS1) Dzahn: lists: add NEL headers to apache.conf.epp template [puppet] - https://gerrit.wikimedia.org/r/1178032 (https://phabricator.wikimedia.org/T303725)
[16:34:26] <wikibugs>	 (CR) Dzahn: "The actual header lines are always the same, copied straight from the ticket. Review is just that it doesn't result in a syntax error or s" [puppet] - https://gerrit.wikimedia.org/r/1178032 (https://phabricator.wikimedia.org/T303725) (owner: Dzahn)
[16:37:13] <wikibugs>	 (CR) Dzahn: [C:+1] "here is the same thing for gerrit that is already deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1175552" [puppet] - https://gerrit.wikimedia.org/r/1178032 (https://phabricator.wikimedia.org/T303725) (owner: Dzahn)
[16:37:24] <wikibugs>	 (CR) Dzahn: [V:+1 C:+1] "https://puppet-compiler.wmflabs.org/output/1178032/6566/lists1004.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1178032 (https://phabricator.wikimedia.org/T303725) (owner: Dzahn)
[16:38:02] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11079185 (wiki_willy) Thanks @MoritzMuehlenhoff!
[16:38:08] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-namenode1002.eqiad.wmnet with reason: host reimage
[16:38:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11079186 (wiki_willy) Awesome, thank you!
[16:41:03] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to nda & logstash for Novem Linguae - https://phabricator.wikimedia.org/T400176#11079201 (Novem_Linguae) In progress→Resolved The NDA tools / idp login worked after logging out and logging back in. Thanks for that advice.  Most of this ticket is resolve...
[16:42:06] <wikibugs>	 (PS5) Giuseppe Lavagetto: varnish: also convert abuse networks to use x-provenance [puppet] - https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621)
[16:42:07] <wikibugs>	 (PS5) Giuseppe Lavagetto: Remove blocked-nets from varnish [puppet] - https://gerrit.wikimedia.org/r/1175991 (https://phabricator.wikimedia.org/T396621)
[16:42:31] <wikibugs>	 (PS3) Giuseppe Lavagetto: varnish: stop loading netmaps [puppet] - https://gerrit.wikimedia.org/r/1175992 (https://phabricator.wikimedia.org/T396621)
[16:43:02] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-namenode1002.eqiad.wmnet with reason: host reimage
[16:49:09] <wikibugs>	 (PS1) KartikMistry: Section Translation: Add Arakan Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1178036 (https://phabricator.wikimedia.org/T392490)
[16:52:13] <wikibugs>	 (CR) Dzahn: "the actual header values are always the same, as stated on ticket and already deployed on gerrit here https://gerrit.wikimedia.org/r/c/ope" [puppet] - https://gerrit.wikimedia.org/r/1177456 (https://phabricator.wikimedia.org/T303725) (owner: Dzahn)
[16:52:20] <wikibugs>	 (PS1) Andrew Bogott: sssd: vary sssd.conf file mode depending on distro [puppet] - https://gerrit.wikimedia.org/r/1178038 (https://phabricator.wikimedia.org/T401584)
[16:56:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Investigate whether we can add RAM to dumpsdata100[4-7] from any decommissioned hosts - https://phabricator.wikimedia.org/T401299#11079268 (BTullis) >>! In T401299#11073333, @VRiley-WMF wrote: > @BTullis I was looking into the unit dumpstata1004-5 and it looks like basic suppo...
[16:57:36] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware, Data-Platform-SRE (2025.07.26 - 2025.08.15): decommission an-worker109[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T401678#11079272 (BTullis) a:BTullis→None
[16:58:06] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware, Data-Platform-SRE (2025.07.26 - 2025.08.15): decommission an-worker109[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T401678#11079280 (BTullis) I still have a few puppet references and secrets to delete, but the hardware is ready to be de-racked wh...
[17:00:05] <jouncebot>	 swfrench-wmf and urandom: How many deployers does it take to do MediaWiki infrastructure (UTC late) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1700).
[17:00:21] <swfrench-wmf>	 o/
[17:00:22] <urandom>	 o/
[17:00:31] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-namenode1002.eqiad.wmnet with OS bookworm
[17:01:05] <swfrench-wmf>	 getting things set up, should be 5m to get everything together
[17:01:45] <wikibugs>	 (CR) Gergő Tisza: prometheus: add additional metrics from logs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1175569 (owner: Cwhite)
[17:06:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by swfrench@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1171702 (https://phabricator.wikimedia.org/T368096) (owner: Eevans)
[17:07:19] <wikibugs>	 (PS1) Majavah: base: gen_fingerprints: Update sshd path for Trixie [puppet] - https://gerrit.wikimedia.org/r/1178043 (https://phabricator.wikimedia.org/T393762)
[17:07:42] <wikibugs>	 (Merged) jenkins-bot: image-suggestion: reconfigure for data-gateway listener [mediawiki-config] - https://gerrit.wikimedia.org/r/1171702 (https://phabricator.wikimedia.org/T368096) (owner: Eevans)
[17:07:45] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:08:10] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Backport for [[gerrit:1171702|image-suggestion: reconfigure for data-gateway listener (T368096)]]
[17:08:15] <stashbot>	 T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096
[17:08:36] <wikibugs>	 SRE, SRE-Access-Requests, Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11079350 (HSwan-WMF) @tappof  Approved- thank you!
[17:10:13] <logmsgbot>	 !log swfrench@deploy1003 swfrench, eevans: Backport for [[gerrit:1171702|image-suggestion: reconfigure for data-gateway listener (T368096)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:11:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:11:53] * swfrench-wmf tries to think whether there's anything testable in mw-debug here
[17:12:09] <wikibugs>	 (CR) Andrew Bogott: [C:+2] sssd: vary sssd.conf file mode depending on distro [puppet] - https://gerrit.wikimedia.org/r/1178038 (https://phabricator.wikimedia.org/T401584) (owner: Andrew Bogott)
[17:13:14] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2167 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2936.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:13:22] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2166 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2945.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:13:38] <wikibugs>	 (PS3) Cwhite: prometheus: add additional metrics from logs [puppet] - https://gerrit.wikimedia.org/r/1175569
[17:13:50] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2152 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2972.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:13:51] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2163 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2972.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:13:54] <swfrench-wmf>	 federico3: is this potentially the s8 switchover ^^
[17:14:06] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2989.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:14:07] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2161 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2989.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:14:12] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2164 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2994.77 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:14:25] <wikibugs>	 (CR) FNegri: [C:+1] "Tested by manually applying to `bastion-codfw1dev-06.bastioninfra-codfw1dev.codfw1dev.wikimedia.cloud`, it worked as expected." [puppet] - https://gerrit.wikimedia.org/r/1178043 (https://phabricator.wikimedia.org/T393762) (owner: Majavah)
[17:14:52] <Amir1>	 Here
[17:15:01] <Amir1>	 let me check
[17:15:36] <wikibugs>	 (CR) Cwhite: prometheus: add additional metrics from logs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1175569 (owner: Cwhite)
[17:15:58] <Amir1>	 heartbeat was not cleaned up
[17:15:59] <swfrench-wmf>	 Amir1: thanks! let me know if you need more hands with anything
[17:16:02] <Amir1>	 one second
[17:16:07] <swfrench-wmf>	 ah, got it
[17:17:35] <swfrench-wmf>	 holding off on proceeding past testservers until this is clear
[17:18:29] <swfrench-wmf>	 !incidents
[17:18:30] <sirenbot>	 6574 (ACKED)  db2167 (paged)/MariaDB Replica Lag: s8 (paged)
[17:18:30] <sirenbot>	 6575 (UNACKED)  db2166 (paged)/MariaDB Replica Lag: s8 (paged)
[17:18:30] <sirenbot>	 6576 (UNACKED)  db2163 (paged)/MariaDB Replica Lag: s8 (paged)
[17:18:30] <sirenbot>	 6577 (UNACKED)  db2152 (paged)/MariaDB Replica Lag: s8 (paged)
[17:18:31] <sirenbot>	 6578 (UNACKED)  db2161 (paged)/MariaDB Replica Lag: s8 (paged)
[17:18:31] <sirenbot>	 6579 (UNACKED)  db2154 (paged)/MariaDB Replica Lag: s8 (paged)
[17:18:31] <sirenbot>	 6580 (UNACKED)  db2164 (paged)/MariaDB Replica Lag: s8 (paged)
[17:18:31] <sirenbot>	 6570 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[17:18:31] <sirenbot>	 6571 (RESOLVED)  ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[17:18:55] <swfrench-wmf>	 !ack 6575 6576 6577 6578 6579 6580
[17:18:55] <sirenbot>	 Could not ack the alert. Please check the parameters.
[17:18:58] * Emperor here
[17:19:01] <swfrench-wmf>	 !ack 6575
[17:19:02] <sirenbot>	 6575 (ACKED)  db2166 (paged)/MariaDB Replica Lag: s8 (paged)
[17:19:03] <swfrench-wmf>	 !ack 6576
[17:19:04] <sirenbot>	 6576 (ACKED)  db2163 (paged)/MariaDB Replica Lag: s8 (paged)
[17:19:05] <swfrench-wmf>	 !ack 6577
[17:19:06] <sirenbot>	 6577 (ACKED)  db2152 (paged)/MariaDB Replica Lag: s8 (paged)
[17:19:07] <swfrench-wmf>	 !ack 6578
[17:19:08] <sirenbot>	 6578 (ACKED)  db2161 (paged)/MariaDB Replica Lag: s8 (paged)
[17:19:09] <swfrench-wmf>	 !ack 6579
[17:19:10] <sirenbot>	 6579 (ACKED)  db2154 (paged)/MariaDB Replica Lag: s8 (paged)
[17:19:12] <swfrench-wmf>	 !ack 6580
[17:19:13] <sirenbot>	 6580 (ACKED)  db2164 (paged)/MariaDB Replica Lag: s8 (paged)
[17:19:15] <Emperor>	 oncallers need any help? 
[17:19:22] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2166 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:19:25] <swfrench-wmf>	 there we go
[17:19:35] <Amir1>	 heartbeat on master of codfw was dead
[17:19:38] <Amir1>	 I restarted it 
[17:19:50] <Amir1>	 it fixed everything
[17:19:50] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2152 is OK: OK slave_sql_lag Replication lag: 0.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:19:51] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2163 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:20:06] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2161 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:20:07] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2154 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:20:14] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2164 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:20:15] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2167 is OK: OK slave_sql_lag Replication lag: 0.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:20:18] <Amir1>	 https://www.irccloud.com/pastebin/Jtpi3n8Y/
[17:20:25] <swfrench-wmf>	 Amir1: thank you very much! sounds like this is just something that was missed in the s8 primary switch in codfw?
[17:20:48] <Amir1>	 yeah
[17:20:50] <swfrench-wmf>	 (i.e., not a novel problem of some sort)
[17:20:54] <swfrench-wmf>	 cool, thank you
[17:21:05] <swfrench-wmf>	 !incidents
[17:21:05] <sirenbot>	 6580 (RESOLVED)  db2164 (paged)/MariaDB Replica Lag: s8 (paged)
[17:21:05] <sirenbot>	 6574 (RESOLVED)  db2167 (paged)/MariaDB Replica Lag: s8 (paged)
[17:21:06] <sirenbot>	 6579 (RESOLVED)  db2154 (paged)/MariaDB Replica Lag: s8 (paged)
[17:21:06] <sirenbot>	 6578 (RESOLVED)  db2161 (paged)/MariaDB Replica Lag: s8 (paged)
[17:21:06] <sirenbot>	 6576 (RESOLVED)  db2163 (paged)/MariaDB Replica Lag: s8 (paged)
[17:21:06] <sirenbot>	 6577 (RESOLVED)  db2152 (paged)/MariaDB Replica Lag: s8 (paged)
[17:21:06] <sirenbot>	 6575 (RESOLVED)  db2166 (paged)/MariaDB Replica Lag: s8 (paged)
[17:21:07] <sirenbot>	 6570 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[17:21:07] <sirenbot>	 6571 (RESOLVED)  ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[17:21:28] <Emperor>	 Did those get past the on-callers or is VO paging everyone again?
[17:21:57] <icinga-wm>	 RECOVERY - Disk space on an-worker1145 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1145&var-datasource=eqiad+prometheus/ops
[17:22:02] <swfrench-wmf>	 there was a ~ 5m delay between icinga-wm and the page, so I think it might have just snuck past both oncallers
[17:22:36] <Emperor>	 hey ho
[17:25:01] * swfrench-wmf is going to get this backport going again
[17:25:04] <federico3>	 oh wow
[17:25:08] <logmsgbot>	 !log swfrench@deploy1003 swfrench, eevans: Continuing with sync
[17:26:16] <wikibugs>	 sre-alert-triage, Data-Persistence: Alert in need of triage: ProbeDown (instance data-gateway-staging:30443) - https://phabricator.wikimedia.org/T399159#11079450 (BTullis)
[17:30:39] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: Backport for [[gerrit:1171702|image-suggestion: reconfigure for data-gateway listener (T368096)]] (duration: 22m 28s)
[17:30:44] <stashbot>	 T368096: mediawiki: migrate from image-suggestion to data-gateway - https://phabricator.wikimedia.org/T368096
[17:37:33] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1178043 (https://phabricator.wikimedia.org/T393762) (owner: Majavah)
[17:38:03] <wikibugs>	 (CR) Majavah: [C:+2] base: gen_fingerprints: Update sshd path for Trixie [puppet] - https://gerrit.wikimedia.org/r/1178043 (https://phabricator.wikimedia.org/T393762) (owner: Majavah)
[17:38:06] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2161 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:38:07] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:38:12] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2164 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 618.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:38:16] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2167 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:38:22] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2166 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 628.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:38:50] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2152 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:38:51] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2163 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.61 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:38:53] <bblack>	 again?
[17:39:00] <swfrench-wmf>	 seriously
[17:39:53] <rzl>	 acked all
[17:39:59] <swfrench-wmf>	 thanks, rzl!
[17:40:37] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] "Looks reasonable to me" [puppet] - 10https://gerrit.wikimedia.org/r/1177456 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn)
[17:41:11] <Amir1>	 Let me check
[17:41:22] <swfrench-wmf>	 pt-heartbeat-wikimedia.service is dead again on 2161?
[17:41:31] <rzl>	 from journalctl -fu pt-heartbeat-wikimedia on db2161 I see it started at 17:19:19 by Amir1 and then stopped at 17:27:52
[17:41:32] <rzl>	 yeah
[17:42:01] <rzl>	 hm, concurrent with a puppet run at 17:27:36
[17:42:08] <Amir1>	 was about to say
[17:42:10] <rzl>	 or at least suspiciously close
[17:42:12] <swfrench-wmf>	 I was just about to ask yeah
[17:42:14] <Amir1>	 this is a missing puppet patch
[17:42:18] <Amir1>	 one second
[17:42:19] <swfrench-wmf>	 ^ that
[17:42:46] <Amir1>	 the switchover was not done properly
[17:42:52] <Amir1>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178022
[17:43:10] <wikibugs>	 (03PS2) 10Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1178022 (https://phabricator.wikimedia.org/T401713)
[17:43:14] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1178022 (https://phabricator.wikimedia.org/T401713) (owner: 10Gerrit maintenance bot)
[17:45:05] <Amir1>	 running puppet agent
[17:45:14] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2181 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1039.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:45:35] <rzl>	 !ack 6590
[17:45:35] <sirenbot>	 6590 (ACKED)  db2181 (paged)/MariaDB Replica Lag: s8 (paged)
[17:46:14] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2164 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:46:15] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2167 is OK: OK slave_sql_lag Replication lag: 0.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:46:16] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2181 is OK: OK slave_sql_lag Replication lag: 0.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:46:21] <wikibugs>	 (03CR) 10Muehlenhoff: admin: stop using groups parsoid-roots and parsoid-admin (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn)
[17:46:23] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2166 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:46:48] <swfrench-wmf>	 lovely
[17:46:49] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2152 is OK: OK slave_sql_lag Replication lag: 0.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:46:50] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2163 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:47:07] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2154 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:47:08] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2161 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:47:08] <Amir1>	 the old primary is now not happy, probably because the row for heartbeat is now messed up
[17:47:22] <Amir1>	 but only one host
[17:47:26] <Amir1>	 https://orchestrator.wikimedia.org/web/cluster/alias/s8
[17:48:23] <Amir1>	 fixed
[17:48:31] <swfrench-wmf>	 Amir1: thank you very much once again!
[17:48:35] <rzl>	 Amir1: heroic, thank you
[17:48:43] <Amir1>	 <3
[17:49:39] <denisse>	 Thanks Amir!!
[17:54:52] <Amir1>	 for future reference this step was missed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178022
[17:54:52] <Amir1>	 > Merge gerrit puppet change to promote NEW primary: FIXME
[17:54:52] <Amir1>	 https://phabricator.wikimedia.org/T401713
[18:00:04] <jouncebot>	 jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T1800)
[18:05:31] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178054 (https://phabricator.wikimedia.org/T396375)
[18:05:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178054 (https://phabricator.wikimedia.org/T396375) (owner: 10TrainBranchBot)
[18:06:24] <wikibugs>	 (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178054 (https://phabricator.wikimedia.org/T396375) (owner: 10TrainBranchBot)
[18:06:35] <wikibugs>	 (03PS1) 10Andrew Bogott: sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055
[18:07:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott)
[18:07:41] <icinga-wm>	 RECOVERY - Disk space on an-worker1120 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1120&var-datasource=eqiad+prometheus/ops
[18:10:35] <wikibugs>	 (03PS2) 10Andrew Bogott: sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055
[18:11:01] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott)
[18:11:13] <wikibugs>	 (03PS3) 10Andrew Bogott: sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055
[18:11:22] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott)
[18:13:43] <logmsgbot>	 !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.14  refs T396375
[18:13:47] <stashbot>	 T396375: 1.45.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T396375
[18:13:56] <wikibugs>	 (03PS4) 10Andrew Bogott: sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055
[18:14:08] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott)
[18:17:10] <wikibugs>	 (03PS5) 10Andrew Bogott: sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055
[18:17:14] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott)
[18:19:25] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11079625 (10KFrancis) Thanks for checking in.  No further action is needed from you, @QChris.  We're waiting on legal counsel.  I just pinged him.
[18:20:31] <icinga-wm>	 RECOVERY - Disk space on an-worker1121 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops
[18:21:36] <wikibugs>	 (03PS6) 10Andrew Bogott: sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055
[18:21:41] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott)
[18:23:33] <wikibugs>	 (03CR) 10Majavah: [C:03+1] sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott)
[18:23:57] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "thanks, Ahmon" [puppet] - 10https://gerrit.wikimedia.org/r/1177456 (https://phabricator.wikimedia.org/T303725) (owner: 10Dzahn)
[18:24:09] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[18:24:17] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11079633 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors:...
[18:25:32] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11079651 (10Dzahn)
[18:26:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] sssd: separate out sssd.conf file mode into its own block [puppet] - 10https://gerrit.wikimedia.org/r/1178055 (owner: 10Andrew Bogott)
[18:30:16] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11079683 (10Dzahn) deployed on Icinga and integration
[18:33:21] <icinga-wm>	 RECOVERY - Disk space on an-worker1118 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops
[18:34:11] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[18:34:21] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11079702 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[18:36:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[18:37:59] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[18:41:59] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:47:24] <wikibugs>	 (03PS1) 10Andrew Bogott: neutron metadata agent: remove service restart [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742)
[18:47:33] <wikibugs>	 (03Abandoned) 10Andrew Bogott: neutron metadata agent: restart service daily [puppet] - 10https://gerrit.wikimedia.org/r/1173388 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott)
[18:48:01] <wikibugs>	 (03PS2) 10Andrew Bogott: neutron metadata agent: remove service restart [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742)
[18:55:02] <wikibugs>	 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11079827 (10ssingh) Hi folks: Thanks for confirming the extent of the changes from fr-tech's side. We discussed this a b...
[19:01:21] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[19:04:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[19:09:32] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:11:25] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[19:14:28] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for nokia switches codfw - cmooney@cumin1003"
[19:14:41] <wikibugs>	 (03PS1) 10Cathal Mooney: Add INCLUDEs for Netbox-generated files for new codfw subnets [dns] - 10https://gerrit.wikimedia.org/r/1178065 (https://phabricator.wikimedia.org/T380240)
[19:14:44] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for nokia switches codfw - cmooney@cumin1003"
[19:14:44] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:25:12] <wikibugs>	 (03PS4) 10Dzahn: admin: stop using groups parsoid-roots and parsoid-admin [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300)
[19:25:16] <wikibugs>	 (03CR) 10Dzahn: admin: stop using groups parsoid-roots and parsoid-admin (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn)
[19:26:13] <icinga-wm>	 RECOVERY - Disk space on an-worker1129 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1129&var-datasource=eqiad+prometheus/ops
[19:27:20] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Looks good, compared each v4 and v6 against Netbox." [dns] - 10https://gerrit.wikimedia.org/r/1178065 (https://phabricator.wikimedia.org/T380240) (owner: 10Cathal Mooney)
[19:27:41] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add INCLUDEs for Netbox-generated files for new codfw subnets [dns] - 10https://gerrit.wikimedia.org/r/1178065 (https://phabricator.wikimedia.org/T380240) (owner: 10Cathal Mooney)
[19:28:07] <logmsgbot>	 !log cmooney@dns2005 START - running authdns-update
[19:28:54] <logmsgbot>	 andrew@cumin2002 reimage (PID 2607923) is awaiting input
[19:28:54] <logmsgbot>	 !log cmooney@dns2005 END - running authdns-update
[19:29:40] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Experimentation Lab: Grant Access to analytics-privatedata-users for EGardner-WMF - https://phabricator.wikimedia.org/T401622#11079936 (10egardner)
[19:32:03] <icinga-wm>	 RECOVERY - Disk space on an-worker1140 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1140&var-datasource=eqiad+prometheus/ops
[19:34:21] <wikibugs>	 (03PS1) 10Hashar: Add option to use the public hostname of a registry [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1178068 (https://phabricator.wikimedia.org/T401733)
[19:41:09] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[19:41:22] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11079993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors:...
[19:49:38] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[19:50:23] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2213.codfw.wmnet with reason: Maintenance
[19:50:53] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott)
[19:51:07] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[19:51:35] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11080059 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[19:53:56] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[19:57:00] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2205.codfw.wmnet with reason: Maintenance
[19:57:27] <wikibugs>	 (03PS1) 10Dzahn: zuul: create empty dir /var/lib/zuul on new zuul main hosts [puppet] - 10https://gerrit.wikimedia.org/r/1178069 (https://phabricator.wikimedia.org/T395938)
[19:57:50] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] zuul: create empty dir /var/lib/zuul on new zuul main hosts [puppet] - 10https://gerrit.wikimedia.org/r/1178069 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)
[19:59:22] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+1] prometheus: add additional metrics from logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1175569 (owner: 10Cwhite)
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T2000).
[20:00:05] <jouncebot>	 dbrant and anzx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:19] <anzx>	 o/
[20:00:41] <dbrant>	 o/  I can self-deploy mine
[20:01:28] <wikibugs>	 (03PS3) 10Andrew Bogott: neutron metadata agent: remove service restart [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742)
[20:01:30] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott)
[20:02:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dbrant@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178007 (https://phabricator.wikimedia.org/T399630) (owner: 10Dbrant)
[20:03:37] <wikibugs>	 (03Merged) 10jenkins-bot: Add app_activity_tab event stream. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178007 (https://phabricator.wikimedia.org/T399630) (owner: 10Dbrant)
[20:04:01] <logmsgbot>	 !log dbrant@deploy1003 Started scap sync-world: Backport for [[gerrit:1178007|Add app_activity_tab event stream. (T399630)]]
[20:04:06] <stashbot>	 T399630: Activity Tab: instrumentation - https://phabricator.wikimedia.org/T399630
[20:05:39] <cjming>	 anzx: do you need a deployer?
[20:05:52] <anzx>	 cjming: yes
[20:06:08] <logmsgbot>	 !log dbrant@deploy1003 dbrant: Backport for [[gerrit:1178007|Add app_activity_tab event stream. (T399630)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:06:12] <cjming>	 dbrant: ping me when you're done and i can do the rest in the queue
[20:06:19] <dbrant>	 thx!
[20:06:37] <cjming>	 np!
[20:07:12] <logmsgbot>	 !log dbrant@deploy1003 dbrant: Continuing with sync
[20:11:58] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2165.codfw.wmnet with reason: Maintenance
[20:12:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T400854)', diff saved to https://phabricator.wikimedia.org/P81144 and previous config saved to /var/cache/conftool/dbconfig/20250812-201205-ladsgroup.json
[20:12:10] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[20:12:42] <logmsgbot>	 !log dbrant@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178007|Add app_activity_tab event stream. (T399630)]] (duration: 08m 41s)
[20:12:47] <stashbot>	 T399630: Activity Tab: instrumentation - https://phabricator.wikimedia.org/T399630
[20:12:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:13:20] <dbrant>	 cjming: done!
[20:13:31] <cjming>	 great - thanks!
[20:13:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T400854)', diff saved to https://phabricator.wikimedia.org/P81146 and previous config saved to /var/cache/conftool/dbconfig/20250812-201358-ladsgroup.json
[20:14:23] <wikibugs>	 (03PS2) 10Anzx: madwikisource: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178023 (https://phabricator.wikimedia.org/T391767)
[20:15:15] <cwhite>	 !log remove thanos-query.discovery.wmnet old puppet cert - T401671
[20:15:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:19] <stashbot>	 T401671: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://phabricator.wikimedia.org/T401671
[20:15:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178023 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx)
[20:16:33] <wikibugs>	 (03Merged) 10jenkins-bot: madwikisource: add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178023 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx)
[20:16:55] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1178023|madwikisource: add logo (T391767)]]
[20:17:00] <stashbot>	 T391767: Post-creation work for madwikisource - https://phabricator.wikimedia.org/T391767
[20:17:47] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[20:17:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T400854)', diff saved to https://phabricator.wikimedia.org/P81147 and previous config saved to /var/cache/conftool/dbconfig/20250812-201754-ladsgroup.json
[20:17:59] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[20:19:05] <logmsgbot>	 !log cjming@deploy1003 cjming, anzx: Backport for [[gerrit:1178023|madwikisource: add logo (T391767)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:19:29] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] prometheus: add additional metrics from logs [puppet] - 10https://gerrit.wikimedia.org/r/1175569 (owner: 10Cwhite)
[20:19:44] <anzx>	 cjming: looks good
[20:19:54] <cjming>	 cool - syncing
[20:19:57] <logmsgbot>	 !log cjming@deploy1003 cjming, anzx: Continuing with sync
[20:20:27] <wikibugs>	 (03PS2) 10Anzx: zghwiktionary: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178027 (https://phabricator.wikimedia.org/T399785)
[20:22:45] <icinga-wm>	 RECOVERY - Disk space on an-worker1133 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1133&var-datasource=eqiad+prometheus/ops
[20:22:48] <jinxer-wm>	 RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:23:10] <jinxer-wm>	 RESOLVED: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[20:24:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T400854)', diff saved to https://phabricator.wikimedia.org/P81148 and previous config saved to /var/cache/conftool/dbconfig/20250812-202437-ladsgroup.json
[20:24:43] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[20:25:16] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178023|madwikisource: add logo (T391767)]] (duration: 08m 21s)
[20:25:20] <stashbot>	 T391767: Post-creation work for madwikisource - https://phabricator.wikimedia.org/T391767
[20:26:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178027 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx)
[20:26:53] <wikibugs>	 (03Merged) 10jenkins-bot: zghwiktionary: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178027 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx)
[20:27:16] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1178027|zghwiktionary: add logos (T399785)]]
[20:27:20] <stashbot>	 T399785: Post-creation work for zghwiktionary - https://phabricator.wikimedia.org/T399785
[20:29:25] <logmsgbot>	 !log cjming@deploy1003 cjming, anzx: Backport for [[gerrit:1178027|zghwiktionary: add logos (T399785)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:29:37] <wikibugs>	 (03PS1) 10Dzahn: zuul::main: create /var/lib/zuul/.ssh/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1178073 (https://phabricator.wikimedia.org/T395938)
[20:29:57] <wikibugs>	 (03PS5) 10Anzx: madwikisource: set metanamespace, sitename and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767)
[20:30:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] zuul::main: create /var/lib/zuul/.ssh/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1178073 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)
[20:30:23] <anzx>	 cjming: looks good
[20:30:35] <cjming>	 nice
[20:30:43] <logmsgbot>	 !log cjming@deploy1003 cjming, anzx: Continuing with sync
[20:31:16] <wikibugs>	 (03PS2) 10Dzahn: zuul::main: create /var/lib/zuul/.ssh/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1178073 (https://phabricator.wikimedia.org/T395938)
[20:31:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] zuul::main: create /var/lib/zuul/.ssh/known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1178073 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn)
[20:35:58] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178027|zghwiktionary: add logos (T399785)]] (duration: 08m 42s)
[20:36:03] <stashbot>	 T399785: Post-creation work for zghwiktionary - https://phabricator.wikimedia.org/T399785
[20:36:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx)
[20:37:39] <wikibugs>	 (03Merged) 10jenkins-bot: madwikisource: set metanamespace, sitename and timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1177936 (https://phabricator.wikimedia.org/T391767) (owner: 10Anzx)
[20:38:00] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1177936|madwikisource: set metanamespace, sitename and timezone (T391767)]]
[20:38:04] <stashbot>	 T391767: Post-creation work for madwikisource - https://phabricator.wikimedia.org/T391767
[20:39:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P81149 and previous config saved to /var/cache/conftool/dbconfig/20250812-203945-ladsgroup.json
[20:40:05] <logmsgbot>	 !log cjming@deploy1003 cjming, anzx: Backport for [[gerrit:1177936|madwikisource: set metanamespace, sitename and timezone (T391767)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:41:47] <cjming>	 anzx: ^^?
[20:42:31] <anzx>	 cjming: ok to proceed
[20:42:42] <cjming>	 great
[20:42:45] <logmsgbot>	 !log cjming@deploy1003 cjming, anzx: Continuing with sync
[20:45:59] <logmsgbot>	 andrew@cumin2002 reimage (PID 2644504) is awaiting input
[20:47:06] <wikibugs>	 (03PS4) 10Andrew Bogott: neutron metadata agent: remove service restart [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742)
[20:48:02] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1177936|madwikisource: set metanamespace, sitename and timezone (T391767)]] (duration: 10m 02s)
[20:48:07] <stashbot>	 T391767: Post-creation work for madwikisource - https://phabricator.wikimedia.org/T391767
[20:48:38] <wikibugs>	 (03PS4) 10Anzx: minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785)
[20:48:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx)
[20:49:22] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott)
[20:50:28] <cjming>	 anzx: can you fix last patch?
[20:50:58] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] neutron metadata agent: remove service restart [puppet] - 10https://gerrit.wikimedia.org/r/1178062 (https://phabricator.wikimedia.org/T395742) (owner: 10Andrew Bogott)
[20:54:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P81150 and previous config saved to /var/cache/conftool/dbconfig/20250812-205453-ladsgroup.json
[20:55:12] <wikibugs>	 (03PS5) 10Anzx: minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785)
[20:55:54] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[20:57:33] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=search,name=eqiad
[20:58:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:58:31] <anzx>	 cjming: fixed
[20:58:50] <wikibugs>	 (03PS6) 10Anzx: minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785)
[20:59:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 18.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:59:22] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[20:59:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx)
[20:59:41] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11080335 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250812T2100)
[21:00:25] <wikibugs>	 (03Merged) 10jenkins-bot: minwikibooks , zghwiktionary : add project talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178000 (https://phabricator.wikimedia.org/T399785) (owner: 10Anzx)
[21:00:48] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1178000|minwikibooks , zghwiktionary : add project talk namespace aliases (T399785 T395499)]]
[21:00:54] <stashbot>	 T399785: Post-creation work for zghwiktionary - https://phabricator.wikimedia.org/T399785
[21:00:54] <stashbot>	 T395499: Post-creation work for minwikibooks - https://phabricator.wikimedia.org/T395499
[21:02:54] <logmsgbot>	 !log cjming@deploy1003 cjming, anzx: Backport for [[gerrit:1178000|minwikibooks , zghwiktionary : add project talk namespace aliases (T399785 T395499)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:03:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:03:33] <cjming>	 anzx: lmk ^
[21:04:38] <anzx>	 cjming: works fine
[21:05:44] <logmsgbot>	 !log cjming@deploy1003 cjming, anzx: Continuing with sync
[21:10:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T400854)', diff saved to https://phabricator.wikimedia.org/P81151 and previous config saved to /var/cache/conftool/dbconfig/20250812-211001-ladsgroup.json
[21:10:06] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[21:10:16] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[21:10:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T400854)', diff saved to https://phabricator.wikimedia.org/P81152 and previous config saved to /var/cache/conftool/dbconfig/20250812-211023-ladsgroup.json
[21:10:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[21:11:19] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178000|minwikibooks , zghwiktionary : add project talk namespace aliases (T399785 T395499)]] (duration: 10m 31s)
[21:11:24] <stashbot>	 T399785: Post-creation work for zghwiktionary - https://phabricator.wikimedia.org/T399785
[21:11:25] <stashbot>	 T395499: Post-creation work for minwikibooks - https://phabricator.wikimedia.org/T395499
[21:11:25] <anzx>	 cjming: please run namespace dupes  https://www.irccloud.com/pastebin/0s0zBow1/
[21:11:32] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] mediawiki-global: set sre as receiver of MediaWikiElevatedUnknownLogins [alerts] - 10https://gerrit.wikimedia.org/r/1175578 (https://phabricator.wikimedia.org/T395117) (owner: 10Cwhite)
[21:12:16] <cjming>	 anzx: also for minwikibooks right?
[21:12:28] <rzl>	 cjming: hi, looks like the high 5xx rate might be associated with one of those patches, can you check?
[21:12:29] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1120 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:29] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1076 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:29] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1082 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:32] <anzx>	 `mwscript-k8s --comment=T395499 --follow -- namespaceDupes minwikibooks --fix --add-prefix=T399785 | tee ~/T395499` for minwikibooks
[21:12:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1087 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1070 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:35] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1098 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:37] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1116 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:41] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1109 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:43] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1083 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:43] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1112 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1113 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:49] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1073 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1110 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:53] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1089 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:55] <cjming>	 rzl: oh - shoot
[21:12:58] <ryankemper>	 ^ expected, but there should be a maintenance window set. fixing
[21:12:59] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1069 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:01] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1097 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:01] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1124 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T400854)', diff saved to https://phabricator.wikimedia.org/P81153 and previous config saved to /var/cache/conftool/dbconfig/20250812-211303-ladsgroup.json
[21:13:07] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1080 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:07] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1122 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:07] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1100 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:07] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1092 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:07] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1114 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:07] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1118 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:07] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1077 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:08] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1086 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:08] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1095 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:09] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1115 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:09] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1074 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:10] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1093 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:10] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1071 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:11] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1117 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:11] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:12] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1085 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:12] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1099 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:13] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1111 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:13] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1108 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:14] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:14] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1094 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:15] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1096 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:15] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1090 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:16] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1081 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:16] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1107 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:17] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1079 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:17] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:18] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1121 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:18] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1119 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:19] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1091 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:19] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1125 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:20] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:21] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1102 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:21] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1075 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:22] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1088 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:22] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1084 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:23] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1101 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:23] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1078 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:13:25] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki-global: set sre as receiver of MediaWikiElevatedUnknownLogins [alerts] - 10https://gerrit.wikimedia.org/r/1175578 (https://phabricator.wikimedia.org/T395117) (owner: 10Cwhite)
[21:13:42] <ryankemper>	 sorry for the spam, downtime going up now :)
[21:13:46] <rzl>	 for clarity, the search spam is expected and is unrelated, the MW 5xx alert is genuine
[21:13:50] <swfrench-wmf>	 rzl: https://logstash.wikimedia.org/goto/43235ff51b952d7b80590b895fd761ff - seems to all be s8 / wikidatawiki
[21:14:07] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[21:14:21] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11080438 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors:...
[21:14:28] <cjming>	 rzl: not sure what to do for the MW 5xx alerts
[21:14:30] <rzl>	 swfrench-wmf: hm, nice
[21:14:50] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 55 hosts with reason: investigate cluster quorum failure
[21:14:51] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[21:14:55] <cjming>	 am i ok to run one more maintenance script?
[21:15:09] <rzl>	 cjming: did you roll anything out around 20:50 to 20:55, and if so, can you roll it back please? :)
[21:16:30] <cjming>	 rzl: i started one deployment at 20:36 ending 20:48, and last one 20:59 ending 21:11
[21:16:46] <rzl>	 first one is the suspect
[21:16:53] <rzl>	 just based on timing
[21:17:16] <rzl>	 looking at the patch I don't see anything likely to cause this, especially with the s8 connection swfrench-wmf points out, but as long as it's easy to rollback and rule it out, let's do that now please
[21:17:23] <cjming>	 ok - so revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1177936?
[21:17:31] <cjming>	 fyi anzx ^^
[21:18:33] <cjming>	 does that mean i should also rollback #415?  1177936 correlates to #414
[21:19:25] <swfrench-wmf>	 rzl: do you happen to know if there's anything special around wikidata using search? the reason I ask is this maybe correlates with when search-eqiad was depooled https://sal.toolforge.org/log/tNASoJgBffdvpiTr87jw
[21:19:33] <rzl>	 cjming: as the deployer I need you to make that decision, or to escalate to releng if you need help :)
[21:20:05] <rzl>	 swfrench-wmf: hm, the errors started rising before that specific log line, but it's plausible
[21:20:11] <cjming>	 ok - so then i will revert the last 2 deployments in this order: #415, then #414
[21:20:22] <rzl>	 ryankemper: can you weigh in on swfrench-wmf's question?
[21:20:27] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[21:20:34] <rzl>	 inflatador_ too ^
[21:20:44] <cjming>	 should i proceed?
[21:20:55] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[21:20:59] <rzl>	 cjming: yes
[21:21:08] <inflatador_>	 swfrench-wmf rzl :eyes
[21:21:32] <ryankemper>	 trying to see where the 5xx are coming from
[21:21:33] <wikibugs>	 (03PS1) 10Clare Ming: Revert "minwikibooks , zghwiktionary : add project talk namespace aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178080
[21:21:59] <wikibugs>	 (03CR) 10Anzx: [C:03+1] Revert "minwikibooks , zghwiktionary : add project talk namespace aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178080 (owner: 10Clare Ming)
[21:22:35] <ryankemper>	 when we depooled the cluster there shouldn't be immediate impact. at the moment the eqiad cluster was restarted there was very little remaining threadpool activity on opensearch 
[21:22:40] <cjming>	 can i revert both at same time or should i do one at a time?
[21:22:49] <ryankemper>	 so I wouldn't expect many mw errors as a result
[21:23:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178080 (owner: 10Clare Ming)
[21:23:25] <ryankemper>	 I'm a bit confused about the wikidata part of the question though. I might be missing some context from the backscroll
[21:23:37] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2206.codfw.wmnet with reason: Maintenance
[21:23:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T399249)', diff saved to https://phabricator.wikimedia.org/P81154 and previous config saved to /var/cache/conftool/dbconfig/20250812-212344-fceratto.json
[21:23:48] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[21:24:17] <rzl>	 ryankemper: so, the errors we're seeing are all in wikidata, https://logstash.wikimedia.org/goto/43235ff51b952d7b80590b895fd761ff
[21:24:18] <swfrench-wmf>	 ryankemper: inflatador_: onset of errors seems to be closer to 20:52, so unless anything went sideways around then on the search side of things (i.e., a couple minutes before the depool), then it's probably nothing
[21:24:32] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "minwikibooks , zghwiktionary : add project talk namespace aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178080 (owner: 10Clare Ming)
[21:24:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[21:24:54] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1178080|Revert "minwikibooks , zghwiktionary : add project talk namespace aliases"]]
[21:25:13] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1088 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57,
[21:25:13] <icinga-wm>	 number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10161, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:25:13] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1084 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57,
[21:25:13] <icinga-wm>	 of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10166, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:25:13] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1091 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57,
[21:25:14] <icinga-wm>	 of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10384, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:25:14] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1102 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 57,
[21:25:15] <icinga-wm>	 of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10417, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:25:15] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1075 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 58,
[21:25:16] <icinga-wm>	 of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10589, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:25:17] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_t
[21:25:18] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10676, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:25:18] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1078 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 58,
[21:25:19] <icinga-wm>	 of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10660, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:25:19] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1101 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 58,
[21:25:20] <icinga-wm>	 of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10692, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:25:20] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1125 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 60,
[21:25:21] <icinga-wm>	 of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 11739, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:26:16] <thcipriani>	 huh, I wonder why that would cause this spike
[21:26:35] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1083 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1469, active_shards: 3763, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 625, delayed_unassigned_shards: 0, number_of_pending
[21:26:35] <icinga-wm>	 51, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 68148, active_shards_percent_as_number: 85.13574660633483 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:26:35] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1112 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1469, active_shards: 3826, relocating_shards: 0, initializing_shards: 43, unassigned_shards: 551, delayed_unassigned_shards: 0, number_of_pending
[21:26:35] <icinga-wm>	 54, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 69195, active_shards_percent_as_number: 86.56108597285068 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:26:39] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1113 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1469, active_shards: 3986, relocating_shards: 0, initializing_shards: 39, unassigned_shards: 395, delayed_unassigned_shards: 0, number_of_pending
[21:26:39] <icinga-wm>	 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 72671, active_shards_percent_as_number: 90.18099547511312 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:26:41] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1073 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 54, number_of_data_nodes: 54, discovered_master: True, active_primary_shards: 1469, active_shards: 4014, relocating_shards: 0, initializing_shards: 52, unassigned_shards: 354, delayed_unassigned_shards: 0, number_of_pending
[21:26:41] <icinga-wm>	 21, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 73899, active_shards_percent_as_number: 90.81447963800905 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:26:45] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1089 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1469, active_shards: 4097, relocating_shards: 0, initializing_shards: 50, unassigned_shards: 273, delayed_unassigned_shards: 0, number_of_pending
[21:26:45] <icinga-wm>	 21, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 79167, active_shards_percent_as_number: 92.6923076923077 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:26:45] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1110 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1469, active_shards: 4097, relocating_shards: 0, initializing_shards: 50, unassigned_shards: 273, delayed_unassigned_shards: 0, number_of_pending
[21:26:45] <icinga-wm>	 21, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 79187, active_shards_percent_as_number: 92.6923076923077 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:26:46] <ryankemper>	 rzl: swfrench-wmf: ah let me inspect that logstash a bit
[21:26:51] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1069 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1469, active_shards: 4125, relocating_shards: 0, initializing_shards: 31, unassigned_shards: 264, delayed_unassigned_shards: 0, number_of_pending
[21:26:51] <icinga-wm>	 7, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 85068, active_shards_percent_as_number: 93.32579185520362 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:26:53] <thcipriani>	 I wonder if this is a coincidence and this is traffic
[21:26:53] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1124 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1469, active_shards: 4138, relocating_shards: 0, initializing_shards: 18, unassigned_shards: 264, delayed_unassigned_shards: 0, number_of_pending
[21:26:53] <icinga-wm>	 8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 86269, active_shards_percent_as_number: 93.61990950226244 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:26:53] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1097 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1469, active_shards: 4138, relocating_shards: 0, initializing_shards: 18, unassigned_shards: 264, delayed_unassigned_shards: 0, number_of_pending
[21:26:53] <icinga-wm>	 8, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 86273, active_shards_percent_as_number: 93.61990950226244 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:26:57] <rzl>	 I'm more and more convinced the errors are unrelated to both the backport and the search errors, but I'd still like to rule them out conclusively
[21:26:57] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1122 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1471, active_shards: 4179, relocating_shards: 0, initializing_shards: 12, unassigned_shards: 229, delayed_unassigned_shards: 0, number_of_pending
[21:26:57] <icinga-wm>	 29, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 91366, active_shards_percent_as_number: 94.5475113122172 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:26:59] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1111 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1471, active_shards: 4209, relocating_shards: 0, initializing_shards: 22, unassigned_shards: 189, delayed_unassigned_shards: 0, number_of_pending
[21:26:59] <icinga-wm>	 13, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 92161, active_shards_percent_as_number: 95.2262443438914 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:26:59] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1117 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1471, active_shards: 4209, relocating_shards: 0, initializing_shards: 22, unassigned_shards: 189, delayed_unassigned_shards: 0, number_of_pending
[21:26:59] <icinga-wm>	 13, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 92181, active_shards_percent_as_number: 95.2262443438914 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:26:59] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1093 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1471, active_shards: 4209, relocating_shards: 0, initializing_shards: 22, unassigned_shards: 189, delayed_unassigned_shards: 0, number_of_pending
[21:27:00] <logmsgbot>	 !log cjming@deploy1003 cjming: Backport for [[gerrit:1178080|Revert "minwikibooks , zghwiktionary : add project talk namespace aliases"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:27:23] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1082 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pending_
[21:27:23] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:27:23] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1076 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pending_
[21:27:23] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:27:23] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1120 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pending_
[21:27:24] <logmsgbot>	 !log cjming@deploy1003 cjming: Continuing with sync
[21:27:24] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:27:25] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1087 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pending_
[21:27:25] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:27:27] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1070 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pending_
[21:27:27] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:27:27] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1098 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pending_
[21:27:27] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:27:29] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1116 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pending_
[21:27:30] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:27:33] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1109 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1472, active_shards: 4371, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 36, delayed_unassigned_shards: 0, number_of_pending_
[21:27:33] <icinga-wm>	 , number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 98.89140271493213 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:27:39] <swfrench-wmf>	 rzl: it's shellbox-constraints: https://grafana.wikimedia.org/goto/u-CMub_Ng?orgId=1
[21:27:44] <swfrench-wmf>	 it's wildly overloaded
[21:28:07] <rzl>	 okay that tracks with the /wiki/Special:ConstraintReport/Qnnnnnn urls
[21:28:10] <swfrench-wmf>	 something started around 16:00 and only now has it hit the fan
[21:28:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P81155 and previous config saved to /var/cache/conftool/dbconfig/20250812-212811-ladsgroup.json
[21:28:53] <rzl>	 you're right about that delay though, interesting
[21:29:13] <cjming>	 presumably then the deployments that just happened are not the culprit?
[21:29:23] <cjming>	 should i continue reverting?
[21:29:24] <thcipriani>	 I think the spike of 500s is related to this as well, FWIW.
[21:29:53] <thcipriani>	 cjming: let's keep reverting, just to ensure it's not something non-obvious, I'd say.
[21:30:06] <cjming>	 alrighty
[21:30:10] <swfrench-wmf>	 rzl: shall I try throwing pods at the problem as a short-term mitigation?
[21:30:23] <wikibugs>	 (03PS1) 10Clare Ming: Revert "madwikisource: set metanamespace, sitename and timezone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178081
[21:30:24] <swfrench-wmf>	 (while we sort out the source of traffic)
[21:30:39] <wikibugs>	 (03PS1) 10Dzahn: create zuul.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1178082 (https://phabricator.wikimedia.org/T395938)
[21:30:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ...
[21:30:50] <jinxer-wm>	 CirrusSearch consumer-search@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow
[21:30:53] <rzl>	 swfrench-wmf: yeah, go for it
[21:30:53] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ...
[21:30:59] <jinxer-wm>	 CirrusSearch consumer-search@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow
[21:31:03] <rzl>	 belatedly I think I'm the IC :) I'll start a doc shortly
[21:31:06] <wikibugs>	 (03PS2) 10Dzahn: create zuul.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1178082 (https://phabricator.wikimedia.org/T395938)
[21:31:16] <ryankemper>	 rzl: <3
[21:31:28] <sbassett>	 rzl et al - so I should hold off on the security deployment I wanted to do real quick...
[21:31:35] <rzl>	 sbassett: yes please
[21:31:36] <wikibugs>	 (03CR) 10Anzx: [C:03+1] Revert "madwikisource: set metanamespace, sitename and timezone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178081 (owner: 10Clare Ming)
[21:31:47] <sbassett>	 understood
[21:32:05] <rzl>	 once the incident's over, I think cjming still has the floor but I'll let you both know
[21:32:17] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[21:32:28] <logmsgbot>	 !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[21:32:30] <rzl>	 https://trace.wikimedia.org/trace/221540befbfac471f5856d221f9b6675
[21:32:42] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178080|Revert "minwikibooks , zghwiktionary : add project talk namespace aliases"]] (duration: 07m 48s)
[21:32:42] <rzl>	 ^ jaeger trace for a slow Special:ConstraintReport request, we're living in the future
[21:33:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178081 (owner: 10Clare Ming)
[21:33:30] <wikibugs>	 (03PS1) 10Andrew Bogott: neutron metadata agent: remove service restart timer [puppet] - 10https://gerrit.wikimedia.org/r/1178083
[21:34:03] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "madwikisource: set metanamespace, sitename and timezone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178081 (owner: 10Clare Ming)
[21:34:12] <swfrench-wmf>	 rzl: doubled shellbox-constraints in codfw to 20 replicas. this is a dirty edit for now, as I'm not quite sure what the right size might be
[21:34:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 9.967% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:34:26] <logmsgbot>	 !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1178081|Revert "madwikisource: set metanamespace, sitename and timezone"]]
[21:34:38] <rzl>	 swfrench-wmf: you might be interested in that jaeger trace, a bunch of fast shellbox calls and a handful of very slow ones
[21:34:44] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[21:34:59] <rzl>	 we don't have a bunch of healthy pods and one very sad one, do we?
[21:35:44] <rzl>	 hm, doesn't seem like it
[21:35:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ...
[21:35:45] <jinxer-wm>	 CirrusSearch consumer-search@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow
[21:35:48] <wikibugs>	 (03PS2) 10Andrew Bogott: neutron metadata agent: remove service restart timer [puppet] - 10https://gerrit.wikimedia.org/r/1178083
[21:36:02] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ...
[21:36:08] <jinxer-wm>	 CirrusSearch consumer-search@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow
[21:36:13] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[21:36:16] <swfrench-wmf>	 rzl: I'm going to boop you in another channel
[21:36:20] <rzl>	 👍
[21:36:31] <logmsgbot>	 !log cjming@deploy1003 cjming: Backport for [[gerrit:1178081|Revert "madwikisource: set metanamespace, sitename and timezone"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:37:00] <wikibugs>	 (03PS1) 10Dzahn: trafficserver: create a map for zuul.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1178084
[21:37:02] <logmsgbot>	 !log cjming@deploy1003 cjming: Continuing with sync
[21:38:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[21:38:52] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] neutron metadata agent: remove service restart timer [puppet] - 10https://gerrit.wikimedia.org/r/1178083 (owner: 10Andrew Bogott)
[21:42:17] <logmsgbot>	 !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178081|Revert "madwikisource: set metanamespace, sitename and timezone"]] (duration: 07m 51s)
[21:42:41] <cjming>	 reverts of deployments #415 and #414 are done - i guess maybe that was it?
[21:43:02] <rzl>	 cjming: looks like we're satisfied this was traffic-related; thanks for rolling back even though it turned out not to be needed <3 please don't roll anything forward yet, let you know soon
[21:43:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P81156 and previous config saved to /var/cache/conftool/dbconfig/20250812-214318-ladsgroup.json
[21:43:30] <cjming>	 rzl: sounds good
[21:43:45] <cjming>	 anzx: sorry about that
[21:44:29] <anzx>	 cjming: no worries, will probably schedule it for tomorrow, thanks for deploying 
[21:45:28] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[21:52:33] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-kube-controllers-5d889dc5fc-kfbzh:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[21:54:42] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=search,name=eqiad
[21:58:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T400854)', diff saved to https://phabricator.wikimedia.org/P81157 and previous config saved to /var/cache/conftool/dbconfig/20250812-215826-ladsgroup.json
[21:58:31] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[21:58:42] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[21:58:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T400854)', diff saved to https://phabricator.wikimedia.org/P81158 and previous config saved to /var/cache/conftool/dbconfig/20250812-215849-ladsgroup.json
[22:00:05] <rzl>	 cjming, sbassett: okay, we're just cleaning up a little from the incident but SRE's comfortable with resuming deployments now, thanks a lot for your patience
[22:01:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T400854)', diff saved to https://phabricator.wikimedia.org/P81159 and previous config saved to /var/cache/conftool/dbconfig/20250812-220132-ladsgroup.json
[22:08:30] <rzl>	 cjming, sbassett: I'll leave the coordination between you -- cjming if you want to roll forward anzx's changes after all, no objections from me
[22:09:27] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:09:31] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:10:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:10:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:16:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P81160 and previous config saved to /var/cache/conftool/dbconfig/20250812-221639-ladsgroup.json
[22:17:33] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-kube-controllers-5d889dc5fc-kfbzh:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[22:24:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:30:22] <wikibugs>	 SRE, collaboration-services, Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#11080632 (Dzahn)
[22:31:48] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P81161 and previous config saved to /var/cache/conftool/dbconfig/20250812-223147-ladsgroup.json
[22:36:38] <jinxer-wm>	 FIRING: GnmiTargetDown: lsw1-d2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[22:37:27] <wikibugs>	 (PS1) Dzahn: zuul::main: allow caching layer to connect to http backend [puppet] - https://gerrit.wikimedia.org/r/1178093
[22:38:05] <wikibugs>	 (CR) CI reject: [V:-1] zuul::main: allow caching layer to connect to http backend [puppet] - https://gerrit.wikimedia.org/r/1178093 (owner: Dzahn)
[22:38:18] <wikibugs>	 (CR) Dzahn: [C:+2] add zoekt from upstream and blubber builder config to build it [container/codesearch] - https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[22:39:50] <wikibugs>	 (Merged) jenkins-bot: add zoekt from upstream and blubber builder config to build it [container/codesearch] - https://gerrit.wikimedia.org/r/1172390 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[22:40:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.121) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[22:40:46] <wikibugs>	 (PS1) RLazarus: shellbox-constraints: Bump replicas from 10 to 20 for traffic increase [deployment-charts] - https://gerrit.wikimedia.org/r/1178094
[22:46:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T400854)', diff saved to https://phabricator.wikimedia.org/P81162 and previous config saved to /var/cache/conftool/dbconfig/20250812-224655-ladsgroup.json
[22:47:05] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[22:47:10] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1195.eqiad.wmnet with reason: Maintenance
[22:47:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T400854)', diff saved to https://phabricator.wikimedia.org/P81163 and previous config saved to /var/cache/conftool/dbconfig/20250812-224717-ladsgroup.json
[22:47:44] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Reuven!" [deployment-charts] - https://gerrit.wikimedia.org/r/1178094 (owner: RLazarus)
[22:48:10] <wikibugs>	 (CR) RLazarus: [C:+2] shellbox-constraints: Bump replicas from 10 to 20 for traffic increase [deployment-charts] - https://gerrit.wikimedia.org/r/1178094 (owner: RLazarus)
[22:48:17] <wikibugs>	 (CR) Dzahn: gerrit: add daemons ssh host key to known_hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: Hashar)
[22:49:44] <wikibugs>	 (Merged) jenkins-bot: shellbox-constraints: Bump replicas from 10 to 20 for traffic increase [deployment-charts] - https://gerrit.wikimedia.org/r/1178094 (owner: RLazarus)
[22:50:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T400854)', diff saved to https://phabricator.wikimedia.org/P81164 and previous config saved to /var/cache/conftool/dbconfig/20250812-225001-ladsgroup.json
[22:51:39] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[22:51:57] <logmsgbot>	 !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[22:52:30] <wikibugs>	 SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11080722 (bd808)
[23:01:21] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[23:03:00] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1178073/6569/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1178073 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[23:04:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[23:05:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P81165 and previous config saved to /var/cache/conftool/dbconfig/20250812-230508-ladsgroup.json
[23:09:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:20:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P81166 and previous config saved to /var/cache/conftool/dbconfig/20250812-232016-ladsgroup.json
[23:21:00] <wikibugs>	 (PS3) Dzahn: zuul::main: create /var/lib/zuul/.ssh/known_hosts [puppet] - https://gerrit.wikimedia.org/r/1178073 (https://phabricator.wikimedia.org/T395938)
[23:25:33] <wikibugs>	 (CR) Dzahn: [C:+2] zuul::main: create /var/lib/zuul/.ssh/known_hosts [puppet] - https://gerrit.wikimedia.org/r/1178073 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[23:35:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T400854)', diff saved to https://phabricator.wikimedia.org/P81167 and previous config saved to /var/cache/conftool/dbconfig/20250812-233524-ladsgroup.json
[23:35:29] <stashbot>	 T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854
[23:35:40] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[23:35:58] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[23:36:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T400854)', diff saved to https://phabricator.wikimedia.org/P81168 and previous config saved to /var/cache/conftool/dbconfig/20250812-233605-ladsgroup.json
[23:38:21] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1178103
[23:38:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1178103 (owner: TrainBranchBot)
[23:38:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T400854)', diff saved to https://phabricator.wikimedia.org/P81169 and previous config saved to /var/cache/conftool/dbconfig/20250812-233843-ladsgroup.json
[23:45:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and KPN (139.156.127.121) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[23:52:58] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1178103 (owner: TrainBranchBot)
[23:53:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P81170 and previous config saved to /var/cache/conftool/dbconfig/20250812-235351-ladsgroup.json