[00:06:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P33775 and previous config saved to /var/cache/conftool/dbconfig/20220905-000606-ladsgroup.json
[00:13:24] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P33776 and previous config saved to /var/cache/conftool/dbconfig/20220905-002112-ladsgroup.json
[00:36:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T312863)', diff saved to https://phabricator.wikimedia.org/P33777 and previous config saved to /var/cache/conftool/dbconfig/20220905-003619-ladsgroup.json
[00:36:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1107.eqiad.wmnet with reason: Maintenance
[00:36:22] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[00:36:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1107.eqiad.wmnet with reason: Maintenance
[01:11:54] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[01:14:22] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 5 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[01:21:44] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:30:06] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:36:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:56:06] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:05:28] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:04] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:20:32] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 5 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[02:40:16] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:46:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T314041)', diff saved to https://phabricator.wikimedia.org/P33778 and previous config saved to /var/cache/conftool/dbconfig/20220905-024602-ladsgroup.json
[02:46:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance
[02:46:05] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[02:46:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance
[03:27:10] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:29:36] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:51:18] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:01:58] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:13:33] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[04:16:26] <icinga-wm>	 PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:18:28] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 8 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[04:42:48] <wikibugs>	 (PS7) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[04:58:04] <wikibugs>	 (PS1) Stang: Upload missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829329 (https://phabricator.wikimedia.org/T317004)
[04:59:43] <wikibugs>	 (PS1) Stang: Fix missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829330 (https://phabricator.wikimedia.org/T317004)
[05:02:42] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:10:38] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:17:48] <icinga-wm>	 RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:38:24] <wikibugs>	 (CR) Tim Starling: [C: +2] Multi-DC: go back to testwiki only [puppet] - https://gerrit.wikimedia.org/r/828677 (https://phabricator.wikimedia.org/T279664) (owner: Tim Starling)
[05:53:54] <wikibugs>	 (PS3) Giuseppe Lavagetto: deployment-prep: serve php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/824216 (https://phabricator.wikimedia.org/T306042)
[05:55:15] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] deployment-prep: serve php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/824216 (https://phabricator.wikimedia.org/T306042) (owner: Giuseppe Lavagetto)
[05:56:49] <wikibugs>	 (PS5) Giuseppe Lavagetto: Move 10% of traffic to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823678 (https://phabricator.wikimedia.org/T271736)
[06:07:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[06:08:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[06:11:38] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:13:54] <wikibugs>	 (PS1) Giuseppe Lavagetto: canary_appserver: use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829550 (https://phabricator.wikimedia.org/T271736)
[06:13:56] <wikibugs>	 (PS1) Giuseppe Lavagetto: jobrunner: convert to use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829551 (https://phabricator.wikimedia.org/T271736)
[06:13:58] <wikibugs>	 (PS1) Giuseppe Lavagetto: api appserver: convert to php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829552 (https://phabricator.wikimedia.org/T271736)
[06:14:00] <wikibugs>	 (PS1) Giuseppe Lavagetto: appserver: convert to php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829553
[06:28:24] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID cr2-eqiad:xe-4/1/3
[06:28:37] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr2-eqiad:xe-4/1/3
[06:30:48] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:36] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Lumen link between cr2-eqiad and cr2-esams down - Sept 2022 - https://phabricator.wikimedia.org/T317009 (ayounsi) From Lumen diagnostic tool: >  SERVICE ALARMS NEEDS ATTENTION We have detected equipment alarms. Further Investigation is required.   > Ticket ID...
[06:44:42] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T317009 - The acknowledgement expires at: 2022-09-06 06:44:25. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:44:42] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi https://phabricator.wikimedia.org/T317009 - The acknowledgement expires at: 2022-09-06 06:44:25. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:50:09] <wikibugs>	 (PS3) Giuseppe Lavagetto: mediawiki::jobrunner: allow picking a default php version [puppet] - https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042)
[06:50:38] <wikibugs>	 SRE, Data-Services: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (Ladsgroup) One random note: Can dumps migrate to apache from nginx? To standardize our infra so I don't look for apache logs in hurry in a Sunday.
[06:54:22] <wikibugs>	 (CR) Muehlenhoff: [C: +2] webperf: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829121 (owner: Muehlenhoff)
[06:55:21] <wikibugs>	 (CR) Muehlenhoff: [C: +2] releases: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829202 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[06:55:26] <wikibugs>	 (PS4) Giuseppe Lavagetto: mediawiki::jobrunner: allow picking a default php version [puppet] - https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042)
[06:55:28] <wikibugs>	 (PS3) Muehlenhoff: releases: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829202 (https://phabricator.wikimedia.org/T308013)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220905T0700)
[07:00:05] <jouncebot>	 koi and _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:20] <koi>	 o/
[07:00:32] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37110/console" [puppet] - https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042) (owner: Giuseppe Lavagetto)
[07:00:34] <urbanecm>	 o/
[07:00:35] <Amir1>	 Joe can self-serve
[07:01:05] <_joe_>	 Amir1: actually, I'd like to be served :D
[07:01:15] <_joe_>	 jokes aside, should I just go now or wait?
[07:01:23] <Amir1>	 https://deploy-commands.toolforge.org/bacc/823678
[07:01:33] <Amir1>	 _joe_: you go first, it's important 
[07:01:39] <urbanecm>	 _joe_: deploy your patch first, I'll deploy koi's after? :)
[07:01:44] <_joe_>	 ack
[07:02:17] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Move 10% of traffic to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823678 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[07:03:04] <wikibugs>	 (Merged) jenkins-bot: Move 10% of traffic to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823678 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[07:04:05] <_joe_>	 syncing
[07:07:10] <Amir1>	 once you all are done, ping me
[07:07:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:07:48] <logmsgbot>	 !log oblivian@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:823678|Move 10% of traffic to php 7.4 (T271736)]] (duration: 03m 50s)
[07:07:50] <stashbot>	 T271736: Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736
[07:08:26] <wikibugs>	 (PS1) Ladsgroup: Make English Wikipedia read new on templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/829556 (https://phabricator.wikimedia.org/T306673)
[07:10:13] <urbanecm>	 _joe_: are you done with your patch?
[07:10:21] <_joe_>	 urbanecm: yes
[07:10:25] <_joe_>	 sorry, the log was enough
[07:10:30] <_joe_>	 it's a simple config knob
[07:10:38] <urbanecm>	 I wasn't sure if there's any follow-up or anything :)
[07:10:47] <urbanecm>	 Thanks, going ahead with koi's patches now. 
[07:10:49] <_joe_>	 and tbh, I don't expect great changes until we switch 100% of the traffic over :)
[07:10:54] <_joe_>	 yes, sorry koi for the wait
[07:12:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:12:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:14:00] <wikibugs>	 (PS2) Urbanecm: Upload missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829329 (https://phabricator.wikimedia.org/T317004) (owner: Stang)
[07:14:05] <wikibugs>	 (CR) Urbanecm: [C: +2] Upload missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829329 (https://phabricator.wikimedia.org/T317004) (owner: Stang)
[07:14:16] <wikibugs>	 (PS2) Urbanecm: Fix missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829330 (https://phabricator.wikimedia.org/T317004) (owner: Stang)
[07:14:19] <wikibugs>	 (CR) Urbanecm: [C: +2] Fix missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829330 (https://phabricator.wikimedia.org/T317004) (owner: Stang)
[07:14:53] <wikibugs>	 (Merged) jenkins-bot: Upload missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829329 (https://phabricator.wikimedia.org/T317004) (owner: Stang)
[07:15:15] <wikibugs>	 (Merged) jenkins-bot: Fix missing logo for mniwiktionary and frwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/829330 (https://phabricator.wikimedia.org/T317004) (owner: Stang)
[07:15:32] <urbanecm>	 koi: your patch is at mwdebug1001
[07:15:38] <koi>	 looking
[07:15:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:16:52] <koi>	 urbanecm: both logos on these two sites LGTM
[07:17:21] <urbanecm>	 great, syncing
[07:19:35] <moritzm>	 !log installing ghostscript security updates
[07:19:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:22:02] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized static/images/project-logos/: ff2e1082d8b3fe0ba93cd37a1b516dece84a834b: Upload missing logo for mniwiktionary and frwikiquote (T317004) (duration: 03m 50s)
[07:22:04] <stashbot>	 T317004: Missing logo for mniwiktionary and frwikiquote - https://phabricator.wikimedia.org/T317004
[07:25:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:25:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:25:39] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: 739920ceb09358a2ea89d82494522876fffd2621: Fix missing logo for mniwiktionary and frwikiquote (T317004) (duration: 03m 36s)
[07:25:46] <urbanecm>	 koi: should be live now
[07:25:48] <wikibugs>	 (PS1) Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references. [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673)
[07:26:04] <urbanecm>	 Amir1: over to you :)
[07:26:15] <Amir1>	 awesome
[07:26:30] <wikibugs>	 (PS2) Ladsgroup: Make English Wikipedia read new on templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/829556 (https://phabricator.wikimedia.org/T306673)
[07:26:40] <wikibugs>	 (CR) Ladsgroup: [C: +2] Make English Wikipedia read new on templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/829556 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup)
[07:27:34] <wikibugs>	 (Merged) jenkins-bot: Make English Wikipedia read new on templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/829556 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup)
[07:29:09] <wikibugs>	 (CR) Elukey: Add a helmfile configuration for the dse-k8s-eqiad cluster (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/826836 (https://phabricator.wikimedia.org/T310196) (owner: Btullis)
[07:29:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:32:52] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:829556|Make English Wikipedia read new on templatelinks migration (T306673)]] (duration: 03m 31s)
[07:32:56] <stashbot>	 T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673
[07:34:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:38:14] <wikibugs>	 (PS1) Ayounsi: Rename Telia to Arelion [homer/public] - https://gerrit.wikimedia.org/r/829558
[07:38:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:38:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:42:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:44:12] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:48:24] <wikibugs>	 (CR) Muehlenhoff: Systemd timer: Cleanup a few dangling absent cronjob references. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:51:38] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:52:22] <wikibugs>	 (PS1) Slyngshede: P:openldap::management rename variable. [puppet] - https://gerrit.wikimedia.org/r/829559 (https://phabricator.wikimedia.org/T273673)
[07:52:58] <wikibugs>	 (CR) CI reject: [V: -1] P:openldap::management rename variable. [puppet] - https://gerrit.wikimedia.org/r/829559 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:57:21] <wikibugs>	 (PS8) Stang: logos: Cover wordmark/tagline in manage.py [mediawiki-config] - https://gerrit.wikimedia.org/r/829298 (https://phabricator.wikimedia.org/T307705)
[07:57:23] <wikibugs>	 (PS1) Stang: Replace wordmark/tagline with correct naming style [mediawiki-config] - https://gerrit.wikimedia.org/r/829561 (https://phabricator.wikimedia.org/T307705)
[07:58:05] <wikibugs>	 (PS2) Slyngshede: P:openldap::management rename variable. [puppet] - https://gerrit.wikimedia.org/r/829559 (https://phabricator.wikimedia.org/T273673)
[08:01:37] <XioNoX>	 !log rename Telia to Arelion in Netbox
[08:01:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:52] <wikibugs>	 (PS1) Ladsgroup: Stop writing to old templatelinks fields in s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/829562 (https://phabricator.wikimedia.org/T312865)
[08:03:25] <wikibugs>	 (PS1) Stang: Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705)
[08:04:51] <wikibugs>	 (PS2) Ladsgroup: Stop writing to old templatelinks fields in s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/829562 (https://phabricator.wikimedia.org/T312865)
[08:05:06] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/829559 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:06:07] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37111/console" [puppet] - https://gerrit.wikimedia.org/r/829559 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:07:00] <wikibugs>	 (CR) Ladsgroup: [C: +2] Stop writing to old templatelinks fields in s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/829562 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup)
[08:07:58] <wikibugs>	 (Merged) jenkins-bot: Stop writing to old templatelinks fields in s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/829562 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup)
[08:08:37] <wikibugs>	 (PS2) David Caro: dynamicproxy: add simple compile test [puppet] - https://gerrit.wikimedia.org/r/826299
[08:09:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] tox: add py310 [alerts] - https://gerrit.wikimedia.org/r/829111 (owner: David Caro)
[08:10:17] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37112/console" [puppet] - https://gerrit.wikimedia.org/r/829203 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:12:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:13:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[08:13:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[08:14:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[08:14:11] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:829562|Stop writing to old templatelinks fields in s7 (T312865)]] (duration: 03m 51s)
[08:14:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[08:14:15] <stashbot>	 T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865
[08:14:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[08:14:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[08:14:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[08:15:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[08:15:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:15:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:15:40] <wikibugs>	 (CR) David Caro: [C: +2] "Safe and small enough, merging." [puppet] - https://gerrit.wikimedia.org/r/826299 (owner: David Caro)
[08:17:39] <wikibugs>	 (CR) David Caro: [C: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/826986 (https://phabricator.wikimedia.org/T316463) (owner: Majavah)
[08:18:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:18:28] <wikibugs>	 (CR) David Caro: [C: +2] tox: add py310 [alerts] - https://gerrit.wikimedia.org/r/829111 (owner: David Caro)
[08:19:52] <wikibugs>	 (PS2) Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references. [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673)
[08:20:14] <wikibugs>	 (CR) David Caro: [C: +2] P:toolforge::grid: remove legacy host key stuff [puppet] - https://gerrit.wikimedia.org/r/829305 (owner: Majavah)
[08:21:19] <wikibugs>	 (CR) Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:21:28] <wikibugs>	 (PS3) Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references. [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673)
[08:22:00] <wikibugs>	 (Merged) jenkins-bot: tox: add py310 [alerts] - https://gerrit.wikimedia.org/r/829111 (owner: David Caro)
[08:24:48] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +1] udp2log: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829203 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:27:26] <wikibugs>	 SRE: Degraded RAID on cp4021 - https://phabricator.wikimedia.org/T293225 (Volans) Open→Resolved a:Volans Obsolete, resolving.
[08:28:00] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: output thanos-query syslogs to kafka and local file [puppet] - https://gerrit.wikimedia.org/r/828960 (https://phabricator.wikimedia.org/T316867) (owner: Herron)
[08:30:24] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] P:prometheus::nutcracker_exporter: Order service and package [puppet] - https://gerrit.wikimedia.org/r/828504 (owner: Clément Goubert)
[08:30:31] <wikibugs>	 (CR) Muehlenhoff: Systemd timer: Cleanup a few dangling absent cronjob references. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:31:21] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] P:mediawiki::php: Order wmerrors config and package install [puppet] - https://gerrit.wikimedia.org/r/828500 (owner: Clément Goubert)
[08:32:17] <wikibugs>	 (PS3) Muehlenhoff: udp2log: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829203 (https://phabricator.wikimedia.org/T308013)
[08:33:00] <wikibugs>	 (PS2) Stang: Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705)
[08:35:12] <wikibugs>	 (CR) Muehlenhoff: [C: +2] udp2log: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829203 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:37:14] <godog>	 jouncebot: next
[08:37:14] <jouncebot>	 In 4 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220905T1300)
[08:39:39] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet
[08:40:22] <wikibugs>	 (CR) Vgutierrez: [C: +1] "preview available here: https://grafana.wikimedia.org/dashboard/snapshot/8g27u7vLB6Hlc0EA0FK3zRmzf7WaivB3?orgId=1" [grafana-grizzly] - https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921) (owner: Vgutierrez)
[08:41:30] <wikibugs>	 (PS4) Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references. [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673)
[08:41:34] <icinga-wm>	 PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:42:53] <wikibugs>	 (CR) Klausman: [C: +1] api-gateway: Distinguish between internal host and host header setting [deployment-charts] - https://gerrit.wikimedia.org/r/829216 (owner: Hnowlan)
[08:43:06] <wikibugs>	 (PS5) Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references. [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673)
[08:43:29] <wikibugs>	 SRE, SRE-swift-storage, Data Engineering Planning, Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (fgiunchedi) Thank you for following up, I think the culprit is the fact that the S3 compat API stores chunk...
[08:44:04] <wikibugs>	 (CR) Muehlenhoff: Systemd timer: Cleanup a few dangling absent cronjob references. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:48:48] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1004.eqiad.wmnet
[08:48:57] <wikibugs>	 10SRE, 10Observability-Metrics: Not all carbon service start at graphite reboot - https://phabricator.wikimedia.org/T316747 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi
[08:49:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[08:50:49] <wikibugs>	 (03PS3) 10Samtar: CommonSettings: Load Phonos extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824294 (https://phabricator.wikimedia.org/T314294)
[08:51:04] <wikibugs>	 (03CR) 10Samtar: CommonSettings: Load Phonos extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824294 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar)
[08:51:22] <wikibugs>	 (03PS3) 10Muehlenhoff: k8s: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829205 (https://phabricator.wikimedia.org/T308013)
[08:53:04] <wikibugs>	 (03PS6) 10Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references. [puppet] - 10https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673)
[08:53:53] <wikibugs>	 (03CR) 10Slyngshede: Systemd timer: Cleanup a few dangling absent cronjob references. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[08:54:29] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:openldap::management rename variable. [puppet] - 10https://gerrit.wikimedia.org/r/829559 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[08:55:54] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1006.eqiad.wmnet
[08:57:02] <icinga-wm>	 PROBLEM - Ensure legal html en.wb on en.wikibooks.org is CRITICAL: Text\sis\savailable\sunder\sthe a\shref=\/\/creativecommons\.org\/licenses\/by-sa\/3\.0\/Creative\sCommons\sAttribution-ShareAlike\sLicense./a: additional\sterms\smay\sapply\. html not found https://phabricator.wikimedia.org/project/members/28/
[08:59:01] <wikibugs>	 (03Abandoned) 10Jon Harald Søby: Remove GeoCrumbs from the Wikimedia Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826279 (https://phabricator.wikimedia.org/T316109) (owner: 10Jon Harald Søby)
[09:02:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Lumen link between cr2-eqiad and cr2-esams down - Sept 2022 - https://phabricator.wikimedia.org/T317009 (10ayounsi) > I am seeing an issue on our SUBSEA portion of the circuit. I am engaging my SUBSEA group at this time.
[09:03:20] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1006.eqiad.wmnet
[09:04:24] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1008.eqiad.wmnet
[09:04:41] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [restbase/deploy@a571f9a]: Add pcmwiki T310880
[09:04:43] <stashbot>	 T310880: Post-creation work for pcmwiki - https://phabricator.wikimedia.org/T310880
[09:05:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] k8s: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/829205 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:05:47] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [restbase/deploy@a571f9a]: Add pcmwiki T310880 (duration: 01m 06s)
[09:06:12] <wikibugs>	 (03PS3) 10Stang: Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705)
[09:11:46] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1008.eqiad.wmnet
[09:14:10] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1010.eqiad.wmnet
[09:16:01] <wikibugs>	 (03PS1) 10David Caro: p::metricsinfra:haproxy: move to epp template [puppet] - 10https://gerrit.wikimedia.org/r/829743
[09:17:04] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Squid: permit production networks instead of aggregate_networks [puppet] - 10https://gerrit.wikimedia.org/r/827964 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi)
[09:17:05] <moritzm>	 !log installing flac security updates
[09:17:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:55] <XioNoX>	 !log Squid: permit production networks instead of aggregate_networks - T265864 
[09:17:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:57] <stashbot>	 T265864: Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864
[09:20:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] p::metricsinfra:haproxy: move to epp template [puppet] - 10https://gerrit.wikimedia.org/r/829743 (owner: 10David Caro)
[09:22:26] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:22:35] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1010.eqiad.wmnet
[09:23:06] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1007.eqiad.wmnet with OS bullseye
[09:23:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[09:23:12] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye
[09:23:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[09:23:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T314041)', diff saved to https://phabricator.wikimedia.org/P33779 and previous config saved to /var/cache/conftool/dbconfig/20220905-092338-ladsgroup.json
[09:23:41] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[09:24:27] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[09:24:36] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[09:24:51] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[09:25:36] <wikibugs>	 (03PS2) 10Jbond: raid: use modern nrpe defines [puppet] - 10https://gerrit.wikimedia.org/r/825740 (owner: 10Majavah)
[09:25:40] <btullis>	 !log deployed calico to dse-k8s cluster T310174
[09:25:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:43] <stashbot>	 T310174: Configure routing for dse-k8s cluster - https://phabricator.wikimedia.org/T310174
[09:25:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/825740 (owner: 10Majavah)
[09:26:22] <wikibugs>	 (03PS1) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031)
[09:27:15] <wikibugs>	 (03PS2) 10David Caro: p::metricsinfra:haproxy: move to epp template [puppet] - 10https://gerrit.wikimedia.org/r/829743
[09:27:17] <wikibugs>	 (03PS2) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031)
[09:28:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/816105 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[09:28:22] <wikibugs>	 (03PS3) 10David Caro: p::metricsinfra:haproxy: move to epp template [puppet] - 10https://gerrit.wikimedia.org/r/829743
[09:28:24] <wikibugs>	 (03PS3) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031)
[09:28:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:admin: add support for deprecated groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825370 (https://phabricator.wikimedia.org/T248161) (owner: 10Jbond)
[09:29:18] <wikibugs>	 (03PS1) 10Jelto: gitlab: reduce backup_keep_time to 1d [puppet] - 10https://gerrit.wikimedia.org/r/829747 (https://phabricator.wikimedia.org/T274463)
[09:29:26] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1012.eqiad.wmnet
[09:30:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: 10David Caro)
[09:32:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] apt::noupgrade: remove [puppet] - 10https://gerrit.wikimedia.org/r/826350 (owner: 10Majavah)
[09:32:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/826350 (owner: 10Majavah)
[09:32:51] <wikibugs>	 (03PS4) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031)
[09:33:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: 10David Caro)
[09:34:55] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[09:35:41] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1007.eqiad.wmnet with reason: host reimage
[09:37:03] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[09:37:29] <wikibugs>	 (03CR) 10Clément Goubert: [C: 04-1] "Comment on where fcgi_proxies ordering comes in needs to be corrected, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042) (owner: 10Giuseppe Lavagetto)
[09:38:05] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1012.eqiad.wmnet
[09:39:31] <wikibugs>	 (03CR) 10Jbond: "LGTM but will leave to wmcs for the final approval" [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah)
[09:39:32] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1007.eqiad.wmnet with reason: host reimage
[09:39:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Remove support for overriding LDAP client stack [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah)
[09:41:29] <icinga-wm>	 RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:44:15] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia.org - https://phabricator.wikimedia.org/T316285 (10Volans) a:05Volans→03None @Andrew what is the issue that you're still seeing? It looks good to me. I see that the host is correctly...
[09:44:22] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37113/lists1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/828016 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi)
[09:44:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/826560 (https://phabricator.wikimedia.org/T311218) (owner: 10Ayounsi)
[09:46:44] <wikibugs>	 (03PS4) 10Stang: Move wmgSiteLogoWordmark and wmgSiteLogoTagline to logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829563 (https://phabricator.wikimedia.org/T307705)
[09:47:05] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[09:54:17] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:54:47] <wikibugs>	 (03CR) 10Jbond: "lgtm but see comment inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/826798 (owner: 10Muehlenhoff)
[09:56:22] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [restbase/deploy@79b3cd2]: Add guwwiktionary and bjnwiktionary T309058 T312216
[09:56:25] <stashbot>	 T312216: Add bjnwiktionary to RESTBase - https://phabricator.wikimedia.org/T312216
[09:56:26] <stashbot>	 T309058: Add guwwiktionary to RESTBase - https://phabricator.wikimedia.org/T309058
[09:56:44] <wikibugs>	 (03PS1) 10Volans: ganeti-netbox-sync: fix dry-run behaviour [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/829751 (https://phabricator.wikimedia.org/T314794)
[09:57:14] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: mediawiki::jobrunner: allow picking a default php version [puppet] - 10https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042)
[09:57:17] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: mediawiki::jobrunner: allow picking a default php version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042) (owner: 10Giuseppe Lavagetto)
[09:58:20] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042) (owner: 10Giuseppe Lavagetto)
[09:59:04] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37114/mx1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/828019 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi)
[10:02:26] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::jobrunner: allow picking a default php version [puppet] - 10https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042) (owner: 10Giuseppe Lavagetto)
[10:02:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "This LGTM, though I see from https://phabricator.wikimedia.org/T315866#8194791 the alert might be noisy :| Let's go ahead with it though a" [alerts] - 10https://gerrit.wikimedia.org/r/825294 (https://phabricator.wikimedia.org/T315866) (owner: 10Ladsgroup)
[10:03:17] <wikibugs>	 (03PS1) 10Muehlenhoff: Mark several access groups as deprecated [puppet] - 10https://gerrit.wikimedia.org/r/829754
[10:03:43] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/828035 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi)
[10:03:52] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] ganeti-netbox-sync: fix dry-run behaviour [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/829751 (https://phabricator.wikimedia.org/T314794) (owner: 10Volans)
[10:04:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Enable pynetbox threading [software/homer] - 10https://gerrit.wikimedia.org/r/828031 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi)
[10:04:21] <wikibugs>	 (03PS2) 10Muehlenhoff: Mark several access groups as deprecated [puppet] - 10https://gerrit.wikimedia.org/r/829754 (https://phabricator.wikimedia.org/T248161)
[10:05:11] <wikibugs>	 (03CR) 10Muehlenhoff: Allow cookbooks to handle restarts based on running one or more commands (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/826798 (owner: 10Muehlenhoff)
[10:05:51] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1013.eqiad.wmnet
[10:05:54] <wikibugs>	 (03CR) 10Jbond: "thanks <3" [puppet] - 10https://gerrit.wikimedia.org/r/825755 (owner: 10Clément Goubert)
[10:05:58] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/37115/otrs1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/828025 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi)
[10:08:51] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: deployment-prep: convert jobrunner to use php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/824218 (https://phabricator.wikimedia.org/T306042)
[10:09:16] <wikibugs>	 (03CR) 10Roman Stolar: [C: 03+1] "LGTM" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan)
[10:11:27] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [restbase/deploy@79b3cd2]: Add guwwiktionary and bjnwiktionary T309058 T312216 (duration: 15m 05s)
[10:11:30] <stashbot>	 T312216: Add bjnwiktionary to RESTBase - https://phabricator.wikimedia.org/T312216
[10:11:31] <stashbot>	 T309058: Add guwwiktionary to RESTBase - https://phabricator.wikimedia.org/T309058
[10:12:16] <wikibugs>	 (03PS1) 10David Caro: p::wmcs:prometheus: Add cloudvps federation job [puppet] - 10https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982)
[10:12:21] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment-prep: convert jobrunner to use php 7.4 by default [puppet] - 10https://gerrit.wikimedia.org/r/824218 (https://phabricator.wikimedia.org/T306042) (owner: 10Giuseppe Lavagetto)
[10:13:00] <wikibugs>	 (03PS2) 10David Caro: p::wmcs:prometheus: Add cloudvps federation job [puppet] - 10https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982)
[10:13:33] <XioNoX>	 !log upgrade python-pynetbox to 6.6 on netbox frontends - T310745
[10:13:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:36] <stashbot>	 T310745: Upgrade pynetbox - https://phabricator.wikimedia.org/T310745
[10:14:16] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1013.eqiad.wmnet
[10:16:09] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:17:31] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1014.eqiad.wmnet
[10:17:52] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] "Thanks for the responses, I think this can be merged." [puppet] - 10https://gerrit.wikimedia.org/r/790710 (https://phabricator.wikimedia.org/T290494) (owner: 10Majavah)
[10:19:48] <wikibugs>	 (03PS2) 10Clément Goubert: C:cpufrequtils: Order install configuration and service [puppet] - 10https://gerrit.wikimedia.org/r/828502
[10:20:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:cpufrequtils: Order install configuration and service [puppet] - 10https://gerrit.wikimedia.org/r/828502 (owner: 10Clément Goubert)
[10:21:22] <wikibugs>	 (03CR) 10Clément Goubert: "Can you take a look? ensure_packages ordering is iffy on a new install and makes us do multiple puppet runs to achieve the desired state." [puppet] - 10https://gerrit.wikimedia.org/r/828502 (owner: 10Clément Goubert)
[10:21:45] <wikibugs>	 (03PS1) 10Vgutierrez: mtail::atsbackend: Increase processing time to 150ms [puppet] - 10https://gerrit.wikimedia.org/r/829757 (https://phabricator.wikimedia.org/T316921)
[10:22:15] <wikibugs>	 (03PS3) 10Clément Goubert: C:cpufrequtils: Order install configuration and service [puppet] - 10https://gerrit.wikimedia.org/r/828502
[10:22:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mtail::atsbackend: Increase processing time to 150ms [puppet] - 10https://gerrit.wikimedia.org/r/829757 (https://phabricator.wikimedia.org/T316921) (owner: 10Vgutierrez)
[10:22:58] <wikibugs>	 (03CR) 10Jbond: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 (owner: 10Muehlenhoff)
[10:23:11] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:23:31] <wikibugs>	 (03CR) 10Vlad.shapik: [C: 03+1] "Looks good to me." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan)
[10:24:32] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1014.eqiad.wmnet
[10:25:27] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[10:25:51] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] P:prometheus::nutcracker_exporter: Order service and package [puppet] - 10https://gerrit.wikimedia.org/r/828504 (owner: 10Clément Goubert)
[10:25:59] <wikibugs>	 (03PS1) 10Ayounsi: Enable pynetbox threading for generate_dns_snippets.py [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/829758 (https://phabricator.wikimedia.org/T311486)
[10:26:17] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device associations in Netbox - https://phabricator.wikimedia.org/T296832 (10jbond) >>! In T296832#8205878, @Volans wrote: >>>! In T296832#8143427, @cmooney wrote: >> For a bit of context the ab...
[10:26:23] <wikibugs>	 (03PS2) 10Vgutierrez: mtail::atsbackend: Increase processing time to 150ms [puppet] - 10https://gerrit.wikimedia.org/r/829757 (https://phabricator.wikimedia.org/T316921)
[10:26:47] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: Distinguish between internal host and host header setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/829216 (owner: 10Hnowlan)
[10:26:55] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, let's test it!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/829758 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi)
[10:27:05] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] Fix environment in prep stage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan)
[10:27:28] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1015.eqiad.wmnet
[10:28:00] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Enable pynetbox threading for generate_dns_snippets.py [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/829758 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi)
[10:28:13] <wikibugs>	 (03PS3) 10Vgutierrez: Add Trafficserver SLO dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921)
[10:29:34] <wikibugs>	 (03PS3) 10Clément Goubert: P:mediawiki::php: Order wmerrors config and package install [puppet] - 10https://gerrit.wikimedia.org/r/828500
[10:29:40] <wikibugs>	 (03PS1) 10Stang: Re-download and optimize wordmark/tagline svg file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829760 (https://phabricator.wikimedia.org/T307705)
[10:30:13] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] C:ipmi::monitor: Order service after package install [puppet] - 10https://gerrit.wikimedia.org/r/828494 (owner: 10Clément Goubert)
[10:30:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Adding Arnold and Daniel. As far as I am concerned, this LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/828025 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi)
[10:30:47] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: Distinguish between internal host and host header setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/829216 (owner: 10Hnowlan)
[10:31:11] <wikibugs>	 (03CR) 10Jbond: global: drop owner/group => root from file resources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809139 (owner: 10Jbond)
[10:31:48] <wikibugs>	 10SRE, 10Data-Services: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (10jcrespo) I believe this was the impact and subsequent mitigation on eqord router (but hopefully someone can confirm): {F35509646}
[10:31:51] <wikibugs>	 10SRE-swift-storage, 10User-fgiunchedi: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 (10fgiunchedi) 05Stalled→03Resolved a:03fgiunchedi Resolving since Thanos retention has been trimmed, more space is being freed as part of {T314835}
[10:32:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/829024 (owner: 10Muehlenhoff)
[10:32:15] <wikibugs>	 (03CR) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 (owner: 10Muehlenhoff)
[10:32:24] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 (owner: 10Muehlenhoff)
[10:34:14] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] P:mediawiki::php: Order wmerrors config and package install [puppet] - 10https://gerrit.wikimedia.org/r/828500 (owner: 10Clément Goubert)
[10:34:25] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove sre.misc-clusters.sretest [cookbooks] - 10https://gerrit.wikimedia.org/r/829024
[10:35:29] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[10:36:04] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] mtail::atsbackend: Increase processing time to 150ms [puppet] - 10https://gerrit.wikimedia.org/r/829757 (https://phabricator.wikimedia.org/T316921) (owner: 10Vgutierrez)
[10:36:13] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1015.eqiad.wmnet
[10:37:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/829200 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[10:37:27] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:37:36] <wikibugs>	 (03PS2) 10Ayounsi: Bump pynetbox to ~= 6.6 [software/spicerack] - 10https://gerrit.wikimedia.org/r/820806 (https://phabricator.wikimedia.org/T310745)
[10:37:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/829201 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[10:37:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T314041)', diff saved to https://phabricator.wikimedia.org/P33780 and previous config saved to /var/cache/conftool/dbconfig/20220905-103749-ladsgroup.json
[10:37:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/829204 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[10:37:52] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[10:39:40] <wikibugs>	 (03PS2) 10Stang: Re-download and optimize wordmark/tagline svg file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829760 (https://phabricator.wikimedia.org/T307705)
[10:41:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove sre.misc-clusters.sretest [cookbooks] - 10https://gerrit.wikimedia.org/r/829024 (owner: 10Muehlenhoff)
[10:41:56] <wikibugs>	 (03Merged) 10jenkins-bot: Fix environment in prep stage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan)
[10:43:18] <wikibugs>	 (03PS1) 10MVernon: swift: ms-be2037/sdg1 failed; ms-be2067/sdc1 fixed [puppet] - 10https://gerrit.wikimedia.org/r/829763 (https://phabricator.wikimedia.org/T314049)
[10:43:23] <wikibugs>	 (03PS1) 10Stang: Drop unused wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829764 (https://phabricator.wikimedia.org/T307705)
[10:44:25] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Fine by me :)" [puppet] - 10https://gerrit.wikimedia.org/r/820748 (owner: 10Muehlenhoff)
[10:44:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove sre.misc-clusters.sretest (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/829024 (owner: 10Muehlenhoff)
[10:45:41] <wikibugs>	 10SRE, 10Cloud-VPS (Project-requests), 10Patch-For-Review, 10cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (10jbond) >  one is e.g. Keith losing his VMs because of not knowing the context TBH i think this is the issue we need to reso...
[10:49:07] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:50:07] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:51:45] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:52:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P33781 and previous config saved to /var/cache/conftool/dbconfig/20220905-105255-ladsgroup.json
[10:52:59] <wikibugs>	 (03PS1) 10Hnowlan: Fix online-tests in blubber container [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/829786 (https://phabricator.wikimedia.org/T312104)
[10:55:19] <Emperor>	 !log set thanos ring replicas to 3.90 T311690
[10:55:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:22] <stashbot>	 T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690
[10:55:46] <wikibugs>	 (CR) Jbond: [C: -1] Add clean-stale-puppet-certs script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829321 (owner: Andrew Bogott)
[10:57:55] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[11:00:08] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/netbox-extras] - https://gerrit.wikimedia.org/r/829751 (https://phabricator.wikimedia.org/T314794) (owner: Volans)
[11:01:06] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/829754 (https://phabricator.wikimedia.org/T248161) (owner: Muehlenhoff)
[11:02:38] <wikibugs>	 (CR) Slavina Stefanova: bullseye0: Add bullseye buildpack build/run images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829031 (https://phabricator.wikimedia.org/T316854) (owner: David Caro)
[11:02:58] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: David Caro)
[11:03:51] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/829743 (owner: David Caro)
[11:04:42] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1003.eqiad.wmnet
[11:08:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P33782 and previous config saved to /var/cache/conftool/dbconfig/20220905-110801-ladsgroup.json
[11:12:19] <wikibugs>	 (PS1) Jbond: C:cpufrequtils: Add package dependencies [puppet] - https://gerrit.wikimedia.org/r/829791
[11:12:21] <wikibugs>	 (PS1) Jbond: C:cpufrequtils: update documentation [puppet] - https://gerrit.wikimedia.org/r/829792
[11:15:22] <wikibugs>	 (CR) Jbond: "See comments, i created a new CR in https://gerrit.wikimedia.org/r/c/operations/puppet/+/829791 with the suggestions" [puppet] - https://gerrit.wikimedia.org/r/828502 (owner: Clément Goubert)
[11:15:23] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=parsoid,name=parse1003.eqiad.wmnet
[11:15:39] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37116/console" [puppet] - https://gerrit.wikimedia.org/r/829791 (owner: Jbond)
[11:15:57] <logmsgbot>	 !log tstarling@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2142-2144].codfw.wmnet with reason: T316847 x2 failure test
[11:16:00] <stashbot>	 T316847: Production test of x2 failure modes - https://phabricator.wikimedia.org/T316847
[11:16:12] <logmsgbot>	 !log tstarling@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2142-2144].codfw.wmnet with reason: T316847 x2 failure test
[11:16:19] <claime>	 !log pooled parse1003.eqiad.wmnet (php 7.4 only) in parsoid cluster https://phabricator.wikimedia.org/T312638
[11:16:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:48] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] C:cpufrequtils: Add package dependencies [puppet] - https://gerrit.wikimedia.org/r/829791 (owner: Jbond)
[11:17:52] <wikibugs>	 (CR) Jbond: [C: +2] C:cpufrequtils: update documentation [puppet] - https://gerrit.wikimedia.org/r/829792 (owner: Jbond)
[11:18:14] <TimStarling>	 !log on db2142: stopped mariadb replication
[11:18:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T314041)', diff saved to https://phabricator.wikimedia.org/P33783 and previous config saved to /var/cache/conftool/dbconfig/20220905-112308-ladsgroup.json
[11:23:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[11:23:11] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[11:23:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[11:24:55] <claime>	 !log depooled wtp1036.eqiad.wmnet from parsoid cluster https://phabricator.wikimedia.org/T312638
[11:24:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:38] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1003.eqiad.wmnet
[11:27:38] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1003.eqiad.wmnet
[11:29:58] <TimStarling>	 !log on db2142: set master_delay=30 and restarted replication T316847
[11:30:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:01] <stashbot>	 T316847: Production test of x2 failure modes - https://phabricator.wikimedia.org/T316847
[11:30:01] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1004.eqiad.wmnet
[11:32:11] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[11:32:42] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1034-1036].eqiad.wmnet with reason: Downtiming replaced wtp servers
[11:32:47] <wikibugs>	 (CR) Roman Stolar: [C: +1] "Great!" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/829786 (https://phabricator.wikimedia.org/T312104) (owner: Hnowlan)
[11:32:57] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1034-1036].eqiad.wmnet with reason: Downtiming replaced wtp servers
[11:34:08] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1034.eqiad.wmnet
[11:34:16] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1035.eqiad.wmnet
[11:36:40] <claime>	 !log Set wtp103[4-5].eqiad.wmnet inactive pending decommission https://phabricator.wikimedia.org/T317025
[11:36:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:36] <TimStarling>	 !log on db2142: dropping inbound mysql traffic T316847
[11:37:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:38] <stashbot>	 T316847: Production test of x2 failure modes - https://phabricator.wikimedia.org/T316847
[11:39:59] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to archiva-deployers for ebysans - https://phabricator.wikimedia.org/T317030 (BTullis)
[11:40:52] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.16.0" for 584 hosts
[11:40:56] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Lumen link between cr2-eqiad and cr2-esams down - Sept 2022 - https://phabricator.wikimedia.org/T317009 (ayounsi) > LUMEN Subsea group performed a cold reset on a card in Bude England to restore service. I am seeing traffic up at this time.
[11:41:08] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox interface ID cr2-eqiad:xe-4/1/3
[11:41:10] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.16.0" completed for 584 hosts
[11:41:19] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox interface ID cr2-eqiad:xe-4/1/3
[11:41:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Lumen link between cr2-eqiad and cr2-esams down - Sept 2022 - https://phabricator.wikimedia.org/T317009 (ops-monitoring-bot) ===== Automated diagnostic for Netbox interface ID cr2-eqiad:xe-4/1/3 --- **Interface cr2-eqiad:xe-4/1/3** -  admin-status: up -  oper-...
[11:42:21] <wikibugs>	 (CR) Clément Goubert: "Dropping in favour of https://gerrit.wikimedia.org/r/c/operations/puppet/+/829791" [puppet] - https://gerrit.wikimedia.org/r/828502 (owner: Clément Goubert)
[11:42:27] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Lumen link between cr2-eqiad and cr2-esams down - Sept 2022 - https://phabricator.wikimedia.org/T317009 (ayounsi) Open→Resolved a:ayounsi
[11:42:35] <wikibugs>	 (Abandoned) Clément Goubert: C:cpufrequtils: Order install configuration and service [puppet] - https://gerrit.wikimedia.org/r/828502 (owner: Clément Goubert)
[11:43:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[11:43:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[11:43:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T312863)', diff saved to https://phabricator.wikimedia.org/P33784 and previous config saved to /var/cache/conftool/dbconfig/20220905-114352-ladsgroup.json
[11:43:55] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[11:44:58] <wikibugs>	 (PS2) Ayounsi: Enable pynetbox threading for DNS/Ganeti/Mgmt scripts [software/netbox-extras] - https://gerrit.wikimedia.org/r/828035 (https://phabricator.wikimedia.org/T311486)
[11:45:06] <wikibugs>	 (PS1) Hnowlan: Add script for automating joining a single node to the cluster [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/829807 (https://phabricator.wikimedia.org/T309619)
[11:46:33] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[11:47:41] <wikibugs>	 (CR) Muehlenhoff: [C: +2] puppet_compiler: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829200 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:47:45] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to archiva-deployers for ebysans - https://phabricator.wikimedia.org/T317030 (BTullis) I have applied this change. {F35509708,width=60%}
[11:51:11] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host karapace1001.eqiad.wmnet
[11:51:13] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on parse1004 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[11:52:54] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1004.eqiad.wmnet
[11:52:54] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1004.eqiad.wmnet
[11:53:52] <claime>	 !log pooled parse1004.eqiad.wmnet (php 7.4 only) in parsoid cluster T312638
[11:53:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:55] <stashbot>	 T312638: Parsoid migration to php 7.4 - https://phabricator.wikimedia.org/T312638
[11:55:04] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host karapace1001.eqiad.wmnet
[11:55:22] <TimStarling>	 !log on db2142: rejecting inbound mysql traffic T316847
[11:55:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:24] <stashbot>	 T316847: Production test of x2 failure modes - https://phabricator.wikimedia.org/T316847
[11:56:07] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[11:56:17] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse[1001-1004].eqiad.wmnet
[11:56:18] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[1001-1004].eqiad.wmnet
[11:59:09] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:01:14] <wikibugs>	 (CR) Jbond: "took another pass thanks" [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: Slyngshede)
[12:02:29] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:03:17] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[12:05:48] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] swift: ms-be2037/sdg1 failed; ms-be2067/sdc1 fixed [puppet] - https://gerrit.wikimedia.org/r/829763 (https://phabricator.wikimedia.org/T314049) (owner: MVernon)
[12:09:30] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1001.eqiad.wmnet
[12:10:00] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1007.eqiad.wmnet with OS bullseye
[12:10:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye exec...
[12:10:23] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1004.mgmt
[12:10:23] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1004.mgmt
[12:10:37] <logmsgbot>	 !log tstarling@cumin1001 START - Cookbook sre.hosts.remove-downtime for db[2142-2144].codfw.wmnet
[12:10:38] <logmsgbot>	 !log tstarling@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db[2142-2144].codfw.wmnet
[12:13:24] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1001.eqiad.wmnet
[12:14:10] <claime>	 !log depooled wtp1037.eqiad.wmnet from parsoid cluster T312638
[12:14:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:12] <stashbot>	 T312638: Parsoid migration to php 7.4 - https://phabricator.wikimedia.org/T312638
[12:14:29] <wikibugs>	 (CR) MVernon: [C: +2] swift: ms-be2037/sdg1 failed; ms-be2067/sdc1 fixed [puppet] - https://gerrit.wikimedia.org/r/829763 (https://phabricator.wikimedia.org/T314049) (owner: MVernon)
[12:16:40] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1005.eqiad.wmnet
[12:16:59] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1007.eqiad.wmnet with OS bullseye
[12:17:05] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye
[12:18:44] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1002.eqiad.wmnet
[12:20:01] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[12:20:19] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 18 hosts with reason: Downtime pending inclusion in production
[12:20:33] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 18 hosts with reason: Downtime pending inclusion in production
[12:22:30] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1002.eqiad.wmnet
[12:24:03] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1003.eqiad.wmnet
[12:25:44] <wikibugs>	 (PS1) David Caro: build: use the standard path to get the docker binary [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829811
[12:26:32] <wikibugs>	 (CR) David Caro: bullseye0: Add bullseye buildpack build/run images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829031 (https://phabricator.wikimedia.org/T316854) (owner: David Caro)
[12:31:20] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1005,parse1005.mgmt
[12:31:21] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1005,parse1005.mgmt
[12:31:55] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:33:26] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host datahubsearch1003.eqiad.wmnet
[12:38:39] <wikibugs>	 (PS1) Jcrespo: bacula: Add production and db storage hosts to the backup cluster [puppet] - https://gerrit.wikimedia.org/r/829813 (https://phabricator.wikimedia.org/T313582)
[12:47:39] <logmsgbot>	 !log btullis@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1007.eqiad.wmnet with OS bullseye
[12:47:43] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye exec...
[12:48:32] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1011.eqiad.wmnet with OS bullseye
[12:48:38] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye
[12:49:19] <wikibugs>	 (PS2) Jcrespo: bacula: Add production and db storage hosts to the backup cluster [puppet] - https://gerrit.wikimedia.org/r/829813 (https://phabricator.wikimedia.org/T313582)
[12:50:06] <wikibugs>	 (PS2) Ayounsi: Add FHRP group support to generate_dns_snippets [software/netbox-extras] - https://gerrit.wikimedia.org/r/826560 (https://phabricator.wikimedia.org/T311218)
[12:51:06] <wikibugs>	 (PS3) Jcrespo: bacula: Add production and db storage hosts to the backup cluster [puppet] - https://gerrit.wikimedia.org/r/829813 (https://phabricator.wikimedia.org/T313582)
[12:51:12] <wikibugs>	 (CR) Slavina Stefanova: bullseye0: Add bullseye buildpack build/run images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829031 (https://phabricator.wikimedia.org/T316854) (owner: David Caro)
[12:56:17] <icinga-wm>	 PROBLEM - Check systemd state on datahubsearch1003 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:58:31] <icinga-wm>	 RECOVERY - Check systemd state on datahubsearch1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220905T1300).
[13:00:05] <jouncebot>	 Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:23] <urbanecm>	 o/
[13:00:54] <Tchanders>	 Hi
[13:01:20] <urbanecm>	 hi Tchanders. I can deploy today :)
[13:01:22] <wikibugs>	 (PS2) Urbanecm: Enable partial action blocks on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/829012 (https://phabricator.wikimedia.org/T315525) (owner: Tchanders)
[13:01:25] <wikibugs>	 (CR) Urbanecm: [C: +2] Enable partial action blocks on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/829012 (https://phabricator.wikimedia.org/T315525) (owner: Tchanders)
[13:01:34] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1011.eqiad.wmnet with reason: host reimage
[13:02:17] <wikibugs>	 (Merged) jenkins-bot: Enable partial action blocks on fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/829012 (https://phabricator.wikimedia.org/T315525) (owner: Tchanders)
[13:03:35] <urbanecm>	 Tchanders: your patch is at mwdebug1001. can you check?
[13:03:46] <Tchanders>	 urbanecm: Thanks, testing...
[13:03:51] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:05:10] <Tchanders>	 urbanecm: Looks good to me
[13:05:12] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1011.eqiad.wmnet with reason: host reimage
[13:05:15] <urbanecm>	 thanks, syncing
[13:06:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:07:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:07:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:07:36] <moritzm>	 !log disabling puppet in codfw and the edges temporarily
[13:07:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:23] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[13:08:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:08:45] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:09:07] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: edbcee4d9a901ce475ebcc53e4c4bc18e04bc2b8: Enable partial action blocks on fawiki (T315525) (duration: 03m 34s)
[13:09:11] <stashbot>	 T315525: Deploy action blocks to pilot wikis - https://phabricator.wikimedia.org/T315525
[13:09:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[13:09:14] <urbanecm>	 Tchanders: and, should be live
[13:09:16] <urbanecm>	 anything else?
[13:09:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[13:09:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T314041)', diff saved to https://phabricator.wikimedia.org/P33785 and previous config saved to /var/cache/conftool/dbconfig/20220905-130944-ladsgroup.json
[13:09:45] <Tchanders>	 urbanecm: Thank you, as always! All good now
[13:09:49] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[13:09:49] <urbanecm>	 okay!
[13:09:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (BTullis) @Cmjohnson - sorry to trouble you with this old ticket, but I'm having an issue with three of these new an-presto hosts.  * an-presto...
[13:09:59] <urbanecm>	 !log UTC afternoon B&C window done
[13:10:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 0:15:00 on puppetdb2002.codfw.wmnet with reason: Temporarily stop puppetdb
[13:11:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on puppetdb2002.codfw.wmnet with reason: Temporarily stop puppetdb
[13:13:02] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1011.eqiad.wmnet with OS bullseye
[13:13:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-presto1011.eqiad.wmnet with OS bullseye exec...
[13:14:55] <icinga-wm>	 PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 275 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[13:16:25] <icinga-wm>	 RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1135512 bytes in 5.848 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard
[13:17:49] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:18:47] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[13:22:07] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[13:25:36] <vgutierrez>	 uh
[13:27:32] <wikibugs>	 (CR) Muehlenhoff: [C: +2] keyholder: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/829201 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[13:30:58] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (Volans) Great, thanks for checking it @jbond
[13:31:13] <addshore>	 !log wdqs1009 sudo systemctl stop wdqs-blazegraph.service
[13:31:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:33] <wikibugs>	 (CR) Volans: [C: +2] ganeti-netbox-sync: fix dry-run behaviour [software/netbox-extras] - https://gerrit.wikimedia.org/r/829751 (https://phabricator.wikimedia.org/T314794) (owner: Volans)
[13:33:04] <wikibugs>	 (Merged) jenkins-bot: ganeti-netbox-sync: fix dry-run behaviour [software/netbox-extras] - https://gerrit.wikimedia.org/r/829751 (https://phabricator.wikimedia.org/T314794) (owner: Volans)
[13:34:25] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[13:36:18] <wikibugs>	 (PS3) Volans: Enable pynetbox threading for DNS/Ganeti/Mgmt scripts [software/netbox-extras] - https://gerrit.wikimedia.org/r/828035 (https://phabricator.wikimedia.org/T311486) (owner: Ayounsi)
[13:36:24] <wikibugs>	 (CR) Volans: [C: +2] Enable pynetbox threading for DNS/Ganeti/Mgmt scripts [software/netbox-extras] - https://gerrit.wikimedia.org/r/828035 (https://phabricator.wikimedia.org/T311486) (owner: Ayounsi)
[13:37:25] <wikibugs>	 (Merged) jenkins-bot: Enable pynetbox threading for DNS/Ganeti/Mgmt scripts [software/netbox-extras] - https://gerrit.wikimedia.org/r/828035 (https://phabricator.wikimedia.org/T311486) (owner: Ayounsi)
[13:38:31] <wikibugs>	 (CR) Jbond: "LGTM but some minor issues" [puppet] - https://gerrit.wikimedia.org/r/829016 (owner: Giuseppe Lavagetto)
[13:39:22] <wikibugs>	 (CR) Elukey: [C: +1] Mark several access groups as deprecated [puppet] - https://gerrit.wikimedia.org/r/829754 (https://phabricator.wikimedia.org/T248161) (owner: Muehlenhoff)
[13:41:00] <wikibugs>	 (CR) Jbond: [C: -1] gitlab: add gitlab::release::binary [puppet] - https://gerrit.wikimedia.org/r/829016 (owner: Giuseppe Lavagetto)
[13:41:47] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[13:47:30] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Mark several access groups as deprecated [puppet] - https://gerrit.wikimedia.org/r/829754 (https://phabricator.wikimedia.org/T248161) (owner: Muehlenhoff)
[13:48:28] <claime>	 !log pooled parse1005.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219
[13:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:33] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[13:50:19] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1 C: +2] api_appserver: convert all canaries to php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829217 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[13:50:29] <wikibugs>	 (PS1) Volans: tools/ganeti-netbox-sync: additional dry-run fix [software/netbox-extras] - https://gerrit.wikimedia.org/r/829817 (https://phabricator.wikimedia.org/T314794)
[13:51:49] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[13:52:07] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:52:16] <wikibugs>	 (CR) CI reject: [V: -1] tools/ganeti-netbox-sync: additional dry-run fix [software/netbox-extras] - https://gerrit.wikimedia.org/r/829817 (https://phabricator.wikimedia.org/T314794) (owner: Volans)
[13:54:26] <wikibugs>	 (CR) Ayounsi: [C: +1] tools/ganeti-netbox-sync: additional dry-run fix [software/netbox-extras] - https://gerrit.wikimedia.org/r/829817 (https://phabricator.wikimedia.org/T314794) (owner: Volans)
[14:01:33] <claime>	 !log depooled wtp1037.eqiad.wmnet from parsoid cluster T307219
[14:01:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:36] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[14:02:04] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[14:02:47] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[14:11:16] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1006.eqiad.wmnet
[14:13:34] <wikibugs>	 (PS2) Volans: tools/ganeti-netbox-sync: additional dry-run fix [software/netbox-extras] - https://gerrit.wikimedia.org/r/829817 (https://phabricator.wikimedia.org/T314794)
[14:13:45] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[14:19:59] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] p::wmcs:prometheus: Add cloudvps federation job [puppet] - https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) (owner: David Caro)
[14:20:06] <wikibugs>	 (CR) Volans: [C: +2] tools/ganeti-netbox-sync: additional dry-run fix [software/netbox-extras] - https://gerrit.wikimedia.org/r/829817 (https://phabricator.wikimedia.org/T314794) (owner: Volans)
[14:21:01] <wikibugs>	 (Merged) jenkins-bot: tools/ganeti-netbox-sync: additional dry-run fix [software/netbox-extras] - https://gerrit.wikimedia.org/r/829817 (https://phabricator.wikimedia.org/T314794) (owner: Volans)
[14:21:55] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1006,parse1006.mgmt
[14:21:55] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1006,parse1006.mgmt
[14:22:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T314041)', diff saved to https://phabricator.wikimedia.org/P33786 and previous config saved to /var/cache/conftool/dbconfig/20220905-142240-ladsgroup.json
[14:22:42] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[14:22:57] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] canary_appserver: use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/829550 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto)
[14:23:39] <claime>	 !log pooled parse1006.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219
[14:23:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:42] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[14:26:20] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[14:26:38] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[14:28:46] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[14:28:58] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[14:29:56] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[14:29:58] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[14:30:26] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[14:30:27] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[14:32:23] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:33:34] <claime>	 !log depooled wtp1039.eqiad.wmnet from parsoid cluster T307219
[14:33:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:38] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[14:34:57] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:37:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P33788 and previous config saved to /var/cache/conftool/dbconfig/20220905-143746-ladsgroup.json
[14:40:42] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1036-1038].eqiad.wmnet with reason: Downtiming replace wtp servers
[14:40:57] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1036-1038].eqiad.wmnet with reason: Downtiming replace wtp servers
[14:42:00] <wikibugs>	 (PS1) Btullis: Add an istio custom deploy configuration for dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/829822 (https://phabricator.wikimedia.org/T310175)
[14:42:08] <wikibugs>	 (CR) FNegri: "If you don't see any downsides, I would suggest rebasing this change on the main branch, combining this patch with https://gerrit.wikimedi" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829116 (owner: David Caro)
[14:46:43] <godog>	 !add 100G to prometheus codfw / global instance
[14:46:46] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1036.eqiad.wmnet
[14:47:03] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1037.eqiad.wmnet
[14:48:11] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[14:48:51] <claime>	 !log Set wtp103[6-7].eqiad.wmnet inactive pending decommission T317025
[14:48:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:54] <stashbot>	 T317025: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025
[14:50:56] <wikibugs>	 (CR) Majavah: [C: -1] "This won't work as cloudmetrics* hosts are in the production private network and so can't access cloud vps endpoints directly" [puppet] - https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) (owner: David Caro)
[14:50:58] <wikibugs>	 (PS2) Btullis: Add an istio custom deploy configuration for dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/829822 (https://phabricator.wikimedia.org/T310175)
[14:52:40] <wikibugs>	 (CR) Majavah: [C: -1] p::metricsinfra:haproxy: Allow exposing federation endpoints (3 comments) [puppet] - https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: David Caro)
[14:52:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P33789 and previous config saved to /var/cache/conftool/dbconfig/20220905-145252-ladsgroup.json
[14:53:05] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[14:53:23] <wikibugs>	 (CR) Hnowlan: [C: +2] Fix online-tests in blubber container [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/829786 (https://phabricator.wikimedia.org/T312104) (owner: Hnowlan)
[15:02:55] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[15:04:36] <moritzm>	 !log updating docker.io on gitlab-runners
[15:04:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:46] <wikibugs>	 (Merged) jenkins-bot: Fix online-tests in blubber container [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/829786 (https://phabricator.wikimedia.org/T312104) (owner: Hnowlan)
[15:06:33] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[15:07:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T314041)', diff saved to https://phabricator.wikimedia.org/P33790 and previous config saved to /var/cache/conftool/dbconfig/20220905-150758-ladsgroup.json
[15:08:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[15:08:02] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[15:08:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[15:08:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[15:08:27] <wikibugs>	 SRE, ops-codfw: Degraded RAID on logstash2027 - https://phabricator.wikimedia.org/T316996 (colewhite) p:Triage→High The cluster will remain in a degraded state until replacements are installed.  Please replace the failed disks as soon as possible.  Thanks!
[15:08:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[15:08:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T314041)', diff saved to https://phabricator.wikimedia.org/P33791 and previous config saved to /var/cache/conftool/dbconfig/20220905-150837-ladsgroup.json
[15:09:17] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1007.eqiad.wmnet
[15:09:43] <icinga-wm>	 RECOVERY - memcached socket on parse1007 is OK: TCP OK - 0.000 second response time on socket /run/memcached/memcached.sock https://wikitech.wikimedia.org/wiki/Memcached
[15:15:15] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.253 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:16:53] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1007,parse1007.mgmt
[15:16:53] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1007,parse1007.mgmt
[15:17:24] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Discovery-Search (Current work), Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (pfischer) a:Jelto→pfischer
[15:17:37] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:17:49] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/826536 (owner: Majavah)
[15:18:26] <wikibugs>	 (CR) Muehlenhoff: [C: +2] cadvisor_exporter: Remove check fo Stretch [puppet] - https://gerrit.wikimedia.org/r/820748 (owner: Muehlenhoff)
[15:19:07] <claime>	 !log pooled parse1007.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219
[15:19:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:11] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[15:19:53] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, User-jbond: Add a deprecated flag to admin groups - https://phabricator.wikimedia.org/T248161 (jbond) Open→Resolved implmented
[15:23:21] <wikibugs>	 (PS1) Muehlenhoff: Remove peek-admins grup [puppet] - https://gerrit.wikimedia.org/r/829828
[15:27:39] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[15:28:24] <claime>	 !log depooled wtp1040.eqiad.wmnet from parsoid cluster T307219
[15:28:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:27] <stashbot>	 T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219
[15:29:37] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/826560 (https://phabricator.wikimedia.org/T311218) (owner: Ayounsi)
[15:30:04] <jouncebot>	 jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220905T1530).
[15:30:48] <moritzm>	 !log installing apache2 security updates
[15:30:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:51] <logmsgbot>	 !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1038.eqiad.wmnet
[15:33:12] <wikibugs>	 (CR) FNegri: [C: +1] "The combined diff of this patch and 829031 is minimal (it's basically only s/buster/bullseye/) and I verified I can build the new image us" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829116 (owner: David Caro)
[15:33:58] <wikibugs>	 (CR) FNegri: [C: +1] "Works fine on my Mac" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829811 (owner: David Caro)
[15:36:12] <wikibugs>	 (PS1) Volans: ganeti-netbox-sync: fail on missing cluster group [software/netbox-extras] - https://gerrit.wikimedia.org/r/829830
[15:41:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:46:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:46:14] <wikibugs>	 (CR) Raymond Ndibe: [C: +1] "not explicitly tested but shutil.which("docker") returns expected value on my machine so +1" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829811 (owner: David Caro)
[15:48:26] <wikibugs>	 (CR) FNegri: "LGTM, I only have a small question (see inline comment). What would be the best way to test this? What is a scenario where you want to "ch" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829218 (https://phabricator.wikimedia.org/T316854) (owner: David Caro)
[15:52:04] <wikibugs>	 (PS4) David Caro: p::metricsinfra:haproxy: move to epp template [puppet] - https://gerrit.wikimedia.org/r/829743
[15:52:06] <wikibugs>	 (PS5) David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031)
[15:52:08] <wikibugs>	 (CR) David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints (3 comments) [puppet] - https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: David Caro)
[15:52:10] <wikibugs>	 (PS3) David Caro: p::wmcs:prometheus: Add cloudvps federation job [puppet] - https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982)
[15:52:12] <wikibugs>	 (CR) David Caro: p::wmcs:prometheus: Add cloudvps federation job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) (owner: David Caro)
[15:55:40] <wikibugs>	 (CR) David Caro: bullseye0: Improve the install-packages script (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829218 (https://phabricator.wikimedia.org/T316854) (owner: David Caro)
[15:59:39] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[16:04:16] <wikibugs>	 (CR) David Caro: Remove buster0 buildpacks images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829116 (owner: David Caro)
[16:05:00] <wikibugs>	 (PS2) David Caro: bullseye0: Improve the install-packages script [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829218 (https://phabricator.wikimedia.org/T316854)
[16:05:02] <wikibugs>	 (CR) David Caro: bullseye0: Improve the install-packages script (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829218 (https://phabricator.wikimedia.org/T316854) (owner: David Caro)
[16:05:05] <wikibugs>	 (PS2) David Caro: build: use the standard path to get the docker binary [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829811
[16:12:45] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:17:05] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:05] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_navigationtiming_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:39] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: sync on main
[16:27:14] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[16:29:23] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[16:30:15] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:32:25] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:39:14] <wikibugs>	 (CR) Ayounsi: [C: +2] Add FHRP group support to generate_dns_snippets [software/netbox-extras] - https://gerrit.wikimedia.org/r/826560 (https://phabricator.wikimedia.org/T311218) (owner: Ayounsi)
[16:44:16] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (ayounsi)
[16:44:57] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (ayounsi)
[16:45:52] <wikibugs>	 (CR) Ayounsi: [C: +1] ganeti-netbox-sync: fail on missing cluster group [software/netbox-extras] - https://gerrit.wikimedia.org/r/829830 (owner: Volans)
[16:47:11] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:00:04] <jouncebot>	 ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220905T1700)
[17:00:41] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:09:25] <wikibugs>	 (CR) Hashar: Json schema from Gerrit Java event classes (6 comments) [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: Hashar)
[17:09:33] <wikibugs>	 (PS4) Hashar: Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947)
[17:16:39] <wikibugs>	 (CR) Hashar: "Instead of having all the java classes in the same directory, I have split them in:" [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: Hashar)
[17:16:45] <wikibugs>	 (CR) FNegri: [C: +1] bullseye0: Improve the install-packages script [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829218 (https://phabricator.wikimedia.org/T316854) (owner: David Caro)
[17:17:45] <wikibugs>	 (CR) Hashar: "The coverage report yields:" [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: Hashar)
[17:20:28] <wikibugs>	 (CR) FNegri: [C: +1] Remove buster0 buildpacks images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829116 (owner: David Caro)
[17:29:02] <wikibugs>	 (CR) Volans: [C: +1] "post-merge +1" [software/spicerack] - https://gerrit.wikimedia.org/r/818114 (owner: Jbond)
[17:29:13] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:59] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:35:51] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[17:37:08] <wikibugs>	 (PS1) Volans: Simplify cumin query in comment for confd [dns] - https://gerrit.wikimedia.org/r/829856 (https://phabricator.wikimedia.org/T314489)
[17:39:31] <wikibugs>	 (PS5) Volans: cli: Add ability to override the amount of retries and backoffs [software/debmonitor] - https://gerrit.wikimedia.org/r/812556 (owner: Jbond)
[17:39:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T312863)', diff saved to https://phabricator.wikimedia.org/P33792 and previous config saved to /var/cache/conftool/dbconfig/20220905-173951-ladsgroup.json
[17:39:55] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[17:41:52] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/debmonitor] - https://gerrit.wikimedia.org/r/812556 (owner: Jbond)
[17:51:31] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.092 second response time https://wikitech.wikimedia.org/wiki/Swift
[17:53:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance
[17:53:53] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Swift
[17:54:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance
[17:54:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T314041)', diff saved to https://phabricator.wikimedia.org/P33793 and previous config saved to /var/cache/conftool/dbconfig/20220905-175423-ladsgroup.json
[17:54:26] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[17:54:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P33794 and previous config saved to /var/cache/conftool/dbconfig/20220905-175457-ladsgroup.json
[17:56:25] <wikibugs>	 (PS1) Ladsgroup: Improvements on css [software/pampinus] - https://gerrit.wikimedia.org/r/829858
[17:59:44] <wikibugs>	 (PS3) Ayounsi: Spicerack: add configuration file and API key for PeeringDB [puppet] - https://gerrit.wikimedia.org/r/819562
[18:04:39] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/819568 (owner: Ayounsi)
[18:07:23] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:10:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P33795 and previous config saved to /var/cache/conftool/dbconfig/20220905-181003-ladsgroup.json
[18:13:52] <wikibugs>	 (PS4) AOkoth: vrts: install vrts script [puppet] - https://gerrit.wikimedia.org/r/828673
[18:19:17] <wikibugs>	 (PS5) AOkoth: vrts: install vrts script [puppet] - https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059)
[18:23:14] <wikibugs>	 (CR) CI reject: [V: -1] vrts: install vrts script [puppet] - https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: AOkoth)
[18:24:06] <wikibugs>	 (CR) AOkoth: vrts: install vrts script (5 comments) [puppet] - https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: AOkoth)
[18:25:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T312863)', diff saved to https://phabricator.wikimedia.org/P33796 and previous config saved to /var/cache/conftool/dbconfig/20220905-182510-ladsgroup.json
[18:25:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[18:25:13] <stashbot>	 T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[18:25:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[18:25:20] <wikibugs>	 (PS6) AOkoth: vrts: install vrts script [puppet] - https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059)
[18:30:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 10%: Maint needs to be redone', diff saved to https://phabricator.wikimedia.org/P33797 and previous config saved to /var/cache/conftool/dbconfig/20220905-183017-ladsgroup.json
[18:31:05] <wikibugs>	 (CR) AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1002/37120/" [puppet] - https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: AOkoth)
[18:45:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Maint needs to be redone', diff saved to https://phabricator.wikimedia.org/P33798 and previous config saved to /var/cache/conftool/dbconfig/20220905-184522-ladsgroup.json
[19:00:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Maint needs to be redone', diff saved to https://phabricator.wikimedia.org/P33799 and previous config saved to /var/cache/conftool/dbconfig/20220905-190027-ladsgroup.json
[19:15:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Maint needs to be redone', diff saved to https://phabricator.wikimedia.org/P33800 and previous config saved to /var/cache/conftool/dbconfig/20220905-191532-ladsgroup.json
[19:19:39] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.088 second response time https://wikitech.wikimedia.org/wiki/Swift
[19:21:59] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Swift
[19:25:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[19:25:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[19:25:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[19:25:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[19:25:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T314041)', diff saved to https://phabricator.wikimedia.org/P33801 and previous config saved to /var/cache/conftool/dbconfig/20220905-192554-ladsgroup.json
[19:25:57] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[19:50:29] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, and TheresNoTime: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220905T2000).
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:00:15] <urbanecm>	 indeed, nothing to do
[20:18:35] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.081 second response time https://wikitech.wikimedia.org/wiki/Swift
[20:20:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift
[20:24:53] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:32:39] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[20:38:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T314041)', diff saved to https://phabricator.wikimedia.org/P33802 and previous config saved to /var/cache/conftool/dbconfig/20220905-203824-ladsgroup.json
[20:38:28] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[20:49:12] <wikibugs>	 (PS8) Hashar: Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814807
[20:53:15] <wikibugs>	 SRE, Traffic, affects-Kiwix-and-openZIM: HTTP 500 against api.php?action=parse API on tr.wikipedia.org - https://phabricator.wikimedia.org/T317011 (Platonides) I suspect it's a timeout at Varnish level, and it then got cached somewhere. I was always getting 200, even when asking every dc:  ` for dc i...
[20:53:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P33803 and previous config saved to /var/cache/conftool/dbconfig/20220905-205330-ladsgroup.json
[20:55:39] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.288 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: #bothumor I ♥ Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220905T2100).
[21:00:29] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:01:15] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.247 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:03:33] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:08:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P33804 and previous config saved to /var/cache/conftool/dbconfig/20220905-210837-ladsgroup.json
[21:23:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T314041)', diff saved to https://phabricator.wikimedia.org/P33805 and previous config saved to /var/cache/conftool/dbconfig/20220905-212343-ladsgroup.json
[21:23:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[21:23:47] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[21:24:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[21:24:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T314041)', diff saved to https://phabricator.wikimedia.org/P33806 and previous config saved to /var/cache/conftool/dbconfig/20220905-212415-ladsgroup.json
[21:25:18] <wikibugs>	 (CR) Volans: [C: +2] ganeti-netbox-sync: fail on missing cluster group [software/netbox-extras] - https://gerrit.wikimedia.org/r/829830 (owner: Volans)
[21:26:01] <wikibugs>	 (Merged) jenkins-bot: ganeti-netbox-sync: fail on missing cluster group [software/netbox-extras] - https://gerrit.wikimedia.org/r/829830 (owner: Volans)
[21:29:05] <wikibugs>	 SRE, MediaWiki-General, Traffic-Icebox, MW-1.39-notes (1.39.0-wmf.25; 2022-08-15), Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (Legoktm) It seems like page history caches are not being invalidated properly, which I su...
[21:30:06] <wikibugs>	 (PS1) Volans: ganeti-netbox-sync: add missing space in exception [software/netbox-extras] - https://gerrit.wikimedia.org/r/829864
[21:30:22] <wikibugs>	 (CR) Volans: [C: +2] "Just adding a space, self-merging" [software/netbox-extras] - https://gerrit.wikimedia.org/r/829864 (owner: Volans)
[21:31:07] <wikibugs>	 (Merged) jenkins-bot: ganeti-netbox-sync: add missing space in exception [software/netbox-extras] - https://gerrit.wikimedia.org/r/829864 (owner: Volans)
[21:32:24] <wikibugs>	 SRE, MediaWiki-Page-history, Traffic, Regression: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (Legoktm)
[21:39:27] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (25) node(s) change every puppet run: an-presto1011, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, ms-be1071, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://w
[21:39:27] <icinga-wm>	 wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[21:54:33] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:57:12] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, just couple of typos inline" [cookbooks] - https://gerrit.wikimedia.org/r/818111 (owner: Jbond)
[22:03:27] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (25) node(s) change every puppet run: an-presto1011, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, ms-be1071, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://w
[22:03:27] <icinga-wm>	 wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[22:14:09] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.246 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:14:51] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:14:53] <wikibugs>	 (CR) Volans: "post-merge FYI comment" [cookbooks] - https://gerrit.wikimedia.org/r/823704 (https://phabricator.wikimedia.org/T315360) (owner: Ryan Kemper)
[22:16:31] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:08] <wikibugs>	 (Abandoned) Volans: admin: add sre-admins to the check for ops [puppet] - https://gerrit.wikimedia.org/r/818061 (owner: Volans)
[22:22:09] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:36:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T314041)', diff saved to https://phabricator.wikimedia.org/P33807 and previous config saved to /var/cache/conftool/dbconfig/20220905-223657-ladsgroup.json
[22:37:00] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[22:52:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P33808 and previous config saved to /var/cache/conftool/dbconfig/20220905-225203-ladsgroup.json
[22:55:55] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:07:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P33809 and previous config saved to /var/cache/conftool/dbconfig/20220905-230709-ladsgroup.json
[23:22:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T314041)', diff saved to https://phabricator.wikimedia.org/P33810 and previous config saved to /var/cache/conftool/dbconfig/20220905-232216-ladsgroup.json
[23:22:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[23:22:19] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[23:22:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[23:22:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T314041)', diff saved to https://phabricator.wikimedia.org/P33811 and previous config saved to /var/cache/conftool/dbconfig/20220905-232237-ladsgroup.json