[00:00:30] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:56] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:01] <rzl>	 !refresh-topic
[00:09:26] <rzl>	 man I get baited by that every time
[00:09:28] <rzl>	 !oncall-now
[00:09:28] <sirenbot>	 Oncall now for team SRE, rotation batphone:
[00:09:28] <sirenbot>	 m.utante, j.hathaway, c.white, s.lyngs, l.mata, h.erron, r.zl, b.black, c.danis, s.ukhe, i.nflatador, v.olans, r.obh, u.random
[00:09:47] <rzl>	 cool, next auto update should fix the topic
[00:10:40] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:12] <icinga-wm>	 RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[00:15:38] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:19:16] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:06] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:04] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:55:55] <wikibugs>	 (CR) Cwhite: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/884111 (https://phabricator.wikimedia.org/T325806) (owner: Cwhite)
[01:00:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:03:35] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T328420 (phaultfinder)
[01:05:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:16] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:45:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:55:25] <wikibugs>	 (PS1) Sharvaniharan: Fix android session schema path [mediawiki-config] - https://gerrit.wikimedia.org/r/886995
[01:57:25] <wikibugs>	 (CR) Sharvaniharan: "Please review when you get a chance. Minor path change." [mediawiki-config] - https://gerrit.wikimedia.org/r/886995 (owner: Sharvaniharan)
[01:58:50] <wikibugs>	 (CR) Sharvaniharan: "Hi @Ottomata... Is it enough to just change the path and get it deployed again, or will I need to add a new entry?" [mediawiki-config] - https://gerrit.wikimedia.org/r/886995 (owner: Sharvaniharan)
[02:03:38] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:10:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:17] <wikibugs>	 (PS1) Bartosz Dziewoński: Add "Page Frame" to DiscussionTools beta feature on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/886997 (https://phabricator.wikimedia.org/T327456)
[02:20:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T0300)
[03:07:49] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.22 [core] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/886849 (https://phabricator.wikimedia.org/T325585)
[03:07:53] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.22 [core] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/886849 (https://phabricator.wikimedia.org/T325585) (owner: TrainBranchBot)
[03:22:45] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.22 [core] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/886849 (https://phabricator.wikimedia.org/T325585) (owner: TrainBranchBot)
[03:24:56] <wikibugs>	 (PS7) Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729)
[03:26:46] <wikibugs>	 (CR) CI reject: [V: -1] Split cinder-volume.conf out from cinder.conf [puppet] - https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) (owner: Andrew Bogott)
[03:30:36] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:33:09] <wikibugs>	 (PS8) Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729)
[03:35:00] <wikibugs>	 (CR) jenkins-bot: Split cinder-volume.conf out from cinder.conf [puppet] - https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) (owner: Andrew Bogott)
[03:36:00] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:40:21] <wikibugs>	 (PS9) Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729)
[03:42:09] <wikibugs>	 (CR) CI reject: [V: -1] Split cinder-volume.conf out from cinder.conf [puppet] - https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) (owner: Andrew Bogott)
[03:44:57] <wikibugs>	 (PS10) Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729)
[03:45:34] <wikibugs>	 (CR) Andrew Bogott: [C: +2] puppet-enc.py: remove a newline to make Black happy [puppet] - https://gerrit.wikimedia.org/r/886898 (owner: Andrew Bogott)
[04:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T0400)
[04:01:28] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/887001 (https://phabricator.wikimedia.org/T325585)
[04:01:30] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/887001 (https://phabricator.wikimedia.org/T325585) (owner: TrainBranchBot)
[04:02:08] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/887001 (https://phabricator.wikimedia.org/T325585) (owner: TrainBranchBot)
[04:02:37] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.22  refs T325585
[04:02:56] <stashbot>	 T325585: 1.40.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T325585
[04:08:52] <icinga-wm>	 PROBLEM - Disk space on deploy1002 is CRITICAL: DISK CRITICAL - free space: /srv 13208 MB (4% inode=77%): /srv/docker/overlay2/4e3cf33c6b5d21c9736e6bffc3ee5015324fa3d16a380e781af0d2df28ac71f8/merged 13208 MB (4% inode=77%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1002&var-datasource=eqiad+prometheus/ops
[04:30:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:35:30] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:55:48] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.22  refs T325585 (duration: 53m 11s)
[04:56:07] <stashbot>	 T325585: 1.40.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T325585
[04:58:10] <logmsgbot>	 !log mwpresync@deploy1002 Pruned MediaWiki: 1.40.0-wmf.20 (duration: 02m 20s)
[05:04:12] <wikibugs>	 SRE, SRE-swift-storage, Commons, MediaWiki-File-management, and 3 others: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (Sreeji...
[05:15:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:20:20] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:34:54] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:34:58] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:38:06] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:39:46] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:40:14] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.303 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:40:16] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49566 bytes in 0.215 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:41:04] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:45:18] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:46] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:05:48] <wikibugs>	 ops-codfw, DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (Marostegui) @Jhancock.wm can you confirm the server is meant to be off? I just tried to access it but I can't.
[06:14:12] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1164 to m3 master [puppet] - https://gerrit.wikimedia.org/r/886904 (https://phabricator.wikimedia.org/T328404) (owner: Marostegui)
[06:18:17] <wikibugs>	 (PS1) Marostegui: db1164: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/887009 (https://phabricator.wikimedia.org/T328404)
[06:18:27] <wikibugs>	 (CR) Marostegui: [C: -2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/887009 (https://phabricator.wikimedia.org/T328404) (owner: Marostegui)
[06:22:28] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:22:36] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:28:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1187', diff saved to https://phabricator.wikimedia.org/P43757 and previous config saved to /var/cache/conftool/dbconfig/20230207-062826-root.json
[06:30:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:31:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2110 in API', diff saved to https://phabricator.wikimedia.org/P43758 and previous config saved to /var/cache/conftool/dbconfig/20230207-063147-root.json
[06:35:24] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:44:23] <wikibugs>	 (PS1) Alexandros Kosiaris: Include wiktionary in page/html rerenders [deployment-charts] - https://gerrit.wikimedia.org/r/887082 (https://phabricator.wikimedia.org/T226931)
[06:55:25] <marostegui>	 I am going to switch phabricator to read only for a minute in 5 minutes to complete https://phabricator.wikimedia.org/T328404
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T0700)
[07:00:05] <jouncebot>	 kormat, marostegui, and Amir1: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T0700)
[07:00:06] <marostegui>	 !log Failover m3 from db1159 to db1164 - T328404
[07:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:10] <stashbot>	 T328404: Switchover m3 master db1159 -> db1164 - https://phabricator.wikimedia.org/T328404
[07:00:38] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:02:19] <marostegui>	 This was done, read only time was 20 seconds
[07:03:20] <wikibugs>	 (CR) Marostegui: [C: +2] db1164: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/887009 (https://phabricator.wikimedia.org/T328404) (owner: Marostegui)
[07:05:44] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:05:50] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:06:04] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:13:37] <wikibugs>	 (CR) Ayounsi: Add BGP community to all k8s advertisments (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: Ayounsi)
[07:15:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:18:37] <wikibugs>	 (PS6) Ayounsi: Add BGP community to all k8s advertisments [deployment-charts] - https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523)
[07:18:56] <wikibugs>	 (PS11) Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661)
[07:20:28] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:57:52] <wikibugs>	 (CR) JMeybohm: [C: -1] Add BGP community to all k8s advertisments (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: Ayounsi)
[08:00:04] <jouncebot>	 Amir1 and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T0800).
[08:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:00:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:05:11] <wikibugs>	 SRE-swift-storage, Commons, Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (Yann) Again `00242: FAILED: internal_api_error_UploadChunkFileException: [3ef1b160-5844-46c6-9ddd-333c27...
[08:05:26] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:12:08] <kostajh>	 hi, I have a config patch to deploy, will add to the calendar now
[08:13:08] <wikibugs>	 (PS2) Kosta Harlan: GrowthExperiments: Disable leveling up features in production [mediawiki-config] - https://gerrit.wikimedia.org/r/886343 (https://phabricator.wikimedia.org/T328757)
[08:14:39] <logmsgbot>	 !log kharlan@deploy1002 backport aborted:  (duration: 00m 07s)
[08:15:05] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/886343 (https://phabricator.wikimedia.org/T328757) (owner: Kosta Harlan)
[08:15:54] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: Disable leveling up features in production [mediawiki-config] - https://gerrit.wikimedia.org/r/886343 (https://phabricator.wikimedia.org/T328757) (owner: Kosta Harlan)
[08:16:39] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:886343|GrowthExperiments: Disable leveling up features in production (T328757)]]
[08:16:42] <stashbot>	 T328757: Leveling up: Define feature flag for gating the functionality - https://phabricator.wikimedia.org/T328757
[08:18:30] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Backport for [[gerrit:886343|GrowthExperiments: Disable leveling up features in production (T328757)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[08:18:36] <wikibugs>	 (PS8) Volans: Add sre.discovery.datacenter-route [cookbooks] - https://gerrit.wikimedia.org/r/886038 (owner: Giuseppe Lavagetto)
[08:19:17] <wikibugs>	 (PS1) Kosta Harlan: labs: Remove unneeded GEUseNewImpactModule feature flag [mediawiki-config] - https://gerrit.wikimedia.org/r/887271
[08:20:31] <wikibugs>	 (PS7) Ayounsi: Add BGP community to all k8s advertisments [deployment-charts] - https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523)
[08:21:28] <kostajh>	 hmm, this is new `Check 'Check endpoints for mw1416.eqiad.wmnet' failed: /wiki/{title} (Special Version) is CRITICAL: Test Special Version returned the unexpected status 500 (expecting: 200)`
[08:21:30] <wikibugs>	 (CR) CI reject: [V: -1] Add BGP community to all k8s advertisments [deployment-charts] - https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: Ayounsi)
[08:21:45] <kostajh>	 ^ cc Amir1 urbanecm if you're around
[08:22:09] <kostajh>	 otherwise, `sync-check-canaries` finished without incident
[08:23:05] <wikibugs>	 (PS8) Ayounsi: Add BGP community to all k8s advertisments [deployment-charts] - https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523)
[08:24:24] <urbanecm>	 kostajh: seems to be a one-time error; Special:Version appears to work fine on that server now.
[08:26:34] <kostajh>	 ack
[08:28:50] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:886343|GrowthExperiments: Disable leveling up features in production (T328757)]] (duration: 12m 11s)
[08:28:54] <stashbot>	 T328757: Leveling up: Define feature flag for gating the functionality - https://phabricator.wikimedia.org/T328757
[08:30:04] <wikibugs>	 (PS9) Ayounsi: Add BGP community to all k8s advertisments [deployment-charts] - https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523)
[08:30:22] <kostajh>	 urbanecm: do you want me to sync https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/883153 and its child patch?
[08:30:29] <wikibugs>	 (CR) Ayounsi: "Thanks for the help here and on IRC!" [deployment-charts] - https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: Ayounsi)
[08:30:29] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:51] <urbanecm>	 kostajh: sure, that'd be great.
[08:30:55] <kostajh>	 ack
[08:30:55] <moritzm>	 !log installing imagemagick security updates on Thumbor T328901
[08:30:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:59] <wikibugs>	 (PS2) Kosta Harlan: Remove GEMentorProvider [mediawiki-config] - https://gerrit.wikimedia.org/r/883153 (https://phabricator.wikimedia.org/T321501) (owner: Urbanecm)
[08:31:09] <urbanecm>	 i've removed my -2 on it. thanks!
[08:31:27] <wikibugs>	 (CR) Kosta Harlan: Remove GEMentorProvider (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/883153 (https://phabricator.wikimedia.org/T321501) (owner: Urbanecm)
[08:31:34] <wikibugs>	 (PS2) Kosta Harlan: [Growth] Remove mentor list variables [mediawiki-config] - https://gerrit.wikimedia.org/r/883236 (https://phabricator.wikimedia.org/T321501) (owner: Urbanecm)
[08:32:14] <kostajh>	 adding to the calendar
[08:33:29] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:34:17] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/883236 (https://phabricator.wikimedia.org/T321501) (owner: Urbanecm)
[08:34:19] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/883153 (https://phabricator.wikimedia.org/T321501) (owner: Urbanecm)
[08:34:46] <wikibugs>	 SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (ayounsi) Staging the new version on the switches: `asw-a-codfw> request system software add force-host set [ /var/tmp/jinstall-ex-4300-21.4R3-S1.5-signed...
[08:35:03] <wikibugs>	 (Merged) jenkins-bot: Remove GEMentorProvider [mediawiki-config] - https://gerrit.wikimedia.org/r/883153 (https://phabricator.wikimedia.org/T321501) (owner: Urbanecm)
[08:35:06] <wikibugs>	 (Merged) jenkins-bot: [Growth] Remove mentor list variables [mediawiki-config] - https://gerrit.wikimedia.org/r/883236 (https://phabricator.wikimedia.org/T321501) (owner: Urbanecm)
[08:35:30] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:883236|[Growth] Remove mentor list variables (T321501)]], [[gerrit:883153|Remove GEMentorProvider (T321501)]]
[08:35:33] <stashbot>	 T321501: Post-structured mentor list cleanup - https://phabricator.wikimedia.org/T321501
[08:35:39] <wikibugs>	 (CR) JMeybohm: [C: +1] Allow AS loops in eqiad staging k8s cluster [deployment-charts] - https://gerrit.wikimedia.org/r/886328 (https://phabricator.wikimedia.org/T328523) (owner: Ayounsi)
[08:37:20] <logmsgbot>	 !log kharlan@deploy1002 urbanecm and kharlan: Backport for [[gerrit:883236|[Growth] Remove mentor list variables (T321501)]], [[gerrit:883153|Remove GEMentorProvider (T321501)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[08:37:50] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, couple of minor things inline" [cookbooks] - https://gerrit.wikimedia.org/r/886038 (owner: Giuseppe Lavagetto)
[08:39:33] <wikibugs>	 (CR) Elukey: "Hi folks! I have another cookbook that uses this to upgrade a k8s cluster, do you have a target date to have it merged? No hurry just to s" [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[08:39:55] <urbanecm>	 kostajh: do you want me to test those?
[08:40:09] <wikibugs>	 (CR) Elukey: "This is currently blocked by https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/865057" [cookbooks] - https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[08:40:28] <kostajh>	 urbanecm: sure, they are ready for verification
[08:41:37] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Include wiktionary in page/html rerenders [deployment-charts] - https://gerrit.wikimedia.org/r/887082 (https://phabricator.wikimedia.org/T226931) (owner: Alexandros Kosiaris)
[08:42:04] <urbanecm>	 kostajh: cswiki continues to be on structured mentor list, so those should be good to go
[08:42:10] <kostajh>	 urbanecm: nothing seems broken 
[08:42:12] <urbanecm>	 yup
[08:45:02] <kostajh>	 (syncing)
[08:45:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:45:35] <wikibugs>	 SRE-swift-storage, Commons, Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (MatthewVernon) FWIW, I can't find any trace of that filename in the swift proxy logs ` cumin -x O:swift:...
[08:47:00] <wikibugs>	 (Merged) jenkins-bot: Include wiktionary in page/html rerenders [deployment-charts] - https://gerrit.wikimedia.org/r/887082 (https://phabricator.wikimedia.org/T226931) (owner: Alexandros Kosiaris)
[08:48:18] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:883236|[Growth] Remove mentor list variables (T321501)]], [[gerrit:883153|Remove GEMentorProvider (T321501)]] (duration: 12m 48s)
[08:48:24] <kostajh>	 done
[08:48:35] <stashbot>	 T321501: Post-structured mentor list cleanup - https://phabricator.wikimedia.org/T321501
[08:48:44] <kostajh>	 !log UTC morning deploys done
[08:48:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:30] <wikibugs>	 (CR) JMeybohm: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: Ayounsi)
[08:49:55] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:12] <vgutierrez>	 !log rolling upgrade to HAProxy 2.4.21 in cp nodes
[08:50:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:46] <wikibugs>	 (PS12) Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661)
[08:50:49] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:52:53] <wikibugs>	 (CR) Slyngshede: sre.ganeti.reimage: add new cookbook (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[08:56:29] <wikibugs>	 (CR) JMeybohm: [C: +1] "I was quite happy with this version, so +1 from me" [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[08:58:13] <wikibugs>	 (PS1) Filippo Giunchedi: admin: add santosh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/887272 (https://phabricator.wikimedia.org/T328517)
[08:59:56] <wikibugs>	 (PS1) Filippo Giunchedi: admin: add 'anil' to ldap users [puppet] - https://gerrit.wikimedia.org/r/887273 (https://phabricator.wikimedia.org/T328805)
[09:02:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast2002.wikimedia.org with OS bullseye
[09:02:56] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast2002.wikimedia.org with OS bullseye
[09:03:03] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Neat!" [puppet] - https://gerrit.wikimedia.org/r/884111 (https://phabricator.wikimedia.org/T325806) (owner: Cwhite)
[09:03:19] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/887273 (https://phabricator.wikimedia.org/T328805) (owner: Filippo Giunchedi)
[09:04:15] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] admin: add 'anil' to ldap users [puppet] - https://gerrit.wikimedia.org/r/887273 (https://phabricator.wikimedia.org/T328805) (owner: Filippo Giunchedi)
[09:05:18] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/887272 (https://phabricator.wikimedia.org/T328517) (owner: Filippo Giunchedi)
[09:05:44] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] admin: add santosh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/887272 (https://phabricator.wikimedia.org/T328517) (owner: Filippo Giunchedi)
[09:05:49] <wikibugs>	 (PS2) Filippo Giunchedi: admin: add santosh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/887272 (https://phabricator.wikimedia.org/T328517)
[09:06:11] <wikibugs>	 (CR) JMeybohm: Add sre.discovery.datacenter-route (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/886038 (owner: Giuseppe Lavagetto)
[09:06:46] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2] admin: add santosh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/887272 (https://phabricator.wikimedia.org/T328517) (owner: Filippo Giunchedi)
[09:07:51] <wikibugs>	 (PS1) Marostegui: mariadb: Enable notifications db1164 [puppet] - https://gerrit.wikimedia.org/r/887274
[09:08:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-logging-codfw cluster: Roll restart of jvm daemons.
[09:08:08] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T328517 (10fgiunchedi)
[09:08:41] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Enable notifications db1164 [puppet] - 10https://gerrit.wikimedia.org/r/887274 (owner: 10Marostegui)
[09:08:45] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T328517 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi No worries @Ottomata -- thanks for following up!  Access will be fully live in 30 min, resolving. Though pleas...
[09:15:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:19:20] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync
[09:19:31] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[09:19:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast2002.wikimedia.org with reason: host reimage
[09:20:01] <wikibugs>	 (03PS9) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038
[09:20:22] <akosiaris>	 !log add wiktionary to mobile-sections rerenders. T226931
[09:20:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:25] <stashbot>	 T226931: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931
[09:20:28] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[09:20:46] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[09:20:47] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:21:38] <wikibugs>	 10SRE, 10Cloud-VPS, 10cloud-services-team, 10observability, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10taavi)
[09:21:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto)
[09:21:47] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:22:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast2002.wikimedia.org with reason: host reimage
[09:22:55] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:23:18] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Data-Persistence, 10Infrastructure-Foundations, and 2 others: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10taavi)
[09:23:54] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[09:24:07] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[09:24:47] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.813 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:24:59] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:25:44] <wikibugs>	 10SRE, 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (10taavi)
[09:27:34] <wikibugs>	 10SRE, 10Cloud-VPS, 10DNS, 10Traffic: PDNS in cloud can return inconsistent answers - https://phabricator.wikimedia.org/T281700 (10taavi)
[09:30:27] <wikibugs>	 10SRE, 10Cloud-VPS, 10Traffic-Icebox, 10cloud-services-team: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10taavi) 05Open→03Resolved Per T249035#6122874 and lack of further updates.
[09:31:39] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team: Consider ways to make puppetmaster CA changes smoother on the puppet client end - https://phabricator.wikimedia.org/T220268 (10taavi)
[09:34:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of inlines comments. Nice approach!" [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto)
[09:35:53] <wikibugs>	 (03PS1) 10Nicolas Fraison: Add information and command rights on icinga to Nicolas Fraison [puppet] - 10https://gerrit.wikimedia.org/r/887276
[09:37:51] <wikibugs>	 (03PS1) 10Filippo Giunchedi: clinic-duty: update message text fetching [software] - 10https://gerrit.wikimedia.org/r/887279
[09:39:07] <wikibugs>	 10Puppet, 10SRE, 10Cloud-Services, 10Infrastructure-Foundations: The Rack Puppet master server is deprecated and will be removed in a future release. Please use Puppet Server instead. - https://phabricator.wikimedia.org/T185815 (10taavi) 05Open→03Invalid I suspect this will be fixed by the Puppet 7 upg...
[09:40:06] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto)
[09:40:28] <wikibugs>	 (03CR) 10Clément Goubert: Add sre.discovery.datacenter-route (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto)
[09:40:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] clinic-duty: update message text fetching [software] - 10https://gerrit.wikimedia.org/r/887279 (owner: 10Filippo Giunchedi)
[09:42:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast2002.wikimedia.org with OS bullseye
[09:42:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast2002.wikimedia.org with OS bullseye completed: - bast2002 (**PASS**)   - Downtimed on Icinga/Alertm...
[09:42:49] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:44:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast1003.wikimedia.org with OS bullseye
[09:44:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast1003.wikimedia.org with OS bullseye
[09:45:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:45:51] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: cloudgw: make VIPs use /32 netmask [puppet] - 10https://gerrit.wikimedia.org/r/887282 (https://phabricator.wikimedia.org/T295774)
[09:49:53] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:05] <wikibugs>	 (03PS5) 10EoghanGaffney: Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642)
[09:51:53] <wikibugs>	 (03PS1) 10Vgutierrez: admin_state: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/887284 (https://phabricator.wikimedia.org/T327925)
[09:53:07] <wikibugs>	 (03PS1) 10Elukey: profile::statistics::explorer::ml: add opencl packages [puppet] - 10https://gerrit.wikimedia.org/r/887285
[09:54:39] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39417/console" [puppet] - 10https://gerrit.wikimedia.org/r/887285 (owner: 10Elukey)
[09:56:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast1003.wikimedia.org with reason: host reimage
[09:57:45] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: add autoscaling to en/wikidata for goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey)
[09:58:41] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): Remove outdated invert-redis-sessions cookbook - https://phabricator.wikimedia.org/T329020 (10Clement_Goubert)
[10:00:21] <wikibugs>	 (03PS1) 10Clément Goubert: sre.switchdc.mediawiki: Remove 05-invert-redis-sessions [cookbooks] - 10https://gerrit.wikimedia.org/r/887286 (https://phabricator.wikimedia.org/T329020)
[10:00:21] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:06] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert)
[10:01:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast1003.wikimedia.org with reason: host reimage
[10:02:02] <wikibugs>	 (03PS10) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038
[10:02:26] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.switchdc.mediawiki: Remove 05-invert-redis-sessions [cookbooks] - 10https://gerrit.wikimedia.org/r/887286 (https://phabricator.wikimedia.org/T329020) (owner: 10Clément Goubert)
[10:02:32] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto)
[10:02:54] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] sre.switchdc.mediawiki: Remove 05-invert-redis-sessions [cookbooks] - 10https://gerrit.wikimedia.org/r/887286 (https://phabricator.wikimedia.org/T329020) (owner: 10Clément Goubert)
[10:03:03] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations: Expose public hostname as Fact in puppet - https://phabricator.wikimedia.org/T101903 (10taavi)
[10:03:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/887276 (owner: 10Nicolas Fraison)
[10:03:26] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Add BGP community to all k8s advertisments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi)
[10:03:31] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations: Expose public hostname as Fact in puppet - https://phabricator.wikimedia.org/T101903 (10taavi) 05Open→03Declined Per above.
[10:04:00] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto)
[10:04:37] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:04:44] <wikibugs>	 (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Remove 05-invert-redis-sessions [cookbooks] - 10https://gerrit.wikimedia.org/r/887286 (https://phabricator.wikimedia.org/T329020) (owner: 10Clément Goubert)
[10:04:57] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "LGTM :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto)
[10:05:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. The stat hosts are still on Buster, which doesn't have s3cmd, but it's present in buster-backports and will get installed from" [puppet] - 10https://gerrit.wikimedia.org/r/887285 (owner: 10Elukey)
[10:08:46] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: eqiad1: cloudgw: make VIPs use /32 netmask [puppet] - 10https://gerrit.wikimedia.org/r/887288 (https://phabricator.wikimedia.org/T295774)
[10:12:04] <logmsgbot>	 !log oblivian@cumin2002 START - Cookbook sre.discovery.datacenter-route pool all active/active services in eqiad: Pooling eqiad for codfw depool today
[10:12:09] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: Remove outdated invert-redis-sessions cookbook - https://phabricator.wikimedia.org/T329020 (10Clement_Goubert) 05In progress→03Resolved Wikitech page updated https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacente...
[10:12:19] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert)
[10:13:52] <logmsgbot>	 !log oblivian@cumin2002 END (FAIL) - Cookbook sre.discovery.datacenter-route (exit_code=93) pool all active/active services in eqiad: Pooling eqiad for codfw depool today
[10:15:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:16:32] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team (Seen), 10User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (10taavi)
[10:17:04] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10taavi)
[10:17:13] <wikibugs>	 (03PS11) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038
[10:17:23] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Technical-Debt: Uniform cluster nomenclature across puppet - https://phabricator.wikimedia.org/T159411 (10taavi)
[10:17:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast1003.wikimedia.org with OS bullseye
[10:17:43] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast1003.wikimedia.org with OS bullseye completed: - bast1003 (**PASS**)   - Downtimed on Icinga/Alertm...
[10:18:43] <wikibugs>	 (03PS12) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038
[10:19:29] <logmsgbot>	 !log oblivian@cumin2002 START - Cookbook sre.discovery.datacenter-route pool all active/active services in eqiad: Pooling eqiad for codfw depool today
[10:19:39] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::statistics::explorer::ml: add opencl packages [puppet] - 10https://gerrit.wikimedia.org/r/887285 (owner: 10Elukey)
[10:19:51] <logmsgbot>	 !log oblivian@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter-route (exit_code=0) pool all active/active services in eqiad: Pooling eqiad for codfw depool today
[10:22:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] admin: add user santhosh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/885842 (https://phabricator.wikimedia.org/T328517) (owner: 10Herron)
[10:23:31] <wikibugs>	 (03PS13) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038
[10:25:36] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto)
[10:25:49] <wikibugs>	 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10jijiki)
[10:26:15] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10MoritzMuehlenhoff)
[10:26:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/884111 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite)
[10:26:36] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete
[10:26:50] <wikibugs>	 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10taavi)
[10:34:58] <wikibugs>	 (03PS1) 10Jbond: haproxy: force a full restart when /etc/defaults/haproxy is changed [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684)
[10:35:06] <wikibugs>	 (03PS14) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038
[10:35:55] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Data-Persistence, 10Infrastructure-Foundations, and 3 others: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10jbond) I have created a [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/887292 | CR ]] to force  full...
[10:38:05] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto)
[10:39:53] <wikibugs>	 (03Merged) 10jenkins-bot: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto)
[10:45:23] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:46:43] <wikibugs>	 (03PS5) 10Nicolas Fraison: Add nfraison to ops group [puppet] - 10https://gerrit.wikimedia.org/r/886891 (https://phabricator.wikimedia.org/T328915)
[10:47:33] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:47:48] <wikibugs>	 (03CR) 10Volans: haproxy: force a full restart when /etc/defaults/haproxy is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond)
[10:48:36] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add nfraison to ops group [puppet] - 10https://gerrit.wikimedia.org/r/886891 (https://phabricator.wikimedia.org/T328915) (owner: 10Nicolas Fraison)
[10:48:53] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] haproxy: force a full restart when /etc/defaults/haproxy is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond)
[10:49:01] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-logging-codfw cluster: Roll restart of jvm daemons.
[10:50:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:51:41] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-logging-eqiad cluster: Roll restart of jvm daemons.
[10:55:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm but seems like there is a lot of repetition between the two reimage cookbooks would be nice to have the shared code in one place.  ho" [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede)
[10:56:10] <wikibugs>	 (03PS1) 10Majavah: openstack: nova: restrict rebuilds to admins [puppet] - 10https://gerrit.wikimedia.org/r/887299 (https://phabricator.wikimedia.org/T302404)
[10:57:54] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/httpbb] - 10https://gerrit.wikimedia.org/r/886203 (owner: 10RLazarus)
[10:58:47] <wikibugs>	 (03CR) 10Jbond: haproxy: force a full restart when /etc/defaults/haproxy is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1100)
[11:02:47] <wikibugs>	 (03CR) 10Volans: "reply to comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede)
[11:04:35] <wikibugs>	 (03PS1) 10Muehlenhoff: Assign installserver role to install6002 [puppet] - 10https://gerrit.wikimedia.org/r/887302 (https://phabricator.wikimedia.org/T327867)
[11:05:13] <wikibugs>	 (03PS2) 10Jbond: haproxy: force a full restart when /etc/defaults/haproxy is changed [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684)
[11:06:01] <wikibugs>	 (03CR) 10Jbond: haproxy: force a full restart when /etc/defaults/haproxy is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond)
[11:06:35] <wikibugs>	 (03CR) 10Jbond: haproxy: force a full restart when /etc/defaults/haproxy is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond)
[11:07:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Assign installserver role to install6002 [puppet] - 10https://gerrit.wikimedia.org/r/887302 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[11:07:45] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] sre.ganeti.reimage: add new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede)
[11:10:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev: cloudgw: make VIPs use /32 netmask [puppet] - 10https://gerrit.wikimedia.org/r/887282 (https://phabricator.wikimedia.org/T295774) (owner: 10Arturo Borrero Gonzalez)
[11:15:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:17:07] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "DONT MERGE YET. Still testing the effects in codfw1dev." [puppet] - 10https://gerrit.wikimedia.org/r/887288 (https://phabricator.wikimedia.org/T295774) (owner: 10Arturo Borrero Gonzalez)
[11:17:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Point webproxy in drmrs to install6002 [dns] - 10https://gerrit.wikimedia.org/r/887304 (https://phabricator.wikimedia.org/T327867)
[11:19:25] <wikibugs>	 (03PS1) 10Muehlenhoff: Update tftp server settings for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/887305 (https://phabricator.wikimedia.org/T327867)
[11:20:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:21:55] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2044.codfw.wmnet with OS bullseye
[11:29:50] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1041.eqiad.wmnet with OS bullseye
[11:30:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:30:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Point webproxy in drmrs to install6002 [dns] - 10https://gerrit.wikimedia.org/r/887304 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[11:31:11] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, just a minor nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede)
[11:31:41] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10jbond)
[11:33:14] <moritzm>	 !log installing imagemagick security updates on buster
[11:33:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:17] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:37:55] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2044.codfw.wmnet with reason: host reimage
[11:40:58] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2044.codfw.wmnet with reason: host reimage
[11:41:32] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1041.eqiad.wmnet with reason: host reimage
[11:41:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] postgrers::user::hba: drop hba_label and use title instead [puppet] - 10https://gerrit.wikimedia.org/r/886912 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond)
[11:44:35] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1041.eqiad.wmnet with reason: host reimage
[11:52:17] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-logging-eqiad cluster: Roll restart of jvm daemons.
[11:52:25] <wikibugs>	 (03PS3) 10Aklapper: mediawiki: Better error page layout on mobile devices [puppet] - 10https://gerrit.wikimedia.org/r/405058 (https://phabricator.wikimedia.org/T182247) (owner: 10Phantom42)
[11:56:36] <wikibugs>	 (03PS10) 10Jbond: postgres::user: add hostname support to postgres user define [puppet] - 10https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783)
[11:56:46] <marostegui>	 !log Install 10.4.28 on db1152 T329011
[11:56:47] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2044.codfw.wmnet with OS bullseye
[11:56:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:50] <stashbot>	 T329011: Compile and package MariaDB 10.4.28 and 10.6.12 - https://phabricator.wikimedia.org/T329011
[11:58:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Allow managing drmrs DHCP settings with Homer - https://phabricator.wikimedia.org/T328737 (10cmooney)
[11:58:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635 (10cmooney)
[11:59:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Allow managing drmrs DHCP settings with Homer - https://phabricator.wikimedia.org/T328737 (10cmooney) Thanks @MoritzMuehlenhoff, I can roll this into the work to unify the asw configs across the board.  We have it automated for similar switches (lsw, cloudsw) el...
[12:00:07] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1041.eqiad.wmnet with OS bullseye
[12:04:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update tftp server settings for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/887305 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[12:04:27] <wikibugs>	 (03PS11) 10Jbond: postgres::user: add hostname support to postgres user define [puppet] - 10https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783)
[12:07:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] postgres::user: add hostname support to postgres user define [puppet] - 10https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond)
[12:08:48] <wikibugs>	 (03PS8) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783)
[12:13:02] <wikibugs>	 (03CR) 10Ladsgroup: "I would really prefer to get T326147 done before this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886905 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler)
[12:17:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm6001.drmrs.wmnet
[12:17:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:17:56] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10herron)
[12:19:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm6001.drmrs.wmnet - jmm@cumin2002"
[12:20:35] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me too." [puppet] - 10https://gerrit.wikimedia.org/r/887276 (owner: 10Nicolas Fraison)
[12:20:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm6001.drmrs.wmnet - jmm@cumin2002"
[12:20:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:20:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache testvm6001.drmrs.wmnet on all recursors
[12:20:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm6001.drmrs.wmnet on all recursors
[12:21:12] <wikibugs>	 (03PS9) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783)
[12:24:43] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on doh2001.wikimedia.org with reason: depooled; T327925
[12:24:46] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[12:24:58] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on doh2001.wikimedia.org with reason: depooled; T327925
[12:25:35] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[12:25:52] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] admin_state: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/887284 (https://phabricator.wikimedia.org/T327925) (owner: 10Vgutierrez)
[12:26:22] <wikibugs>	 (03PS1) 10Ladsgroup: Migrate Babel config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887307 (https://phabricator.wikimedia.org/T308932)
[12:26:34] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ssingh)
[12:28:11] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:28:19] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:28:33] <sukhe>	 ^ expected, doh2001
[12:28:45] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:28:55] <vgutierrez>	 !log depooling authdns2001 - T327925
[12:28:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:03] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert)
[12:29:21] <icinga-wm>	 PROBLEM - Check systemd state on puppetdb2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:29:23] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:29:29] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert)
[12:29:37] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/887276 (owner: 10Nicolas Fraison)
[12:29:47] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert)
[12:30:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:30:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm6001.drmrs.wmnet
[12:31:05] <icinga-wm>	 RECOVERY - Check systemd state on puppetdb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:17] <wikibugs>	 (03PS10) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783)
[12:31:57] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Commons: `Manifestation pour la défense des retraites du 31 janvier 2023 - Flickr - Jeanne Menjoulet.jpg` not found in Commons - https://phabricator.wikimedia.org/T328889 (10Shizhao) 05Open→03Invalid >>! In T328889#8591463, @Dzahn wrote: > Maybe it was fixed since this ticket was...
[12:33:06] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert)
[12:35:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:35:57] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Don't add custom attributes in unwrapParsoidSections() [extensions/DiscussionTools] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886980 (https://phabricator.wikimedia.org/T328268)
[12:38:11] <wikibugs>	 (03CR) 10Nicolas Fraison: [C: 03+2] Add information and command rights on icinga to Nicolas Fraison [puppet] - 10https://gerrit.wikimedia.org/r/887276 (owner: 10Nicolas Fraison)
[12:38:41] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove Puppet references to theemin [puppet] - 10https://gerrit.wikimedia.org/r/887309
[12:39:18] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Vgutierrez)
[12:39:54] <wikibugs>	 (03PS11) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783)
[12:40:52] <wikibugs>	 (03PS13) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661)
[12:41:04] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Joe) To depool all services in codfw we will just need to run:  ` sudo cookbook sre.discovery.datacenter-route --reason 'T327925' depool codfw `  from on...
[12:41:43] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) @Joe @akosiaris I assume we'll depool codfw for this one too?
[12:42:57] <wikibugs>	 (03CR) 10Vgutierrez: haproxy: force a full restart when /etc/defaults/haproxy is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond)
[12:43:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Don't add custom attributes in unwrapParsoidSections() [extensions/DiscussionTools] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886980 (https://phabricator.wikimedia.org/T328268) (owner: 10Bartosz Dziewoński)
[12:43:43] <wikibugs>	 (03PS1) 10Muehlenhoff: Add testvm6001 [puppet] - 10https://gerrit.wikimedia.org/r/887310 (https://phabricator.wikimedia.org/T327867)
[12:43:53] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:44:47] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) See https://commons.wikimedia.org/w/index.php?title=Commons%3AVillage_pump&diff=prev&oldid=7306822...
[12:46:11] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Joe) Please note: this won't depool `docker-registry`, which will still be active in codfw for the duration of the maintenance.
[12:48:13] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert)
[12:50:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove Puppet references to theemin [puppet] - 10https://gerrit.wikimedia.org/r/887309 (owner: 10Muehlenhoff)
[12:50:41] <wikibugs>	 (03PS12) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783)
[12:51:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39426/console" [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond)
[12:53:55] <wikibugs>	 (03PS2) 10Muehlenhoff: Add testvm6001 [puppet] - 10https://gerrit.wikimedia.org/r/887310 (https://phabricator.wikimedia.org/T327867)
[12:54:35] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Pin PHPUnit to 9.5.x [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886981 (https://phabricator.wikimedia.org/T328741)
[12:55:08] <wikibugs>	 (03PS14) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661)
[12:55:10] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Don't add custom attributes in unwrapParsoidSections() [extensions/DiscussionTools] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886980 (https://phabricator.wikimedia.org/T328268)
[12:55:28] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert)
[12:56:58] <wikibugs>	 (03PS13) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783)
[12:58:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add testvm6001 [puppet] - 10https://gerrit.wikimedia.org/r/887310 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[12:58:17] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Yeah, makes sense to make other wmf.21 patches work I guess." [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886981 (https://phabricator.wikimedia.org/T328741) (owner: 10Bartosz Dziewoński)
[12:58:27] <wikibugs>	 (03PS15) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661)
[12:58:33] <wikibugs>	 (03CR) 10Slyngshede: sre.ganeti.reimage: add new cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede)
[13:00:39] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:00:46] <wikibugs>	 (03CR) 10EoghanGaffney: "Hey Filippo, wondering if you have an opinion on this -- this option should work, but I don't know if we'd be better exploring another app" [puppet] - 10https://gerrit.wikimedia.org/r/886361 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney)
[13:02:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Drop installserver role from install6001 [puppet] - 10https://gerrit.wikimedia.org/r/887311 (https://phabricator.wikimedia.org/T327867)
[13:03:41] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:48] <jbond>	 !log disable puppet in codfw, ulsfo and esams for switch upgrade T327925
[13:05:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:52] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[13:06:19] <wikibugs>	 (03PS2) 10Muehlenhoff: Drop installserver role from install6001 [puppet] - 10https://gerrit.wikimedia.org/r/887311 (https://phabricator.wikimedia.org/T327867)
[13:08:25] <wikibugs>	 (03PS3) 10Muehlenhoff: Drop installserver role from install6001 [puppet] - 10https://gerrit.wikimedia.org/r/887311 (https://phabricator.wikimedia.org/T327867)
[13:10:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Drop installserver role from install6001 [puppet] - 10https://gerrit.wikimedia.org/r/887311 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[13:11:14] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] ml-services: add autoscaling to en/wikidata for goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey)
[13:11:17] <logmsgbot>	 !log oblivian@cumin2002 START - Cookbook sre.discovery.datacenter-route depool all active/active services in codfw: T327925
[13:11:21] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[13:12:59] <jbond>	 !log enable puppet in codfw, ulsfo and esams to allow depools post switch upgrade T327925
[13:13:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:55] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:15:25] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:15:38] <wikibugs>	 (03PS1) 10Jbond: wmnet: swap esams and esqin for the puppet CNAME [dns] - 10https://gerrit.wikimedia.org/r/887314
[13:17:50] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "lgtm!" [dns] - 10https://gerrit.wikimedia.org/r/887314 (owner: 10Jbond)
[13:19:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:23:28] <wikibugs>	 (03PS14) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783)
[13:23:30] <wikibugs>	 (03PS1) 10Jbond: postgresql::user: fix filter statement [puppet] - 10https://gerrit.wikimedia.org/r/887316
[13:24:25] <icinga-wm>	 PROBLEM - TFTP service on install6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd
[13:24:57] <icinga-wm>	 PROBLEM - HTTP on install6001 is CRITICAL: connect to address 185.15.58.7 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Install_servers
[13:25:23] <icinga-wm>	 PROBLEM - Squid on install6001 is CRITICAL: connect to address 185.15.58.7 and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/HTTP_proxy
[13:26:10] <moritzm>	 ^ expected, replaced by install6002
[13:26:50] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10jcrespo)
[13:27:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] postgresql::user: fix filter statement [puppet] - 10https://gerrit.wikimedia.org/r/887316 (owner: 10Jbond)
[13:27:14] <wikibugs>	 (03PS1) 10Jelto: sre.gitlab.upgrade: Add rollback() method and improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569)
[13:28:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.gitlab.upgrade: Add rollback() method and improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[13:29:38] <wikibugs>	 (03PS15) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783)
[13:30:18] <wikibugs>	 (03PS2) 10Vgutierrez: admin_state: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/887284 (https://phabricator.wikimedia.org/T327925)
[13:30:27] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:30:48] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39431/console" [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond)
[13:31:10] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] admin_state: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/887284 (https://phabricator.wikimedia.org/T327925) (owner: 10Vgutierrez)
[13:31:50] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 199 hosts with reason: codfw row A upgrade
[13:31:50] <vgutierrez>	 !log depool codfw edge site - T327925
[13:31:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:55] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[13:32:43] <logmsgbot>	 !log oblivian@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter-route (exit_code=0) depool all active/active services in codfw: T327925
[13:33:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for" [puppet] - 10https://gerrit.wikimedia.org/r/886361 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney)
[13:33:33] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:53] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ayounsi) For the record, full row hosts downtime done with: `sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row A upgrade" -t T327925 'P{P:netbox::...
[13:33:59] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 199 hosts with reason: codfw row A upgrade
[13:34:20] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=295bf4d5-8856-488b-9ca9-06a0ff06db18) set by ayounsi@cumin1001 for 2:00:00 on 199 host(s...
[13:36:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Sounds good" [dns] - 10https://gerrit.wikimedia.org/r/887314 (owner: 10Jbond)
[13:37:41] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond)
[13:37:43] <wikibugs>	 (03PS2) 10Jelto: sre.gitlab.upgrade: Add rollback() method and improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569)
[13:41:39] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] "I think in this case it will be fine to just deploy this.  eventgate-analytics-external should have a short lived cache this stream config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 (owner: 10Sharvaniharan)
[13:45:25] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:06] <wikibugs>	 (03CR) 10Jbond: "updated" [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond)
[13:49:01] <Lucas_WMDE>	 jouncebot: nowandnext
[13:49:01] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 10 minute(s)
[13:49:01] <jouncebot>	 In 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1400)
[13:49:01] <jouncebot>	 In 0 hour(s) and 10 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1400)
[13:49:25] <Lucas_WMDE>	 I’ll already backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/886981 to save a bit of time during the window
[13:49:28] <Lucas_WMDE>	 should be a no-op
[13:49:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886981 (https://phabricator.wikimedia.org/T328741) (owner: 10Bartosz Dziewoński)
[13:49:43] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[13:53:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:54:24] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) @Jhancock.wm can you please check the switch port for mw2423, it looks like i have already a server connected to port 41.  Thanks
[13:54:25] <icinga-wm>	 RECOVERY - Check systemd state on puppetdb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:54:37] <wikibugs>	 (03PS11) 10Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729)
[13:54:39] <wikibugs>	 (03PS1) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729)
[13:54:50] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mw2422 and 24 DNS - pt1979@cumin2002"
[13:55:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott)
[13:55:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:55:52] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mw2422 and 24 DNS - pt1979@cumin2002"
[13:55:52] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:56:15] <Emperor>	 !log depool ms-fe2009 T327925
[13:56:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:19] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[13:56:23] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:56:29] <XioNoX>	 Lucas_WMDE: We have a large maintenance starting in 5min, it shouldn't be more than 30min downtime, is it ok to postpone any deployment?
[13:56:55] <Lucas_WMDE>	 uh
[13:57:07] <Lucas_WMDE>	 does that mean the UTC afternoon backport window won’t be happening?
[13:57:40] <Lucas_WMDE>	 my backport on its own isn’t important, but MatmaRex wanted to backport a DiscussionTools fix that depends on it
[13:57:45] <MatmaRex>	 hi
[13:57:53] <MatmaRex>	 :o
[13:58:03] <MatmaRex>	 is something down?
[13:58:16] <Lucas_WMDE>	 it’s about to be, apparently
[13:58:37] <Lucas_WMDE>	 I assume this is the row switch stuff, but I didn’t know that was going to overlap with the backport window
[13:58:40] <XioNoX>	 that's the main task for the maintenance https://phabricator.wikimedia.org/T327925
[13:58:47] <XioNoX>	 yeah I didn't know either
[13:59:04] <MatmaRex>	 i will be sad, but i can backport later too
[13:59:05] <wikibugs>	 (03PS2) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729)
[13:59:08] <MatmaRex>	 there wasn't anything on the calendar :/
[13:59:24] <Lucas_WMDE>	 I’ll remove my +2 then
[13:59:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott)
[13:59:26] <Lucas_WMDE>	 and cancel the deploy
[13:59:57] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Rescinding deployment, T327925 is about to happen." [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886981 (https://phabricator.wikimedia.org/T328741) (owner: 10Bartosz Dziewoński)
[13:59:57] <XioNoX>	 !log disable puppet in ulsfo/esams/codfw for codfw row A switch upgrade - T327925
[13:59:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1400).
[14:00:04] <jouncebot>	 ottomata and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1400)
[14:00:09] <XioNoX>	 MatmaRex: which calendar? I can try to update it there as well
[14:00:11] * urbanecm waves
[14:00:17] <wikibugs>	 (03PS5) 10Ottomata: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017)
[14:00:23] <Lucas_WMDE>	 XioNoX: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1400
[14:00:30] <ottomata>	 o/
[14:00:52] <urbanecm>	 but it looks like the window's postponed, and that Lucas_WMDE is around to staff it
[14:00:57] <Lucas_WMDE>	 I’m not around, actually
[14:00:58] <MatmaRex>	 XioNoX: https://wikitech.wikimedia.org/wiki/Deployments
[14:01:01] <Lucas_WMDE>	 I’m in a meeting now
[14:01:10] <Lucas_WMDE>	 but I think the whole window isn’t happening
[14:01:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:01:27] <urbanecm>	 ah
[14:01:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[14:01:46] <Lucas_WMDE>	 and whoever scheduled that switch maintenance, next time please put it in the deployment calendar?
[14:02:17] <ottomata>	 oh, backport window not happening?
[14:02:30] <Lucas_WMDE>	 probably not, due to https://phabricator.wikimedia.org/T327925 being scheduled for the same time
[14:02:36] <Lucas_WMDE>	 without, apparently, anybody realising it until five minutes ago
[14:02:44] <urbanecm>	 ottomata: XioNoX said something about a switch maintenance
[14:02:59] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, wording nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[14:03:03] <claime>	 Basically, all of row A in codfw will be unreachable
[14:03:26] <claime>	 To facilitate the maintenance, we decided to depool codfw completely
[14:03:51] <Lucas_WMDE>	 if the switch maintenance only takes 30 minutes, someone™ could in theory deploy at least part of the changes afterwards
[14:03:54] <ottomata>	 hm, okay.  does that mean we can't deploy?   
[14:03:57] <Lucas_WMDE>	 (but not me, I’ll be in said meeting until the end of the hour)
[14:04:03] <ottomata>	 someone special hopefully :)
[14:04:16] <ottomata>	 iiuc depooling shouldn't affect scap deployments?
[14:04:26] <urbanecm>	 depends on how it's done
[14:04:32] <ottomata>	 and none of the servers listed in that ticket are mw app servers
[14:04:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:04:42] <ottomata>	 anyway, prob best to wait
[14:04:42] <ottomata>	 okay
[14:04:53] <ottomata>	 would love it if a deployer could help when the switch thing is done
[14:05:08] <claime>	 There's mw[2291-2309,2377-2411] and parse[2001-2005]
[14:05:21] <ottomata>	 OH, oops
[14:05:21] <XioNoX>	 yeah in theory the deployment could happen, but it's mostly to minimize the number of changes and moving parts in the same time window
[14:05:23] <ottomata>	 okay.  missed that
[14:05:26] <ottomata>	 yeah
[14:05:28] <urbanecm>	 gotcha
[14:05:29] <ottomata>	 let's wait
[14:05:44] <urbanecm>	 could someone ping me after said maintenance finishes? i'll try to deploy some patches of the backport window at least
[14:06:23] <ottomata>	 yes thanks i'll do that urbanecm 
[14:06:29] <urbanecm>	 ty
[14:06:29] <wikibugs>	 (03PS3) 10Jelto: sre.gitlab.upgrade: Add rollback() method and improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569)
[14:06:48] <Lucas_WMDE>	 thanks
[14:06:51] <XioNoX>	 !log asw-a-codfw> request system reboot all-members  - T327925
[14:06:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:54] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[14:07:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 backport aborted:  (duration: 17m 46s)
[14:07:49] <Lucas_WMDE>	 (`scap backport` didn’t finish on its own after the core gate-and-submit finished but didn’t merge, so I Ctrl+Ced it)
[14:07:54] * Lucas_WMDE done
[14:08:04] <claime>	 ack Lucas_WMDE, thanks
[14:08:45] <jbond>	 !log disable puppet in codfw, ulsfo, esams for switch upgrade T327925
[14:08:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:09:02] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[14:09:03] <icinga-wm>	 PROBLEM - Host asw-a-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:09:23] <icinga-wm>	 PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:09:23] <icinga-wm>	 PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:10:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:01] <volans>	 XioNoX: expected  ^^^^?
[14:10:06] <volans>	 the asw-X
[14:10:18] <jinxer-wm>	 (ProbeDown) firing: Service thanos-web:443 has failed probes (http_thanos-web_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-web:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:10:21] <XioNoX>	 volans: it's their mgmt interface
[14:10:25] <jbond>	 restbase i saw an alert for earlier so may be unrelated
[14:10:26] <volans>	 I guess it's just the mgmt and that the check was quicker than the dependency in icinga
[14:10:38] <volans>	 page acked
[14:10:48] <XioNoX>	 volans: I downtimed mr1 but looks like icinga considered them as down before seeing mr1 as down, so parent/child didn't kick in
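The ordering problem described above can be illustrated with a toy model. This is only a sketch of the general parent/child reachability idea, not Icinga's actual implementation; the function name `classify_host` and the state strings are assumptions chosen for illustration.

```python
def classify_host(ping_ok, parent_states):
    """Toy reachability rule (an assumption, not Icinga's code).

    A failing host is reported UNREACHABLE only when every parent is already
    recorded as DOWN or UNREACHABLE; otherwise it is reported DOWN.
    """
    if ping_ok:
        return "UP"
    if parent_states and all(s in ("DOWN", "UNREACHABLE") for s in parent_states):
        return "UNREACHABLE"
    return "DOWN"


# The child is checked before the monitor has recorded its parent as down,
# so the child is reported DOWN and a notification goes out.
early = classify_host(False, ["UP"])

# The same failing check after the parent's state has been updated.
late = classify_host(False, ["DOWN"])
```

In this model the result depends only on what the monitor already knows about the parent at check time, which is why a fast child check can win the race against the parent's own state change.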
[14:11:03] <icinga-wm>	 PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:11:05] <volans>	 what about thanos?
[14:11:07] <icinga-wm>	 PROBLEM - Host ripe-atlas-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[14:11:35] <icinga-wm>	 PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[14:11:43] <volans>	 godog: should we worry about thanos?
[14:12:09] <godog>	 volans: no, I'll pool another host
[14:12:22] <godog>	 thanos-web is the web interface, not a huge deal and it is active/active
[14:12:29] <volans>	 ack
[14:12:32] <jinxer-wm>	 (virtual-chassis crash) firing: Alert for device asw-a-codfw.mgmt.codfw.wmnet - virtual-chassis crash   - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash
[14:13:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2383.codfw.wmnet, mw2301.codfw.wmnet, mw2391.codfw.wmnet, mw2378.codfw.wmnet, mw2387.codfw.wmnet are marked down but pooled: thanos-web_443: Servers thanos-fe2001.codfw.wmnet are marked down but pooled: api-https_443: Servers mw2295.codfw.wmnet, mw2302.codfw.wmnet, mw2293.codfw.wmnet, mw2298.codfw.wmnet, mw2372.codfw.w
[14:13:05] <icinga-wm>	 2299.codfw.wmnet, mw2400.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:13:17] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:13:17] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:13:20] <claime>	 errr
[14:13:23] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:13:25] <icinga-wm>	 PROBLEM - VRRP status on cr1-codfw is CRITICAL: VRRP CRITICAL - 3 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[14:13:31] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2407.codfw.wmnet, mw2391.codfw.wmnet, mw2408.codfw.wmnet, mw2389.codfw.wmnet, mw2384.codfw.wmnet are marked down but pooled: thanos-web_443: Servers thanos-fe2001.codfw.wmnet are marked down but pooled: api-https_443: Servers mw2396.codfw.wmnet, mw2295.codfw.wmnet, mw2298.codfw.wmnet, mw2356.codfw.wmnet, mw2402.codfw.w
[14:13:31] <icinga-wm>	 2299.codfw.wmnet, mw2294.codfw.wmnet, mw2405.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:13:44] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=thanos-fe2001.codfw.wmnet,service=thanos-web
[14:13:51] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thanos-fe2002.codfw.wmnet,service=thanos-web
[14:13:55] <volans>	 claime: weren't the mediawiki hosts depooled?
[14:13:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-staging2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:13:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:14:03] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:14:03] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) kubernetes2007.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:14:05] <claime>	 Well  they were supposed to be
[14:14:07] <icinga-wm>	 PROBLEM - MariaDB Replica IO: backup1-codfw on db2184 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2183.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2183.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:14:07] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:14:35] <marostegui>	 jynus: ^ want me to take care of that?
[14:14:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[14:14:49] <jinxer-wm>	 (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[14:14:54] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[14:15:04] <jynus>	 marostegui: I thought you had handled it
[14:15:06] <jynus>	 I can
[14:15:16] <wikibugs>	 SRE, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (akosiaris) >>! In T327991#8593396, @Marostegui wrote: > @Joe @akosiaris I assume we'll depool codfw for this one too?  Yeah, as a team we are similarly a...
[14:15:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:19] <godog>	 the thanos compact alert is fine
[14:15:23] <jinxer-wm>	 (ProbeDown) resolved: Service thanos-web:443 has failed probes (http_thanos-web_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-web:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:15:40] <marostegui>	 jynus: I didn't see that one on the list of things
[14:15:41] <volans>	 thanks godog 
[14:15:45] <jinxer-wm>	 (JobUnavailable) firing: (25) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:16:09] <XioNoX>	 godog: should it page if it's not a huge deal? (thanos-web)
[14:16:10] <jynus>	 marostegui: maybe another case of desync between setup and maintenance
[14:16:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:17:05] <marostegui>	 jynus: yeah, i didn't see that one at https://phabricator.wikimedia.org/T327925 I saw db2183 but I didn't know it had that replica
[14:17:19] <godog>	 XioNoX: yeah that's fair, probably we don't need to page, I'll send a review after the maint
[14:17:43] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:17:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:18:27] <claime>	 Depooling mw appservers
[14:18:30] <jynus>	 no prob, I have downtimed it and will check it afterwards
[14:18:36] <volans>	 claime: thx
[14:18:37] <marostegui>	 thanks
[14:18:46] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=api_appserver
[14:18:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:18:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:18:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (5) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:19:04] <jinxer-wm>	 (ProbeDown) resolved: (3) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:19:35] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=jobrunner
[14:19:45] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=appserver
[14:19:48] <wikibugs>	 (PS1) EoghanGaffney: Sends the otrs.Daemon.pl log messages to kafka/logstash [puppet] - https://gerrit.wikimedia.org/r/887321
[14:19:59] <wikibugs>	 (CR) Elukey: "Thanks a lot for the quick round of reviews folks! I think that we are ready to merge?" [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[14:19:59] <icinga-wm>	 RECOVERY - Host asw-a-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.23 ms
[14:20:03] <icinga-wm>	 RECOVERY - Host asw-c-codfw is UP: PING WARNING - Packet loss = 33%, RTA = 33.76 ms
[14:20:03] <icinga-wm>	 RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.57 ms
[14:20:07] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01001 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:20:17] <icinga-wm>	 RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms
[14:20:35] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: (2) Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration  - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:20:39] <icinga-wm>	 RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.50 ms
[14:20:43] <icinga-wm>	 RECOVERY - VRRP status on cr1-codfw is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[14:20:45] <jinxer-wm>	 (JobUnavailable) firing: (25) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:20:47] <wikibugs>	 (CR) EoghanGaffney: "Massively simplified approach to add the otrs logs to kafka!" [puppet] - https://gerrit.wikimedia.org/r/887321 (owner: EoghanGaffney)
[14:20:47] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:21:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:21:23] <icinga-wm>	 RECOVERY - MariaDB Replica IO: backup1-codfw on db2184 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:21:30] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=parsoid
[14:21:46] <wikibugs>	 (Abandoned) EoghanGaffney: Separate log messages from otrs.Daemon.pl to its own log file [puppet] - https://gerrit.wikimedia.org/r/886361 (https://phabricator.wikimedia.org/T321759) (owner: EoghanGaffney)
[14:23:01] <icinga-wm>	 RECOVERY - Host ripe-atlas-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 31.92 ms
[14:23:19] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:23:23] <jinxer-wm>	 (ProbeDown) firing: (4) Service api-https:443 has failed probes (http_api-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:23:34] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:23:40] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:23:45] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:23:47] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:57] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[14:24:03] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-staging2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:24:07] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (2) ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:24:12] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:24:16] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (5) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:24:19] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.28:443, 10.2.1.22:443, 10.2.1.26:443, 10.2.1.1:443]) https://wikitech.wikimedia.org/wiki/PyBal
[14:24:19] <jinxer-wm>	 (ProbeDown) firing: (6) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:24:22] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] ml-services: add autoscaling to en/wikidata for goodfaith [deployment-charts] - https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[14:24:24] <XioNoX>	 Lucas_WMDE, MatmaRex, the upgrade itself is successful, we're doing customary checks
[14:24:39] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Check console cable for asw-a2-codfw - https://phabricator.wikimedia.org/T329055 (cmooney) p:Triage→Low
[14:24:43] <MatmaRex>	 nice, thanks for the note
[14:24:45] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: add autoscaling to en/wikidata for goodfaith [deployment-charts] - https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[14:24:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[14:24:54] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[14:24:59] <jinxer-wm>	 (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[14:25:00] <wikibugs>	 (PS4) Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576)
[14:25:04] <jinxer-wm>	 (ThanosCompactIsDown) resolved: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[14:25:18] <wikibugs>	 (PS2) EoghanGaffney: Sends the otrs.Daemon.pl log messages to kafka/logstash [puppet] - https://gerrit.wikimedia.org/r/887321 (https://phabricator.wikimedia.org/T321759)
[14:25:25] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:25:34] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[14:25:35] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: (2) Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration  - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:25:45] <jinxer-wm>	 (JobUnavailable) resolved: (25) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:26:37] <claime>	 !log depooled appserver, api_appserver, jobrunner, parsoid - T327925
[14:26:39] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.28:443, 10.2.1.22:443, 10.2.1.1:443, 10.2.1.26:443]) https://wikitech.wikimedia.org/wiki/PyBal
[14:26:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:40] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[14:27:07] <wikibugs>	 (CR) CI reject: [V: -1] Sends the otrs.Daemon.pl log messages to kafka/logstash [puppet] - https://gerrit.wikimedia.org/r/887321 (https://phabricator.wikimedia.org/T321759) (owner: EoghanGaffney)
[14:27:54] <wikibugs>	 (PS3) EoghanGaffney: Sends the otrs.Daemon.pl log messages to kafka/logstash [puppet] - https://gerrit.wikimedia.org/r/887321 (https://phabricator.wikimedia.org/T321759)
[14:27:59] <jbond>	 !log enable puppet in codfw, ulsfo, esams post switch upgrade T327925
[14:28:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:28] <jinxer-wm>	 (ProbeDown) firing: (5) Service api-https:443 has failed probes (http_api-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:28:34] <jinxer-wm>	 (virtual-chassis crash) resolved: Device asw-a-codfw.mgmt.codfw.wmnet recovered from virtual-chassis crash   - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash
[14:28:38] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: (2) rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:28:41] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:28:41] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:28:45] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:28:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[14:29:04] <jinxer-wm>	 (ProbeDown) firing: (7) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:29:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[14:30:27] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:20] <Emperor>	 !log pool ms-fe2009 (codfw as a whole still depooled) T327925
[14:32:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:23] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[14:34:12] <wikibugs>	 (CR) Jbond: "ready for review" [cookbooks] - https://gerrit.wikimedia.org/r/885864 (https://phabricator.wikimedia.org/T328593) (owner: Jbond)
[14:34:33] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/887321 (https://phabricator.wikimedia.org/T321759) (owner: EoghanGaffney)
[14:35:01] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid
[14:35:04] <wikibugs>	 (CR) Jbond: [C: +1] sre.ganeti.reimage: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[14:35:15] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=appserver
[14:35:28] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=jobrunner
[14:35:43] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002003 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:35:43] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:36:02] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=api_appserver
[14:36:20] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[14:36:25] <claime>	 !log repooled appserver, api_appserver, jobrunner, parsoid - T327925
[14:36:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
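The depool and repool actions logged above can be cross-checked mechanically. The following is a minimal sketch that assumes only the `conftool action : set/pooled=...; selector: ...` line format visible in this log; the regex and the `pooled_states` helper are hypothetical names for illustration, not part of conftool.

```python
import re

# Matches the logmsgbot line format seen in this channel log (an assumption
# based on these lines only, not a documented conftool output format).
CONFTOOL_RE = re.compile(
    r"conftool action : set/pooled=(?P<pooled>yes|no); selector: (?P<selector>\S+)"
)


def pooled_states(lines):
    """Return the last pooled value seen for each selector, in log order."""
    states = {}
    for line in lines:
        match = CONFTOOL_RE.search(line)
        if match:
            states[match.group("selector")] = match.group("pooled")
    return states


sample = [
    "[14:18:46] <logmsgbot> !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=api_appserver",
    "[14:19:35] <logmsgbot> !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=jobrunner",
    "[14:19:45] <logmsgbot> !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=appserver",
    "[14:21:30] <logmsgbot> !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=parsoid",
    "[14:35:01] <logmsgbot> !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid",
    "[14:35:15] <logmsgbot> !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=appserver",
    "[14:35:28] <logmsgbot> !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=jobrunner",
    "[14:36:02] <logmsgbot> !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=api_appserver",
]

final = pooled_states(sample)
# Selectors whose most recent logged action left them depooled.
still_depooled = sorted(sel for sel, value in final.items() if value == "no")
```

An empty `still_depooled` list here would be consistent with the manual "repooled appserver, api_appserver, jobrunner, parsoid" entry, though it reflects only what was logged to the channel, not the live state.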
[14:36:48] <jinxer-wm>	 (ProbeDown) resolved: (4) Service api-https:443 has failed probes (http_api-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:38:51] <wikibugs>	 (PS1) Vgutierrez: Revert "admin_state: depool codfw" [dns] - https://gerrit.wikimedia.org/r/886984
[14:39:04] <wikibugs>	 (PS2) Vgutierrez: Revert "admin_state: depool codfw" [dns] - https://gerrit.wikimedia.org/r/886984 (https://phabricator.wikimedia.org/T327925)
[14:39:04] <jinxer-wm>	 (ProbeDown) resolved: (4) Service api-https:443 has failed probes (http_api-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:40:13] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:40:13] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:40:21] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thanos-fe2001.codfw.wmnet,service=thanos-web
[14:40:33] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=thanos-fe2002.codfw.wmnet,service=thanos-web
[14:41:31] <ottomata>	 claime: following at 10%, please lemme know when okay to deploy :)
[14:41:43] <claime>	 ottomata: will do
[14:42:41] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:42:41] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 177, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:42:43] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:42:43] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:42:46] <sukhe>	 wb doh2001
[14:43:32] <wikibugs>	 (03PS3) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729)
[14:46:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:28] <TheresNoTime>	 o/, noted T329056 just as whatever all that ^ was, related?
[14:46:29] <stashbot>	 T329056: beta-code-update-eqiad: FATAL: java.io.IOException: Unexpected termination of the channel - https://phabricator.wikimedia.org/T329056
[14:46:49] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.discovery.datacenter-route pool all active/active services in codfw: T327925
[14:46:52] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[14:49:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T328825 (10Jclark-ctr) @Marostegui  can drive be swapped as soon as it arrives?  eta is today but unsure what time it will arrive
[14:49:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T328825 (10Marostegui) @Jclark-ctr yes, you can do it whenever you want.
[14:51:19] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:53:15] <moritzm>	 !log adding nfraison to pwstore T328915
[14:53:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:20] <wikibugs>	 10ops-codfw: codfw: one of cr1-cr2 link down - https://phabricator.wikimedia.org/T329059 (10ayounsi) p:05Triage→03High
[14:55:47] <wikibugs>	 (03PS4) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729)
[14:57:11] <wikibugs>	 (03PS1) 10Marostegui: cuc_user_cuc_user_text_T328817.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887324 (https://phabricator.wikimedia.org/T328817)
[14:57:21] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: don't page for thanos-web [puppet] - 10https://gerrit.wikimedia.org/r/887325
[14:58:10] <wikibugs>	 (03CR) 10AikoChou: [C: 03+1] services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey)
[14:59:04] <wikibugs>	 (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886983 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15)
[14:59:05] <marostegui>	 !log dbmaint deploy schema change on s6 T328828
[14:59:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:08] <stashbot>	 T328828: Remove default on cul_reason_id and cul_reason_plaintext_id in the cu_log table on wmf wikis - https://phabricator.wikimedia.org/T328828
[15:00:03] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:43] <urbanecm>	 hi, what's the status of the maintenance please?
[15:00:52] <vgutierrez>	 !log restart pybal in lvs2009 - T327925
[15:00:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:56] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[15:00:57] <volans>	 we're repooling things in codfw
[15:01:05] <marostegui>	 !log dbmaint deploy schema change on s6 T328807
[15:01:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:08] <stashbot>	 T328807: Drop cul_reason from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328807
[15:01:17] <volans>	 it should not take too much longer urbanecm 
[15:01:21] <urbanecm>	 ack
[15:02:41] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:02:59] <wikibugs>	 (03PS5) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729)
[15:03:28] <wikibugs>	 10ops-codfw: codfw: one of cr1-cr2 link down - https://phabricator.wikimedia.org/T329059 (10cmooney) Seems to have gone down last Monday week (Jan 30th) ` Jan 30 17:32:20  re0.cr1-codfw mib2d[31964]: SNMP_TRAP_LINK_DOWN: ifIndex 647, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-1/0/1:1 `  Perhaps some ca...
[15:04:37] <vgutierrez>	 !log restart pybal in lvs2010 - T327925
[15:04:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:08] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: Cathal Mooney Known issue - see T329059 - The acknowledgement expires at: 2023-02-13 10:04:30. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:05:17] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:18] <wikibugs>	 (03CR) 10Ladsgroup: "generally looks good. One note." [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887324 (https://phabricator.wikimedia.org/T328817) (owner: 10Marostegui)
[15:05:26] <marostegui>	 !log dbmaint deploy schema change on s8 T328807 T328828
[15:05:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:30] <stashbot>	 T328828: Remove default on cul_reason_id and cul_reason_plaintext_id in the cu_log table on wmf wikis - https://phabricator.wikimedia.org/T328828
[15:05:38] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: Cathal Mooney Known issue - see T329059 - The acknowledgement expires at: 2023-02-13 10:05:11. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:05:43] <wikibugs>	 (03PS6) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729)
[15:06:07] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:06:10] <wikibugs>	 (03CR) 10Marostegui: cuc_user_cuc_user_text_T328817.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887324 (https://phabricator.wikimedia.org/T328817) (owner: 10Marostegui)
[15:06:36] <wikibugs>	 (03PS2) 10Marostegui: cuc_user_cuc_user_text_T328817.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887324 (https://phabricator.wikimedia.org/T328817)
[15:06:46] <wikibugs>	 (03CR) 10Marostegui: cuc_user_cuc_user_text_T328817.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887324 (https://phabricator.wikimedia.org/T328817) (owner: 10Marostegui)
[15:07:56] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter-route (exit_code=0) pool all active/active services in codfw: T327925
[15:07:59] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[15:08:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet
[15:09:02] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.discovery.service-route depool restbase-async in eqiad: T327925
[15:09:03] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.dns.wipe-cache restbase-async.discovery.wmnet on all recursors
[15:09:07] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase-async.discovery.wmnet on all recursors
[15:10:26] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "Thanks!" [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887324 (https://phabricator.wikimedia.org/T328817) (owner: 10Marostegui)
[15:10:40] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] cuc_user_cuc_user_text_T328817.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887324 (https://phabricator.wikimedia.org/T328817) (owner: 10Marostegui)
[15:11:14] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/886984 (https://phabricator.wikimedia.org/T327925) (owner: 10Vgutierrez)
[15:11:20] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/886984 (https://phabricator.wikimedia.org/T327925) (owner: 10Vgutierrez)
[15:11:27] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/886984 (https://phabricator.wikimedia.org/T327925) (owner: 10Vgutierrez)
[15:11:40] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/886984 (https://phabricator.wikimedia.org/T327925) (owner: 10Vgutierrez)
[15:11:50] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/886984 (https://phabricator.wikimedia.org/T327925) (owner: 10Vgutierrez)
[15:11:53] <wikibugs>	 (03PS7) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729)
[15:12:07] <claime>	 ottomata: urbanecm: You should be ok to go
[15:12:18] <urbanecm>	 ty
[15:12:21] <vgutierrez>	 !log repool codfw edge site - T327925
[15:12:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:45] <urbanecm>	 MatmaRex: if you're still around, do you want to go ahead with the deployment?
[15:13:06] <MatmaRex>	 oh
[15:13:08] <wikibugs>	 (03CR) 10Urbanecm: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017) (owner: 10Ottomata)
[15:13:12] <MatmaRex>	 sure. i actually have some time
[15:13:18] <urbanecm>	 ottomata: see my comment on the config patch please
[15:13:30] <MatmaRex>	 i was about to reschedule it. thanks :)
[15:13:37] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2045.codfw.wmnet with OS bullseye
[15:13:53] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1043.eqiad.wmnet with OS bullseye
[15:13:55] <urbanecm>	 there's nothing officially scheduled in the calendar, so it should be fine to go ahead
[15:14:00] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Don't add custom attributes in unwrapParsoidSections() [extensions/DiscussionTools] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886980 (https://phabricator.wikimedia.org/T328268) (owner: 10Bartosz Dziewoński)
[15:14:02] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Pin PHPUnit to 9.5.x [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886981 (https://phabricator.wikimedia.org/T328741) (owner: 10Bartosz Dziewoński)
[15:14:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet
[15:14:06] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in eqiad: T327925
[15:14:09] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[15:14:18] <wikibugs>	 (03PS2) 10Urbanecm: Add "Page Frame" to DiscussionTools beta feature on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886997 (https://phabricator.wikimedia.org/T327456) (owner: 10Bartosz Dziewoński)
[15:14:23] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add "Page Frame" to DiscussionTools beta feature on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886997 (https://phabricator.wikimedia.org/T327456) (owner: 10Bartosz Dziewoński)
[15:14:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff)
[15:15:01] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886997 (https://phabricator.wikimedia.org/T327456) (owner: 10Bartosz Dziewoński)
[15:15:06] <wikibugs>	 (03Merged) 10jenkins-bot: Add "Page Frame" to DiscussionTools beta feature on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886997 (https://phabricator.wikimedia.org/T327456) (owner: 10Bartosz Dziewoński)
[15:15:37] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:886997|Add "Page Frame" to DiscussionTools beta feature on enwiki (T327456)]]
[15:15:40] <stashbot>	 T327456: [Config Change] Add Page Frame to beta feature at Phase 2 wikis (desktop) - https://phabricator.wikimedia.org/T327456
[15:15:43] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:49] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1003.eqiad.wmnet
[15:17:32] <logmsgbot>	 !log urbanecm@deploy1002 matmarex and urbanecm: Backport for [[gerrit:886997|Add "Page Frame" to DiscussionTools beta feature on enwiki (T327456)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[15:17:44] <urbanecm>	 MatmaRex: please test at mwdebug1001 and let me know how it goes :)
[15:18:39] <MatmaRex>	 looking
[15:19:45] <MatmaRex>	 urbanecm: works as expected
[15:20:11] <urbanecm>	 ty, syncing
[15:20:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1003.eqiad.wmnet
[15:22:19] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:22:26] <ottomata>	 urbanecm:  looking
[15:22:43] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:22:58] <wikibugs>	 (03PS8) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729)
[15:23:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43760 and previous config saved to /var/cache/conftool/dbconfig/20230207-152337-root.json
[15:25:35] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1043.eqiad.wmnet with reason: host reimage
[15:25:42] <wikibugs>	 (03PS9) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729)
[15:25:49] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede)
[15:26:12] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] "Oy." [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886981 (https://phabricator.wikimedia.org/T328741) (owner: 10Bartosz Dziewoński)
[15:26:16] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:886997|Add "Page Frame" to DiscussionTools beta feature on enwiki (T327456)]] (duration: 10m 39s)
[15:26:19] <stashbot>	 T327456: [Config Change] Add Page Frame to beta feature at Phase 2 wikis (desktop) - https://phabricator.wikimedia.org/T327456
[15:26:26] <urbanecm>	 MatmaRex: patch should be live (backport waiting on CI)
[15:27:35] <wikibugs>	 (03PS10) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729)
[15:28:03] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1043.eqiad.wmnet with reason: host reimage
[15:28:30] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] profile::statistics::explorer::ml: add opencl packages [puppet] - 10https://gerrit.wikimedia.org/r/887285 (owner: 10Elukey)
[15:28:54] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Clement_Goubert)
[15:29:07] <wikibugs>	 (03Abandoned) 10Jdrewniak: Account for temporary row in grid template row [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883618 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak)
[15:29:15] <wikibugs>	 (03Abandoned) 10Jdrewniak: Account for temporary row in grid template row [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883619 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak)
[15:29:23] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/885864 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond)
[15:29:27] <wikibugs>	 (03PS1) 10Jelto: install_server: add custom partman config for gitlab-runner [puppet] - 10https://gerrit.wikimedia.org/r/887330 (https://phabricator.wikimedia.org/T329035)
[15:29:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] install_server: add custom partman config for gitlab-runner [puppet] - 10https://gerrit.wikimedia.org/r/887330 (https://phabricator.wikimedia.org/T329035) (owner: 10Jelto)
[15:29:45] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2045.codfw.wmnet with reason: host reimage
[15:29:51] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Clement_Goubert)
[15:30:16] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Clement_Goubert)
[15:30:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:40] <wikibugs>	 (03Merged) 10jenkins-bot: Pin PHPUnit to 9.5.x [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886981 (https://phabricator.wikimedia.org/T328741) (owner: 10Bartosz Dziewoński)
[15:30:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: move version check to earlier [cookbooks] - 10https://gerrit.wikimedia.org/r/885864 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond)
[15:30:44] <wikibugs>	 (03Merged) 10jenkins-bot: Don't add custom attributes in unwrapParsoidSections() [extensions/DiscussionTools] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886980 (https://phabricator.wikimedia.org/T328268) (owner: 10Bartosz Dziewoński)
[15:30:59] <wikibugs>	 (03PS6) 10Ottomata: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017)
[15:31:19] <wikibugs>	 (03CR) 10Ottomata: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017) (owner: 10Ottomata)
[15:32:00] <wikibugs>	 (03PS2) 10Jelto: install_server: add custom partman config for gitlab-runner [puppet] - 10https://gerrit.wikimedia.org/r/887330 (https://phabricator.wikimedia.org/T329035)
[15:32:50] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2045.codfw.wmnet with reason: host reimage
[15:33:04] <icinga-wm>	 PROBLEM - MariaDB Replica IO: es5 on es2024 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1593, Errmsg: Fatal error: Failed to run after_read_event hook https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:33:36] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:33:47] <jynus>	 ^ marostegui
[15:33:48] <urbanecm>	 MatmaRex: your backport is at mwdebug1001 now, can you check please?
[15:33:58] <urbanecm>	 not using scap backport this time because of T323277
[15:33:59] <stashbot>	 T323277: scap backport: Multiple changes found for Ifb0316256bdec5008acc48544ddd3e2bf71b6d41 - https://phabricator.wikimedia.org/T323277
[15:34:17] <marostegui>	 checking 
[15:34:27] <MatmaRex>	 looking
[15:34:35] <jynus>	 same thing that happened to es2020 last time
[15:36:08] <icinga-wm>	 RECOVERY - MariaDB Replica IO: es5 on es2024 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:36:18] <wikibugs>	 10SRE, 10ops-codfw: codfw: one of cr1-cr2 link down - https://phabricator.wikimedia.org/T329059 (10Papaul) Any reason why it went down last Monday and we are just seeing it now today after a week?
[15:36:51] <marostegui>	 Looks related to semi sync from what I can see
[15:36:54] <marostegui>	 It is fixed now though
[15:37:38] <MatmaRex>	 urbanecm: looks good. sorry about the delay
[15:37:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi)
[15:38:18] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert)
[15:38:22] <ottomata>	 urbanecm: fixed my config patch. also... i added one more eventbus patch to deploy, if that is okay.
[15:38:28] <ottomata>	 was just reported to me and i think it's unbreak now.
[15:38:32] <ottomata>	 unrelated to the other two patches.
[15:38:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43761 and previous config saved to /var/cache/conftool/dbconfig/20230207-153842-root.json
[15:38:55] <urbanecm>	 ottomata: ack, thanks
[15:38:59] <urbanecm>	 MatmaRex: no problem, syncing
[15:39:29] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ayounsi) 05Open→03Resolved a:03ayounsi The upgrade was smooth, ~15min hard downtime. No user impact, all the depools did their job. There was some...
[15:39:38] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: 20a79c55b7073e791e297a5389fa66819f596178: Don't add custom attributes in unwrapParsoidSections() (T328268)
[15:39:41] <stashbot>	 T328268: Dirty diffs in headings in edits made with reply tool - https://phabricator.wikimedia.org/T328268
[15:40:06] <wikibugs>	 10SRE, 10ops-codfw: codfw: one of cr1-cr2 link down - https://phabricator.wikimedia.org/T329059 (10ayounsi) We're not looking at Icinga often enough :)
[15:40:37] <wikibugs>	 10SRE, 10ops-codfw: codfw: one of cr1-cr2 link down - https://phabricator.wikimedia.org/T329059 (10Papaul) make sense
[15:40:40] <urbanecm>	 ottomata: eh, i accidentally merged the other patch in master :D. hopefully we don't break anything with it
[15:41:04] <wikibugs>	 (03PS1) 10Urbanecm: Restore mediawiki.page-undelete hook [extensions/EventBus] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/886985 (https://phabricator.wikimedia.org/T329064)
[15:41:17] <wikibugs>	 (03PS1) 10Urbanecm: Restore mediawiki.page-undelete hook [extensions/EventBus] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887346 (https://phabricator.wikimedia.org/T329064)
[15:41:27] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Restore mediawiki.page-undelete hook [extensions/EventBus] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/886985 (https://phabricator.wikimedia.org/T329064) (owner: 10Urbanecm)
[15:41:31] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Restore mediawiki.page-undelete hook [extensions/EventBus] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887346 (https://phabricator.wikimedia.org/T329064) (owner: 10Urbanecm)
[15:41:53] <ottomata>	 urbanecm:  none of these patches should break anything
[15:42:05] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: Add rollback() method and improve logging (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[15:42:06] <ottomata>	 the latest one i added will fix something that I accidentally broke back in october. :(
[15:42:11] <wikibugs>	 (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886983 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15)
[15:42:12] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (10Clement_Goubert)
[15:42:20] <urbanecm>	 :(
[15:42:34] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (10Clement_Goubert) p:05Triage→03High
[15:42:56] <wikibugs>	 (03PS1) 10Urbanecm: Finalize mediawiki/page/change schema at 1.0.0 [extensions/EventBus] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887347 (https://phabricator.wikimedia.org/T308017)
[15:43:06] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1043.eqiad.wmnet with OS bullseye
[15:43:08] <wikibugs>	 (03PS1) 10Urbanecm: Finalize mediawiki/page/change schema at 1.0.0 [extensions/EventBus] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887348 (https://phabricator.wikimedia.org/T308017)
[15:45:22] <wikibugs>	 (03Merged) 10jenkins-bot: sre.gitlab.upgrade: Add rollback() method and improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto)
[15:46:02] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:13] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: 20a79c55b7073e791e297a5389fa66819f596178: Don't add custom attributes in unwrapParsoidSections() (T328268) (duration: 07m 34s)
[15:47:17] <stashbot>	 T328268: Dirty diffs in headings in edits made with reply tool - https://phabricator.wikimedia.org/T328268
[15:47:35] <urbanecm>	 MatmaRex: backport's live
[15:47:43] <MatmaRex>	 thanks
[15:47:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/EventBus] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/886985 (https://phabricator.wikimedia.org/T329064) (owner: 10Urbanecm)
[15:47:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/EventBus] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887346 (https://phabricator.wikimedia.org/T329064) (owner: 10Urbanecm)
[15:47:59] <urbanecm>	 np
[15:48:33] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2045.codfw.wmnet with OS bullseye
[15:50:01] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10colewhite)
[15:51:48] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:52:44] <wikibugs>	 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform Value Stream, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10colewhite)
[15:52:49] <wikibugs>	 (03PS6) 10EoghanGaffney: Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642)
[15:53:42] <moritzm>	 !log installing tiff security updates
[15:53:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43762 and previous config saved to /var/cache/conftool/dbconfig/20230207-155347-root.json
[15:54:03] <wikibugs>	 (03PS2) 10Jcrespo: Revert "dbbackups: Delay codfw es (db content) backups by one day" [puppet] - 10https://gerrit.wikimedia.org/r/886812 (https://phabricator.wikimedia.org/T327925)
[15:54:33] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Revert "dbbackups: Delay codfw es (db content) backups by one day" [puppet] - 10https://gerrit.wikimedia.org/r/886812 (https://phabricator.wikimedia.org/T327925) (owner: 10Jcrespo)
[15:55:12] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642) (owner: 10EoghanGaffney)
[15:55:41] <wikibugs>	 (03PS3) 10Jelto: install_server: add custom partman config for gitlab-runner [puppet] - 10https://gerrit.wikimedia.org/r/887330 (https://phabricator.wikimedia.org/T329035)
[15:56:43] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott)
[15:57:20] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:57:40] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:57:45] <wikibugs>	 10SRE, 10ops-codfw: codfw: one of cr1-cr2 link down - https://phabricator.wikimedia.org/T329059 (10Papaul) 05Open→03Resolved cable unplugged `` papaul@re0.cr2-codfw> show interfaces terse xe-1/0/1:2 Interface               Admin Link Proto    Local                 Remote xe-1/0/1:2              up    up xe...
[15:59:50] <wikibugs>	 (03Merged) 10jenkins-bot: Restore mediawiki.page-undelete hook [extensions/EventBus] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/886985 (https://phabricator.wikimedia.org/T329064) (owner: 10Urbanecm)
[15:59:52] <wikibugs>	 (03Merged) 10jenkins-bot: Restore mediawiki.page-undelete hook [extensions/EventBus] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887346 (https://phabricator.wikimedia.org/T329064) (owner: 10Urbanecm)
[16:00:22] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:886985|Restore mediawiki.page-undelete hook (T329064)]], [[gerrit:887346|Restore mediawiki.page-undelete hook (T329064)]]
[16:00:26] <stashbot>	 T329064: mediawiki.page-undelete stream is empty - https://phabricator.wikimedia.org/T329064
[16:02:13] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:886985|Restore mediawiki.page-undelete hook (T329064)]], [[gerrit:887346|Restore mediawiki.page-undelete hook (T329064)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[16:02:31] <urbanecm>	 ottomata: can you please check the page-undelete hook at mwdebug, if possible?
[16:03:30] <wikibugs>	 10SRE, 10DNS, 10Infrastructure-Foundations, 10Mail, and 4 others: Add SPF records for gitlab.wikimedia.org - https://phabricator.wikimedia.org/T328642 (10eoghan) 05Open→03Resolved I've deployed the softfail records and checked that they're in place:  ` ❯ for i in 0 1 2; do    ns=ns${i}.wikimedia.org...
[16:03:44] <wikibugs>	 (03PS34) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040)
[16:04:04] <wikibugs>	 (03PS53) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[16:04:06] <wikibugs>	 (03PS14) 10Raymond Ndibe: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 (https://phabricator.wikimedia.org/T304040) (owner: 10David Caro)
[16:07:12] <ottomata>	 urbanecm:  i can try... i need to be able to undelete a page to do that, attempting...
[16:07:14] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul)
[16:07:42] <wikibugs>	 (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[16:08:25] <ottomata>	 urbanecm:  i think i don't have permissions on testwiki to delete and undelete, do you?
[16:08:29] <icinga-wm>	 PROBLEM - Kafka Broker Server #page on kafka-logging2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[16:08:41] * volans here
[16:08:44] <urbanecm>	 ottomata: yes. i can also give you sysop permissions at test.wikipedia if you tell me your username
[16:08:48] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-logging2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[16:08:50] <ottomata>	 Ottomata
[16:08:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43763 and previous config saved to /var/cache/conftool/dbconfig/20230207-160852-root.json
[16:09:00] <volans>	 acked page
[16:09:12] <icinga-wm>	 PROBLEM - Check systemd state on kafka-logging2001 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:09:21] * jhathaway here as well
[16:09:27] <Emperor>	 volans: need any assistance?
[16:09:27] <rzl>	 here
[16:09:43] <volans>	 kafka.service: Main process exited, code=exited, status=143/n/a
[16:09:49] <urbanecm>	 ottomata: done
[16:09:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:10:37] <ottomata>	 it works urbanecm thank you.
[16:10:39] <volans>	 something stopped kafka?
[16:10:39] <volans>	 Feb  7 12:14:03 kafka-logging2001 systemd[1]: Stopping Kafka Broker...
[16:10:44] <ottomata>	 proceed with deploy
[16:10:47] <urbanecm>	 thanks
[16:10:58] <urbanecm>	 before i proceed, i see something broke :-/
[16:11:03] <ottomata>	 oh
[16:11:04] <ottomata>	 ?
[16:11:05] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[16:11:24] <moritzm>	 volans: it has puppet stopped with a message by Keith
[16:11:25] <jhathaway>	 volans: yeah looks like it
[16:11:29] <jhathaway>	 ah!
[16:11:33] <moritzm>	 this is just expired downtime I expect
[16:11:41] <volans>	 herron: ^^^^
[16:11:48] <volans>	 is that WIP on your side?
[16:11:49] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 (https://phabricator.wikimedia.org/T304040) (owner: 10David Caro)
[16:11:51] <rzl>	 moritzm: good catch, thanks
[16:11:52] <moritzm>	 "switch maintenance, kafka stopped --herron"
[16:12:03] <wikibugs>	 (03CR) 10David Caro: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 (https://phabricator.wikimedia.org/T304040) (owner: 10David Caro)
[16:12:14] <urbanecm>	 to SREs: i've a mediawiki deploy in progress, is it okay to let it finish?
[16:12:19] <volans>	 urbanecm: yes
[16:12:21] <urbanecm>	 ok
[16:12:23] <urbanecm>	 thanks
[16:12:26] <urbanecm>	 ottomata: proceeding
[16:13:00] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2003 is CRITICAL: 78 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2003
[16:13:24] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2002 is CRITICAL: 80 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2002
[16:14:02] <volans>	 the procedure in the task says
[16:14:02] <volans>	 start kafka service, confirm kafka logging dashboard returns green
[16:14:05] <volans>	 asking in o11y
[16:14:50] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1045.eqiad.wmnet with OS bullseye
[16:14:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:15:06] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2046.codfw.wmnet with OS bullseye
[16:15:46] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:04] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Finalize mediawiki/page/change schema at 1.0.0 [extensions/EventBus] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887347 (https://phabricator.wikimedia.org/T308017) (owner: 10Urbanecm)
[16:16:06] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Finalize mediawiki/page/change schema at 1.0.0 [extensions/EventBus] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887348 (https://phabricator.wikimedia.org/T308017) (owner: 10Urbanecm)
[16:17:14] <wikibugs>	 (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886983 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15)
[16:17:22] <wikibugs>	 (03PS9) 10Superpes15: Change the trwiki logo with a temporary one (old vector) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886983 (https://phabricator.wikimedia.org/T329047)
[16:17:51] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 (https://phabricator.wikimedia.org/T304040) (owner: 10David Caro)
[16:18:07] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:886985|Restore mediawiki.page-undelete hook (T329064)]], [[gerrit:887346|Restore mediawiki.page-undelete hook (T329064)]] (duration: 17m 44s)
[16:18:10] <stashbot>	 T329064: mediawiki.page-undelete stream is empty - https://phabricator.wikimedia.org/T329064
[16:18:26] <urbanecm>	 ottomata: first patch's live everywhere now
[16:18:32] <urbanecm>	 waiting on CI for the other one
[16:18:37] <wikibugs>	 10ops-codfw, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Jhancock.wm) We did some more troubleshooting and it looks like the slot for DIMM_B4 is bad. This may need a MB replacement to fully fix.
[16:18:55] <ottomata>	 awesome thank you, i just saw another undelete come through on sr.wikipedia.org, so it's working everywhere
[16:20:06] <wikibugs>	 (03PS11) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729)
[16:20:08] <wikibugs>	 (03PS1) 10Andrew Bogott: cinder-volume init module: move SPDX header [puppet] - 10https://gerrit.wikimedia.org/r/887341
[16:20:58] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi)
[16:22:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi)
[16:23:46] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cinder-volume init module: move SPDX header [puppet] - 10https://gerrit.wikimedia.org/r/887341 (owner: 10Andrew Bogott)
[16:23:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43764 and previous config saved to /var/cache/conftool/dbconfig/20230207-162357-root.json
[16:24:06] <wikibugs>	 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10akosiaris) We 've discussed this internally within the team. **We realize that it's not possible to exclude wikitech from the s...
[16:24:17] <icinga-wm>	 RECOVERY - Kafka Broker Server #page on kafka-logging2001 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[16:24:36] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-logging2001 is OK: SSL OK - Certificate kafka-logging2001.codfw.wmnet valid until 2023-09-12 07:55:00 +0000 (expires in 216 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[16:24:58] <icinga-wm>	 RECOVERY - Check systemd state on kafka-logging2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:25:07] <herron>	 now to make single broker down stop paging 🤨
[16:25:16] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2003
[16:25:35] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[16:25:42] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2002
[16:26:31] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1045.eqiad.wmnet with reason: host reimage
[16:26:39] <wikibugs>	 (03PS7) 10Urbanecm: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017) (owner: 10Ottomata)
[16:27:28] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Check console cable for asw-a2-codfw - https://phabricator.wikimedia.org/T329055 (10Papaul) 05Open→03Resolved a:03Papaul The port was moved on the console server from port 18 to port 41 some days back when we did have some issues but I never...
[16:28:40] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017) (owner: 10Ottomata)
[16:30:05] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 8 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10colewhite)
[16:30:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:39] <wikibugs>	 (03PS1) 10Herron: kafka-logging: don't page on individual broker down [puppet] - 10https://gerrit.wikimedia.org/r/887342
[16:31:10] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1045.eqiad.wmnet with reason: host reimage
[16:31:20] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2046.codfw.wmnet with reason: host reimage
[16:31:52] <wikibugs>	 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10akosiaris) @nskaggs, @bd808 (feel free to add others), let me know what you think.
[16:32:47] <wikibugs>	 (03Merged) 10jenkins-bot: Finalize mediawiki/page/change schema at 1.0.0 [extensions/EventBus] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887347 (https://phabricator.wikimedia.org/T308017) (owner: 10Urbanecm)
[16:32:49] <wikibugs>	 (03Merged) 10jenkins-bot: Finalize mediawiki/page/change schema at 1.0.0 [extensions/EventBus] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887348 (https://phabricator.wikimedia.org/T308017) (owner: 10Urbanecm)
[16:32:53] <urbanecm>	 finally 
[16:33:15] <wikibugs>	 (03CR) 10Herron: [C: 03+1] hieradata: don't page for thanos-web [puppet] - 10https://gerrit.wikimedia.org/r/887325 (owner: 10Filippo Giunchedi)
[16:33:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:34:09] <urbanecm>	 ottomata: pulled to mwdebug1001, can you test it there please?
[16:34:19] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2046.codfw.wmnet with reason: host reimage
[16:34:19] <wikibugs>	 (03PS3) 10Jforrester: Move non-variant wgMFNearby to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833770
[16:34:23] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: extract tcp flags from ulogd logs [puppet] - 10https://gerrit.wikimedia.org/r/884111 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite)
[16:34:44] <ottomata>	 urbanecm: both the config change and the eventbus change?
[16:34:50] <urbanecm>	 that is correct
[16:35:03] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Jclark-ctr) ms-fe1013                   D2 U35    PORT13  4902  ms-fe1014                   F1  U38.   PORT  20220049  thanos-fe1004            F1  U3...
[16:35:12] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Jclark-ctr)
[16:35:42] <ottomata>	 i think...i can't test this unless the config change is on meta.  hm.  
[16:36:39] <urbanecm>	 ottomata: wdym? the config change is at mwdebug1001 meta
[16:36:50] <urbanecm>	 does it need to be at production meta for some reason?
[16:38:36] <wikibugs>	 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10bd808) >>! In T328768#8594394, @akosiaris wrote: > @nskaggs, @bd808 (feel free to add others), let me know what you think.  Opt...
[16:38:50] <wikibugs>	 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10jcrespo) Please apologies if I am wrong, which I am probably am, but...  > Wikitech read requests will flow to eqiad, and write...
[16:39:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43765 and previous config saved to /var/cache/conftool/dbconfig/20230207-163902-root.json
[16:39:03] <ottomata>	 i can trigger the event production from testwiki
[16:39:19] <ottomata>	 but, it gets produced to an eventgate instance, which looks up global stream config from metawiki
[16:40:39] <urbanecm>	 ah
[16:40:45] <ottomata>	 i could maybe tail eventgate logs and see it attempt to receive the new eventstream and error
[16:40:47] <urbanecm>	 so then i think we need to sync, and hope
[16:40:51] <urbanecm>	 or that
[16:40:59] <urbanecm>	 your call :)
[16:41:00] <ottomata>	 let's just sync, this stream is non production, nothing will break
[16:41:03] <urbanecm>	 okay
[16:41:05] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: 58f4d877: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change (T308017), 854ff4ac: Finalize mediawiki/page/change schema at 1.0.0 (T308017)
[16:41:07] <urbanecm>	 sync started
[16:41:09] <stashbot>	 T308017: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017
[16:41:11] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/887325 (owner: 10Filippo Giunchedi)
[16:46:38] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1045.eqiad.wmnet with OS bullseye
[16:48:37] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: 58f4d877: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change (T308017), 854ff4ac: Finalize mediawiki/page/change schema at 1.0.0 (T308017) (duration: 07m 32s)
[16:48:40] <stashbot>	 T308017: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017
[16:48:46] <urbanecm>	 ottomata: and live
[16:48:50] <urbanecm>	 and i think we're done now
[16:49:05] <ottomata>	 ok checking!
[16:49:46] <ottomata>	 it works!  Thank you urbanecm 
[16:49:58] <urbanecm>	 awesome
[16:50:04] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2046.codfw.wmnet with OS bullseye
[16:50:18] <wikibugs>	 10SRE, 10Observability-Alerting, 10observability: alertmanager silence confirmation page links to localhost - https://phabricator.wikimedia.org/T328869 (10colewhite) Hmm... It appears there is a silence management UI in AlertManager, but the supporting UI code is not deployed with the deb package.  In additi...
[16:51:21] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/884111 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite)
[16:52:50] <wikibugs>	 (03PS12) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729)
[16:52:52] <wikibugs>	 (03PS1) 10Andrew Bogott: cinder-volume.conf: include common oslo-messaging-rabbit section [puppet] - 10https://gerrit.wikimedia.org/r/887368 (https://phabricator.wikimedia.org/T324729)
[16:53:44] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cinder-volume.conf: include common oslo-messaging-rabbit section [puppet] - 10https://gerrit.wikimedia.org/r/887368 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott)
[16:53:46] <ottomata>	 urbanecm: thank you so much for doing that outside of the window. really appreciate it!
[16:53:52] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/887342 (owner: 10Herron)
[16:58:40] <urbanecm>	 glad i could be helpful
[16:59:42] <wikibugs>	 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10akosiaris) > How is this possible, if there are no codfw app servers serving wikitech? As I understand (and hopefully I am the...
[17:00:05] <jouncebot>	 jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1700).
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:33] <rzl>	 jouncebot: you're too late, I'm already taking the moon
[17:00:43] <wikibugs>	 10SRE, 10ops-drmrs, 10Infrastructure-Foundations, 10netops: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (10RobH) 05In progress→03Resolved remote hands successfully removed the optic this AM and placed it in our racks, we'll just have it thrown away next remote hands work...
[17:00:52] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:03:19] <wikibugs>	 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10jcrespo) >>! In T328768#8594700, @akosiaris wrote: >> How is this possible, if there are no codfw app servers serving wikitech?...
[17:06:02] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:06:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10BTullis) Hi @jbond and @MoritzMuehlenhoff - how are things looking with regard to this OIDC support?  We would still like to be able to {T305874} using idp because the LDAP...
[17:06:35] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10BTullis)
[17:06:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Upgrade IDPs to CAS 6.6/Bullseye and enable webauthn - https://phabricator.wikimedia.org/T305518 (10BTullis)
[17:07:57] <wikibugs>	 (03PS1) 10Raymond Ndibe: puppet: Add ::profile::wmcs::services::toolsdb_replica_cnf to role::wmcs::nfs::primary [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663)
[17:08:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet: Add ::profile::wmcs::services::toolsdb_replica_cnf to role::wmcs::nfs::primary [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe)
[17:09:37] <wikibugs>	 (03PS2) 10Raymond Ndibe: puppet: Add ::profile::wmcs::services::toolsdb_replica_cnf to role::wmcs::nfs::primary [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663)
[17:09:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet: Add ::profile::wmcs::services::toolsdb_replica_cnf to role::wmcs::nfs::primary [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe)
[17:12:30] <wikibugs>	 (03PS3) 10Raymond Ndibe: puppet: modify role::wmcs::nfs::primary for replica_cnf api [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663)
[17:13:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10jbond) @BTullis OIDC support is now possible and is being tried out by the new IDM.  It should be to a state where you can start using it and happy to help out/provide more...
[17:15:47] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "I see a couple of issues here:" [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe)
[17:15:57] <wikibugs>	 (03PS2) 10Jbond: sre.hardware.upgrade-firmware: move version check to earlier [cookbooks] - 10https://gerrit.wikimedia.org/r/885864 (https://phabricator.wikimedia.org/T328593)
[17:17:54] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:19:24] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: consolidate extra floating IP routes [puppet] - 10https://gerrit.wikimedia.org/r/887372 (https://phabricator.wikimedia.org/T329041)
[17:19:26] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: introduce additional route for VIPs [puppet] - 10https://gerrit.wikimedia.org/r/887373 (https://phabricator.wikimedia.org/T295774)
[17:21:34] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2047.codfw.wmnet with OS bullseye
[17:22:13] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1046.eqiad.wmnet with OS bullseye
[17:22:48] <wikibugs>	 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10jcrespo) No blocker on my side, then. Supporting path 5 (security worried me more than performance).
[17:28:00] <wikibugs>	 (03PS1) 10Jdlrobson: [followup] mediawiki.feedlink: Atom's link icon overlaps the link [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887350 (https://phabricator.wikimedia.org/T327717)
[17:31:10] <jynus>	 [a3e0ac52-9983-4e72-b47e-d455e43bd181] 2023-02-07 17:30:13: Fatal exception of type "Wikimedia\RequestTimeout\RequestTimeoutException" on frwiki, potentially because of template vandalism?
[17:31:31] <jynus>	 rzl, jhathaway^
[17:31:40] <ma>	 jynus: see -security
[17:31:49] <jhathaway>	 jynus: thanks
[17:34:01] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1046.eqiad.wmnet with reason: host reimage
[17:35:13] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] Cleanup: Drop pre-python3.7 support [software/httpbb] - 10https://gerrit.wikimedia.org/r/886203 (owner: 10RLazarus)
[17:36:52] <wikibugs>	 (03Merged) 10jenkins-bot: Cleanup: Drop pre-python3.7 support [software/httpbb] - 10https://gerrit.wikimedia.org/r/886203 (owner: 10RLazarus)
[17:37:03] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1046.eqiad.wmnet with reason: host reimage
[17:37:44] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2047.codfw.wmnet with reason: host reimage
[17:38:06] <wikibugs>	 (03CR) 10Mabualruz: "is this a duplicate of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/886852 or is it for the train?" [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887350 (https://phabricator.wikimedia.org/T327717) (owner: 10Jdlrobson)
[17:40:42] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2047.codfw.wmnet with reason: host reimage
[17:45:53] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:49:24] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "This is now ready to go I think? The mediawiki-config patch has gone out meanwhile, which removes these old schemas from the EventLogging " [puppet] - 10https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T281103) (owner: 10Phedenskog)
[17:51:53] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:51:56] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1046.eqiad.wmnet with OS bullseye
[17:53:28] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2047.codfw.wmnet with OS bullseye
[17:55:48] <inflatador>	 !log bking@cumin1001 repooling elastic and wdqs hosts post-maintenance T327925
[17:55:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:51] <stashbot>	 T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1800)
[18:00:08] <wikibugs>	 (03PS1) 10Jdlrobson: Remove button styling from log in link [skins/Vector] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887351 (https://phabricator.wikimedia.org/T289212)
[18:00:45] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:02:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.remove-downtime for 13 hosts
[18:02:32] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 13 hosts
[18:05:39] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:07:22] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventlogging: Remove obsoleted navtiming schemas [puppet] - 10https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T281103) (owner: 10Phedenskog)
[18:09:45] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:17:25] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1047.eqiad.wmnet with OS bullseye
[18:18:04] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2048.codfw.wmnet with OS bullseye
[18:23:06] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "yes, the old recipes for ganeti VMs should be removed and the idea is very reasonable. I won't pretend I can actually review (in a testin" [puppet] - 10https://gerrit.wikimedia.org/r/887330 (https://phabricator.wikimedia.org/T329035) (owner: 10Jelto)
[18:25:55] <wikibugs>	 10SRE, 10ops-codfw, 10cloud-services-team, 10decommission-hardware: decommission clouddb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T328079 (10Papaul)
[18:26:58] <wikibugs>	 10SRE, 10ops-codfw, 10cloud-services-team, 10decommission-hardware: decommission clouddb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T328079 (10Papaul) 05Open→03Resolved This is complete
[18:28:17] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott)
[18:29:03] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1047.eqiad.wmnet with reason: host reimage
[18:32:02] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1047.eqiad.wmnet with reason: host reimage
[18:34:26] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2048.codfw.wmnet with reason: host reimage
[18:34:55] <wikibugs>	 (03PS1) 10Andrew Bogott: openstack::cinder::volume: Pass $version down to the config module [puppet] - 10https://gerrit.wikimedia.org/r/887378
[18:37:01] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] openstack::cinder::volume: Pass $version down to the config module [puppet] - 10https://gerrit.wikimedia.org/r/887378 (owner: 10Andrew Bogott)
[18:37:23] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10Papaul) p:05Triage→03Medium a:03cmooney
[18:37:28] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2048.codfw.wmnet with reason: host reimage
[18:40:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) FYI, I tested temporarily connecting xe-0/0/47 to cr2 xe-5/0/0; the link was okay ` papaul@re0.cr2-...
[18:42:38] <wikibugs>	 (03PS1) 10Dzahn: phorge: create role/profile to install LAMP and git clone phorge [puppet] - 10https://gerrit.wikimedia.org/r/887379 (https://phabricator.wikimedia.org/T328595)
[18:43:32] <wikibugs>	 (03PS2) 10Dzahn: phorge: create role/profile to install LAMP and git clone phorge [puppet] - 10https://gerrit.wikimedia.org/r/887379 (https://phabricator.wikimedia.org/T328595)
[18:47:05] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1047.eqiad.wmnet with OS bullseye
[18:50:51] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:51:44] <wikibugs>	 (03PS1) 10Dzahn: add SPDX license headers to various roles I was involved in writing [puppet] - 10https://gerrit.wikimedia.org/r/887382
[18:52:53] <wikibugs>	 (03PS2) 10Dzahn: add SPDX license headers to various roles I was involved in writing [puppet] - 10https://gerrit.wikimedia.org/r/887382
[18:53:35] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2048.codfw.wmnet with OS bullseye
[18:55:02] <wikibugs>	 (03PS3) 10Dzahn: phorge: create role/profile to install LAMP and git clone phorge [puppet] - 10https://gerrit.wikimedia.org/r/887379 (https://phabricator.wikimedia.org/T328595)
[18:57:47] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[18:59:42] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Commons: `Manifestation pour la défense des retraites du 31 janvier 2023 - Flickr - Jeanne Menjoulet.jpg` not found in Commons - https://phabricator.wikimedia.org/T328889 (10Dzahn) @Shizhao Great! Thanks. In case it happens again feel free to just reopen this or make a new ticke...
[19:00:01] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mw2423,25,26,27 DNS - pt1979@cumin2002"
[19:00:05] <jouncebot>	 ^demon and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1900).
[19:00:26] <dancy>	 o/
[19:00:58] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mw2423,25,26,27 DNS - pt1979@cumin2002"
[19:00:58] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:01:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2422.mgmt.codfw.wmnet with reboot policy FORCED
[19:03:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2423.mgmt.codfw.wmnet with reboot policy FORCED
[19:03:53] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2049.codfw.wmnet with OS bullseye
[19:04:09] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:04:13] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1049.eqiad.wmnet with OS bullseye
[19:04:32] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10Dzahn) As far as I can tell nowadays there is no more node that uses multiple roles. Only one role at a time, s...
[19:05:27] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10Dzahn) https://gerrit.wikimedia.org/r/q/topic:%22role-profile%22+(status:open%20OR%20status:merged)
[19:07:44] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phorge: create role/profile to install LAMP and git clone phorge [puppet] - 10https://gerrit.wikimedia.org/r/887379 (https://phabricator.wikimedia.org/T328595) (owner: 10Dzahn)
[19:11:46] <wikibugs>	 (03PS1) 10Andrew Bogott: Add a new role for cloudvirt nodes with a cinder/lvm client. [puppet] - 10https://gerrit.wikimedia.org/r/887390 (https://phabricator.wikimedia.org/T324729)
[19:12:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add a new role for cloudvirt nodes with a cinder/lvm client. [puppet] - 10https://gerrit.wikimedia.org/r/887390 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott)
[19:13:33] <wikibugs>	 (03PS2) 10Andrew Bogott: Add a new role for cloudvirt nodes with a cinder/lvm client. [puppet] - 10https://gerrit.wikimedia.org/r/887390 (https://phabricator.wikimedia.org/T324729)
[19:15:31] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:15:52] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1049.eqiad.wmnet with reason: host reimage
[19:16:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:17:20] <wikibugs>	 (03PS3) 10Andrew Bogott: Add a new role for cloudvirt nodes with a cinder/lvm client. [puppet] - 10https://gerrit.wikimedia.org/r/887390 (https://phabricator.wikimedia.org/T324729)
[19:18:40] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1049.eqiad.wmnet with reason: host reimage
[19:20:03] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2049.codfw.wmnet with reason: host reimage
[19:21:22] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:22:16] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:23:02] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2049.codfw.wmnet with reason: host reimage
[19:25:51] <icinga-wm>	 RECOVERY - Check systemd state on mw2350 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:28:01] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add a new role for cloudvirt nodes with a cinder/lvm client. [puppet] - 10https://gerrit.wikimedia.org/r/887390 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott)
[19:28:36] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T328825 (10Jclark-ctr) @Marostegui replaced failed drive
[19:33:42] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1049.eqiad.wmnet with OS bullseye
[19:39:03] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2049.codfw.wmnet with OS bullseye
[19:40:49] <wikibugs>	 (03CR) 10Bking: [C: 03+2] Create scap deployment source for search airflow v2 [puppet] - 10https://gerrit.wikimedia.org/r/883678 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson)
[19:41:19] <wikibugs>	 (03PS2) 10Bking: Create scap deployment source for search airflow v2 [puppet] - 10https://gerrit.wikimedia.org/r/883678 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson)
[19:41:38] <wikibugs>	 (03PS4) 10Bking: Configure search platform airflow 2 instance [puppet] - 10https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson)
[19:43:18] <wikibugs>	 (03CR) 10Sergio Gimeno: [C: 03+1] labs: Remove unneeded GEUseNewImpactModule feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887271 (owner: 10Kosta Harlan)
[19:44:33] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2422.mgmt.codfw.wmnet with reboot policy FORCED
[19:44:36] <wikibugs>	 (03CR) 10Bking: [V: 03+1] Configure search platform airflow 2 instance [puppet] - 10https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson)
[19:44:37] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2423.mgmt.codfw.wmnet with reboot policy FORCED
[19:44:53] <wikibugs>	 (03CR) 10Bking: [V: 03+1 C: 03+2] Configure search platform airflow 2 instance [puppet] - 10https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson)
[19:45:33] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2422.mgmt.codfw.wmnet with reboot policy FORCED
[19:46:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2423.mgmt.codfw.wmnet with reboot policy FORCED
[19:46:17] <wikibugs>	 (03PS1) 10Dzahn: phorge: list of apache modules needs to be an array [puppet] - 10https://gerrit.wikimedia.org/r/887392 (https://phabricator.wikimedia.org/T328595)
[19:47:13] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887393 (https://phabricator.wikimedia.org/T325585)
[19:47:14] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2422.mgmt.codfw.wmnet with reboot policy FORCED
[19:47:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887393 (https://phabricator.wikimedia.org/T325585) (owner: 10TrainBranchBot)
[19:47:17] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phorge: list of apache modules needs to be an array [puppet] - 10https://gerrit.wikimedia.org/r/887392 (https://phabricator.wikimedia.org/T328595) (owner: 10Dzahn)
[19:47:33] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2423.mgmt.codfw.wmnet with reboot policy FORCED
[19:47:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2424.mgmt.codfw.wmnet with reboot policy FORCED
[19:47:52] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887393 (https://phabricator.wikimedia.org/T325585) (owner: 10TrainBranchBot)
[19:48:21] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2425.mgmt.codfw.wmnet with reboot policy FORCED
[19:52:10] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] labs: Remove unneeded GEUseNewImpactModule feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887271 (owner: 10Kosta Harlan)
[19:53:37] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:53:53] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host an-airflow1005.eqiad.wmnet
[19:53:54] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[19:54:07] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2424.mgmt.codfw.wmnet with reboot policy FORCED
[19:55:15] <logmsgbot>	 !log demon@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.22  refs T325585
[19:55:18] <stashbot>	 T325585: 1.40.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T325585
[19:55:43] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2424.mgmt.codfw.wmnet with reboot policy FORCED
[19:56:38] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM an-airflow1005.eqiad.wmnet - bking@cumin1001"
[19:57:40] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM an-airflow1005.eqiad.wmnet - bking@cumin1001"
[19:57:40] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:57:40] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache an-airflow1005.eqiad.wmnet on all recursors
[19:57:44] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-airflow1005.eqiad.wmnet on all recursors
[19:58:49] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2425.mgmt.codfw.wmnet with reboot policy FORCED
[19:59:04] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2424.mgmt.codfw.wmnet with reboot policy FORCED
[20:00:22] <wikibugs>	 (03PS1) 10Jdlrobson: Disable languages on history page [skins/Vector] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887353 (https://phabricator.wikimedia.org/T328996)
[20:04:15] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2424.mgmt.codfw.wmnet with reboot policy FORCED
[20:08:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2425.mgmt.codfw.wmnet with reboot policy FORCED
[20:09:10] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2424.mgmt.codfw.wmnet with reboot policy FORCED
[20:13:25] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2425.mgmt.codfw.wmnet with reboot policy FORCED
[20:13:41] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul)
[20:15:05] <wikibugs>	 (03PS1) 10Andrew Bogott: Add cinder-volume nodes to cinder grant and fw rules. [puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729)
[20:15:15] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:15:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add cinder-volume nodes to cinder grant and fw rules. [puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott)
[20:17:39] <wikibugs>	 (03PS2) 10Andrew Bogott: Add cinder-volume nodes to cinder grant and fw rules. [puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729)
[20:17:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add cinder-volume nodes to cinder grant and fw rules. [puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott)
[20:20:27] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:20:48] <wikibugs>	 (03PS3) 10Andrew Bogott: Add cinder-volume nodes to cinder grant and fw rules. [puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729)
[20:21:43] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1051.eqiad.wmnet with OS bullseye
[20:21:54] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2050.codfw.wmnet with OS bullseye
[20:22:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add cinder-volume nodes to cinder grant and fw rules. [puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott)
[20:23:45] <wikibugs>	 (03PS1) 10Dzahn: phorge: add minimal apache site, add class parameters for docroot et al [puppet] - 10https://gerrit.wikimedia.org/r/887397 (https://phabricator.wikimedia.org/T328595)
[20:24:50] <wikibugs>	 (03PS2) 10Dzahn: phorge: add minimal apache site, add class parameters for docroot et al [puppet] - 10https://gerrit.wikimedia.org/r/887397 (https://phabricator.wikimedia.org/T328595)
[20:25:45] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:26:19] <wikibugs>	 (03PS4) 10Andrew Bogott: Add cinder-volume nodes to cinder grant and fw rules. [puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729)
[20:27:21] <wikibugs>	 (03PS2) 10Ottomata: Fix android session schema path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 (owner: 10Sharvaniharan)
[20:27:45] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:27:47] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:29:55] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add cinder-volume nodes to cinder grant and fw rules. [puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott)
[20:29:57] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phorge: add minimal apache site, add class parameters for docroot et al [puppet] - 10https://gerrit.wikimedia.org/r/887397 (https://phabricator.wikimedia.org/T328595) (owner: 10Dzahn)
[20:30:38] <wikibugs>	 (03CR) 10Sharvaniharan: Fix android session schema path (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 (owner: 10Sharvaniharan)
[20:33:25] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1051.eqiad.wmnet with reason: host reimage
[20:33:35] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:36:35] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1051.eqiad.wmnet with reason: host reimage
[20:37:47] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10BTullis) @jbond - Many thanks. That's excellent. I think I'd be keen to look at doing that and helping find out the issues. I've asked the #data-engineering team so I'll get back to you in a coup...
[20:38:04] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2050.codfw.wmnet with reason: host reimage
[20:39:51] <wikibugs>	 (03CR) 10Ottomata: Fix android session schema path (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 (owner: 10Sharvaniharan)
[20:40:24] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Fix android session schema path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 (owner: 10Sharvaniharan)
[20:41:10] <wikibugs>	 (03Merged) 10jenkins-bot: Fix android session schema path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 (owner: 10Sharvaniharan)
[20:41:13] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2050.codfw.wmnet with reason: host reimage
[20:41:22] <ottomata>	 dancy: am i clear to deploy a mw config change?
[20:42:14] <dancy>	 I think so.  Train was rolled forward about an hour ago
[20:44:39] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-airflow1005.eqiad.wmnet
[20:45:23] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:46:22] <wikibugs>	 (03PS1) 10Andrew Bogott: cinder-volume: make lvm volume group configurable via hiera [puppet] - 10https://gerrit.wikimedia.org/r/887403 (https://phabricator.wikimedia.org/T324729)
[20:48:45] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:49:09] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:50:01] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1051.eqiad.wmnet with OS bullseye
[20:52:14] <wikibugs>	 (03PS1) 10Bking: search: add MAC entry for new an-airflow VM [puppet] - 10https://gerrit.wikimedia.org/r/887408 (https://phabricator.wikimedia.org/T327970)
[20:53:33] <wikibugs>	 (03PS1) 10Andrea Denisse: netmon: Add netmon1003 to the ganeti rapi nodes list [puppet] - 10https://gerrit.wikimedia.org/r/887409 (https://phabricator.wikimedia.org/T309074)
[20:54:23] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] "1 excess newline but otherwise looks good" [puppet] - 10https://gerrit.wikimedia.org/r/887408 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking)
[20:55:11] <wikibugs>	 (03PS2) 10Bking: search: add MAC entry for new an-airflow VM [puppet] - 10https://gerrit.wikimedia.org/r/887408 (https://phabricator.wikimedia.org/T327970)
[20:55:55] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:57:19] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2050.codfw.wmnet with OS bullseye
[20:57:23] <wikibugs>	 (03PS1) 10Dzahn: phorge: add httpd Directory snippet, git clone arcanist [puppet] - 10https://gerrit.wikimedia.org/r/887410 (https://phabricator.wikimedia.org/T328595)
[20:57:45] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:58:05] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:58:14] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2051.codfw.wmnet with OS bullseye
[20:59:15] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49567 bytes in 1.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:00:02] <wikibugs>	 (03PS1) 10Andrew Bogott: Cinder-volume lvm: clarify backend type/name confusion [puppet] - 10https://gerrit.wikimedia.org/r/887411 (https://phabricator.wikimedia.org/T324729)
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T2100). nyaa~
[21:00:04] <jouncebot>	 Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:00:23] <urbanecm>	 I can deploy today
[21:00:33] <urbanecm>	 Jdlrobson: hi, around?
[21:01:01] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1053.eqiad.wmnet with OS bullseye
[21:02:20] <logmsgbot>	 !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: wgEventStreams - Fix android session schema path (duration: 07m 26s)
[21:03:01] <urbanecm>	 ottomata: pro-tip, use `scap backport 886995` next time. it does everything for you (from merging to gerrit to deployment)
[21:03:04] <urbanecm>	 very convenient :)
[21:03:06] <Jdlrobson>	 urbanecm present
[21:03:18] <urbanecm>	 hi!
[21:03:49] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Disable languages on history page [skins/Vector] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887353 (https://phabricator.wikimedia.org/T328996) (owner: 10Jdlrobson)
[21:03:55] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Remove button styling from log in link [skins/Vector] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887351 (https://phabricator.wikimedia.org/T289212) (owner: 10Jdlrobson)
[21:04:03] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] [followup] mediawiki.feedlink: Atom's link icon overlaps the link [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887350 (https://phabricator.wikimedia.org/T327717) (owner: 10Jdlrobson)
[21:04:52] <Jdlrobson>	 :)
[21:05:03] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:06:44] <wikibugs>	 (CR) Cwhite: "This seems reasonable.  Do we have alarms in place with paging for majority up? (e.g. 2 of 3 down issues a page)" [puppet] - https://gerrit.wikimedia.org/r/887342 (owner: Herron)
[21:07:09] <wikibugs>	 (CR) Cwhite: [C: +1] hieradata: don't page for thanos-web [puppet] - https://gerrit.wikimedia.org/r/887325 (owner: Filippo Giunchedi)
[21:07:26] <wikibugs>	 (CR) Urbanecm: [C: +2] [followup] mediawiki.feedlink: Atom's link icon overlaps the link (1 comment) [core] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/887350 (https://phabricator.wikimedia.org/T327717) (owner: Jdlrobson)
[21:07:46] <wikibugs>	 (PS2) Andrew Bogott: cinder-volume: make lvm volume group configurable via hiera [puppet] - https://gerrit.wikimedia.org/r/887403 (https://phabricator.wikimedia.org/T324729)
[21:07:48] <wikibugs>	 (PS2) Andrew Bogott: Cinder-volume lvm: clarify backend type/name confusion [puppet] - https://gerrit.wikimedia.org/r/887411 (https://phabricator.wikimedia.org/T324729)
[21:08:00] <wikibugs>	 SRE, Traffic, Data Pipelines (Sprint 08): Document Impact of Jan 8&9 Traffic Data Loss - https://phabricator.wikimedia.org/T326658 (EChetty)
[21:08:27] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/887409 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[21:08:33] <wikibugs>	 (PS2) Dzahn: phorge: add httpd Directory snippet, git clone arcanist [puppet] - https://gerrit.wikimedia.org/r/887410 (https://phabricator.wikimedia.org/T328595)
[21:08:50] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/887353 (https://phabricator.wikimedia.org/T328996) (owner: Jdlrobson)
[21:08:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/887351 (https://phabricator.wikimedia.org/T289212) (owner: Jdlrobson)
[21:09:04] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/887350 (https://phabricator.wikimedia.org/T327717) (owner: Jdlrobson)
[21:09:07] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:09:53] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:10:16] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: extract tcp flags from ulogd logs [puppet] - https://gerrit.wikimedia.org/r/884111 (https://phabricator.wikimedia.org/T325806) (owner: Cwhite)
[21:10:19] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cinder-volume: make lvm volume group configurable via hiera [puppet] - https://gerrit.wikimedia.org/r/887403 (https://phabricator.wikimedia.org/T324729) (owner: Andrew Bogott)
[21:11:41] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:11:57] <wikibugs>	 (CR) Dzahn: [C: +2] phorge: add httpd Directory snippet, git clone arcanist [puppet] - https://gerrit.wikimedia.org/r/887410 (https://phabricator.wikimedia.org/T328595) (owner: Dzahn)
[21:12:19] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2426.mgmt.codfw.wmnet with reboot policy FORCED
[21:12:47] <wikibugs>	 ops-eqiad, Observability-Metrics, Patch-For-Review, SRE Observability (FY2022/2023-Q2): Decommission netmon1002 - https://phabricator.wikimedia.org/T322321 (andrea.denisse) a:andrea.denisse→None
[21:12:52] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1053.eqiad.wmnet with reason: host reimage
[21:13:12] <wikibugs>	 ops-codfw, SRE Observability (FY2022/2023-Q3): Decommission netmon2001 - https://phabricator.wikimedia.org/T322695 (andrea.denisse)
[21:13:25] <wikibugs>	 ops-codfw, SRE Observability (FY2022/2023-Q3): Decommission netmon2001 - https://phabricator.wikimedia.org/T322695 (andrea.denisse) a:andrea.denisse→None
[21:14:04] <wikibugs>	 ops-codfw, SRE Observability (FY2022/2023-Q3): Decommission netmon2001 - https://phabricator.wikimedia.org/T322695 (andrea.denisse) Resolved→Open
[21:14:20] <wikibugs>	 ops-eqiad, Observability-Metrics, Patch-For-Review, SRE Observability (FY2022/2023-Q2): Decommission netmon1002 - https://phabricator.wikimedia.org/T322321 (andrea.denisse) Resolved→Open
[21:14:34] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2051.codfw.wmnet with reason: host reimage
[21:15:30] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1053.eqiad.wmnet with reason: host reimage
[21:16:31] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:17:04] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2427.mgmt.codfw.wmnet with reboot policy FORCED
[21:18:35] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2051.codfw.wmnet with reason: host reimage
[21:18:47] <wikibugs>	 (Merged) jenkins-bot: Disable languages on history page [skins/Vector] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/887353 (https://phabricator.wikimedia.org/T328996) (owner: Jdlrobson)
[21:19:39] <wikibugs>	 (PS3) Andrew Bogott: Cinder-volume lvm: clarify backend type/name confusion [puppet] - https://gerrit.wikimedia.org/r/887411 (https://phabricator.wikimedia.org/T324729)
[21:19:55] <wikibugs>	 (Merged) jenkins-bot: Remove button styling from log in link [skins/Vector] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/887351 (https://phabricator.wikimedia.org/T289212) (owner: Jdlrobson)
[21:20:00] <wikibugs>	 (Merged) jenkins-bot: [followup] mediawiki.feedlink: Atom's link icon overlaps the link [core] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/887350 (https://phabricator.wikimedia.org/T327717) (owner: Jdlrobson)
[21:20:28] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:887353|Disable languages on history page (T328996)]], [[gerrit:887351|Remove button styling from log in link (T289212)]], [[gerrit:887350|[followup] mediawiki.feedlink: Atom's link icon overlaps the link (T327717)]]
[21:20:34] <stashbot>	 T327717: Page tools: Support icon for Atom link (was Atom's link icon overlaps the link) - https://phabricator.wikimedia.org/T327717
[21:20:35] <stashbot>	 T289212: Feature request: Login button doesn't appear besides the create account text with new compact user menu - https://phabricator.wikimedia.org/T289212
[21:20:35] <stashbot>	 T328996: [Regression] History page help icon moved out of the header - https://phabricator.wikimedia.org/T328996
[21:21:03] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2426.mgmt.codfw.wmnet with reboot policy FORCED
[21:21:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2426.mgmt.codfw.wmnet with reboot policy FORCED
[21:22:20] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and jdlrobson: Backport for [[gerrit:887353|Disable languages on history page (T328996)]], [[gerrit:887351|Remove button styling from log in link (T289212)]], [[gerrit:887350|[followup] mediawiki.feedlink: Atom's link icon overlaps the link (T327717)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[21:22:33] <wikibugs>	 (PS4) Andrew Bogott: Cinder-volume lvm: clarify backend type/name confusion [puppet] - https://gerrit.wikimedia.org/r/887411 (https://phabricator.wikimedia.org/T324729)
[21:22:34] <urbanecm>	 Jdlrobson: all three backports are at debug servers. can you check them please?
[21:22:43] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2427.mgmt.codfw.wmnet with reboot policy FORCED
[21:22:47] <Jdlrobson>	 looking now
[21:22:55] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:24:38] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Cinder-volume lvm: clarify backend type/name confusion [puppet] - https://gerrit.wikimedia.org/r/887411 (https://phabricator.wikimedia.org/T324729) (owner: Andrew Bogott)
[21:24:44] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2427.mgmt.codfw.wmnet with reboot policy FORCED
[21:24:51] <wikibugs>	 (CR) Bking: [C: +2] search: add MAC entry for new an-airflow VM [puppet] - https://gerrit.wikimedia.org/r/887408 (https://phabricator.wikimedia.org/T327970) (owner: Bking)
[21:25:46] <Jdlrobson>	 urbanecm: all 3 look good to me
[21:25:50] <urbanecm>	 syncing
[21:26:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:26:20] <urbanecm>	 Superpes: hi! let me know once you added your patch to the calendar too :)
[21:26:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2427.mgmt.codfw.wmnet with reboot policy FORCED
[21:29:06] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1053.eqiad.wmnet with OS bullseye
[21:29:35] <wikibugs>	 (PS3) Superpes15: Install WikiLove extension on bnwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/886416 (https://phabricator.wikimedia.org/T328834)
[21:31:38] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:887353|Disable languages on history page (T328996)]], [[gerrit:887351|Remove button styling from log in link (T289212)]], [[gerrit:887350|[followup] mediawiki.feedlink: Atom's link icon overlaps the link (T327717)]] (duration: 11m 10s)
[21:31:43] <urbanecm>	 Jdlrobson: all three live :)
[21:31:44] <stashbot>	 T327717: Page tools: Support icon for Atom link (was Atom's link icon overlaps the link) - https://phabricator.wikimedia.org/T327717
[21:31:44] <stashbot>	 T289212: Feature request: Login button doesn't appear besides the create account text with new compact user menu - https://phabricator.wikimedia.org/T289212
[21:31:44] <stashbot>	 T328996: [Regression] History page help icon moved out of the header - https://phabricator.wikimedia.org/T328996
[21:32:07] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/886416 (https://phabricator.wikimedia.org/T328834) (owner: Superpes15)
[21:32:14] <urbanecm>	 Superpes: let's get started :)
[21:32:15] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2426.mgmt.codfw.wmnet with reboot policy FORCED
[21:32:27] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:32:34] <Superpes>	 urbanecm Yep! Thanks :)
[21:32:48] <wikibugs>	 (Merged) jenkins-bot: Install WikiLove extension on bnwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/886416 (https://phabricator.wikimedia.org/T328834) (owner: Superpes15)
[21:32:49] <Jdlrobson>	 thanks urbanecm double checking as we speak
[21:33:14] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:886416|Install WikiLove extension on bnwikiquote (T328834)]]
[21:33:17] <stashbot>	 T328834: Enable WikiLove extension on bnwikiquote - https://phabricator.wikimedia.org/T328834
[21:33:20] <urbanecm>	 !log Create extension tables for Wikilove on bnwikiquote (T328834)
[21:33:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:49] <wikibugs>	 (PS10) Urbanecm: Change the trwiki logo with a temporary one (old vector) [mediawiki-config] - https://gerrit.wikimedia.org/r/886983 (https://phabricator.wikimedia.org/T329047) (owner: Superpes15)
[21:34:29] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2051.codfw.wmnet with OS bullseye
[21:35:03] <logmsgbot>	 !log urbanecm@deploy1002 superpes and urbanecm: Backport for [[gerrit:886416|Install WikiLove extension on bnwikiquote (T328834)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[21:35:12] <urbanecm>	 Superpes: can you check it at mwdebug1001 please? :)
[21:35:18] <urbanecm>	 (it=the wikilove patch)
[21:36:30] <Superpes>	 Oh yep
[21:38:18] <urbanecm>	 let me know how it goes :)
[21:38:28] <Superpes>	 urbanecm Lol I cannot log in there
[21:39:29] <ma>	 Superpes: you need x-wikimedia-debug
[21:39:30] <urbanecm>	 Superpes: what do you mean, please? You can load a wiki via mwdebug1001 (or other debug servers) by installing https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_extensions and enabling it
[21:39:42] <wikibugs>	 (CR) Herron: kafka-logging: don't page on individual broker down (1 comment) [puppet] - https://gerrit.wikimedia.org/r/887342 (owner: Herron)
[21:39:51] <Jdlrobson>	 Looking good urbanecm thanks for your help as always!
[21:39:57] <urbanecm>	 then, you can load bnwikiquote using a staging environment at the debug server, and verify the change works as intended.
[21:39:59] <urbanecm>	 Jdlrobson: happy to help!
[21:40:06] <Superpes>	 Yep I'm using WikimediaDebug urbanecm but I can't see if the WikiLove extension actually works :/
[21:40:50] <urbanecm>	 can you clarify why please? :)
[21:41:35] <ma>	 urbanecm: maybe I can help with testing if needed?
[21:42:07] <urbanecm>	 i can test it as well (it appears to work for me), I'm trying to understand Superpes's issues, so they can test future patches too.
[21:42:11] <urbanecm>	 thanks for the offer though :)
[21:42:33] <ma>	 no problem, happy to help
[21:42:41] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:42:46] <urbanecm>	 I'm proceeding with the sync, as the extension appears to work. still happy to help with the testing issue though :)
[21:42:48] <Superpes>	 urbanecm Uhm the problem is that I don't see any WikiLove tool via IP (and can't log in)
[21:42:56] <urbanecm>	 ah! you meant login on the wiki
[21:42:59] <Superpes>	 Thanks urbanecm let's talk later about it 
[21:43:02] <urbanecm>	 gotcha
[21:43:02] <Superpes>	 Yep
[21:43:38] <wikibugs>	 (CR) Urbanecm: [C: +2] Change the trwiki logo with a temporary one (old vector) [mediawiki-config] - https://gerrit.wikimedia.org/r/886983 (https://phabricator.wikimedia.org/T329047) (owner: Superpes15)
[21:44:09] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:44:22] <wikibugs>	 (Merged) jenkins-bot: Change the trwiki logo with a temporary one (old vector) [mediawiki-config] - https://gerrit.wikimedia.org/r/886983 (https://phabricator.wikimedia.org/T329047) (owner: Superpes15)
[21:48:47] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:886416|Install WikiLove extension on bnwikiquote (T328834)]] (duration: 15m 32s)
[21:48:51] <stashbot>	 T328834: Enable WikiLove extension on bnwikiquote - https://phabricator.wikimedia.org/T328834
[21:48:56] <urbanecm>	 Superpes: the patch's live now
[21:49:21] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:49:27] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:886983|Change the trwiki logo with a temporary one (old vector) (T329047)]]
[21:49:31] <stashbot>	 T329047: Temporary logo change on the Turkish Wikipedia - https://phabricator.wikimedia.org/T329047
[21:49:37] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:49:42] <Superpes>	 Ok thanks :)
[21:51:14] <logmsgbot>	 !log urbanecm@deploy1002 superpes and urbanecm: Backport for [[gerrit:886983|Change the trwiki logo with a temporary one (old vector) (T329047)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[21:51:34] <urbanecm>	 Superpes: trwiki patch's at mwdebug1001. can you check this one please? or do you want me to help too?
[21:51:37] <wikibugs>	 (PS1) Cwhite: Add network.tcp_flags mapping and docs [software/ecs] - https://gerrit.wikimedia.org/r/886853 (https://phabricator.wikimedia.org/T325806)
[21:51:39] <wikibugs>	 (PS1) Volans: sre.hosts.provision: add sleep for race condition [cookbooks] - https://gerrit.wikimedia.org/r/887416
[21:51:50] <Superpes>	 urbanecm The logo is without a wordmark... So only the image remains without text :/ I don't know if they expected this! The rest seems ok
[21:53:03] <urbanecm>	 Superpes: the wordmark's not present in the png file, so it's not included. i guess that's an error in the file, is that right?
[21:53:32] <Superpes>	 Yep! They probably didn't think about inserting a logo with text, so I don't think it will be a problem, but objectively it is not the best choice... But I added the logo they posted ;)
[21:53:51] <urbanecm>	 so, should we go ahead? or revert?
[21:53:53] <Superpes>	 So nothing wrong from my side!
[21:54:00] <urbanecm>	 okay, proceeding
[21:54:01] <wikibugs>	 (CR) Volans: [C: +2] "trivial, self-merging to check if this solves the problem or not" [cookbooks] - https://gerrit.wikimedia.org/r/887416 (owner: Volans)
[21:54:32] <wikibugs>	 (PS8) Urbanecm: Allow AbuseFilter to block IPs and users on itwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/884333 (https://phabricator.wikimedia.org/T328194) (owner: Superpes15)
[21:54:44] <Superpes>	 Yep thanks ;) I'll ask them in the task if they want another image with text!
[21:54:55] <urbanecm>	 sounds good
[21:55:54] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.provision: add sleep for race condition [cookbooks] - https://gerrit.wikimedia.org/r/887416 (owner: Volans)
[21:56:16] <wikibugs>	 (CR) Urbanecm: [C: +2] Allow AbuseFilter to block IPs and users on itwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/884333 (https://phabricator.wikimedia.org/T328194) (owner: Superpes15)
[21:56:56] <wikibugs>	 (Merged) jenkins-bot: Allow AbuseFilter to block IPs and users on itwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/884333 (https://phabricator.wikimedia.org/T328194) (owner: Superpes15)
[21:57:04] <wikibugs>	 (PS2) Cwhite: Add network.tcp_flags mapping and docs [software/ecs] - https://gerrit.wikimedia.org/r/886853 (https://phabricator.wikimedia.org/T325806)
[21:59:15] <wikibugs>	 (PS1) Cwhite: logstash: update ecs to 1.11.0-6 [puppet] - https://gerrit.wikimedia.org/r/886854 (https://phabricator.wikimedia.org/T325806)
[21:59:47] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:886983|Change the trwiki logo with a temporary one (old vector) (T329047)]] (duration: 10m 20s)
[21:59:50] <stashbot>	 T329047: Temporary logo change on the Turkish Wikipedia - https://phabricator.wikimedia.org/T329047
[21:59:59] <urbanecm>	 second patch's live
[22:00:16] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:884333|Allow AbuseFilter to block IPs and users on itwikiversity (T328194)]]
[22:00:19] <stashbot>	 T328194: Enable AbuseFilter blocks on itwikiversity  - https://phabricator.wikimedia.org/T328194
[22:01:58] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and superpes: Backport for [[gerrit:884333|Allow AbuseFilter to block IPs and users on itwikiversity (T328194)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[22:02:28] <urbanecm>	 Superpes: i guess this is hard for you to test, right?
[22:02:31] <Superpes>	 @urbanec It works properly :)
[22:02:43] <Superpes>	 * urbanecm
[22:02:48] <urbanecm>	 okay, great
[22:04:43] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[22:06:59] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "provision new Ganeti VM an-airflow1005 - bking@cumin1001 - T327970"
[22:07:02] <stashbot>	 T327970: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970
[22:07:40] <wikibugs>	 (PS2) Cwhite: logstash: update ecs to 1.11.0-6 [puppet] - https://gerrit.wikimedia.org/r/886854 (https://phabricator.wikimedia.org/T325806)
[22:08:39] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:884333|Allow AbuseFilter to block IPs and users on itwikiversity (T328194)]] (duration: 08m 23s)
[22:08:41] <wikibugs>	 (CR) Cwhite: [C: +2] Add network.tcp_flags mapping and docs [software/ecs] - https://gerrit.wikimedia.org/r/886853 (https://phabricator.wikimedia.org/T325806) (owner: Cwhite)
[22:08:42] <stashbot>	 T328194: Enable AbuseFilter blocks on itwikiversity  - https://phabricator.wikimedia.org/T328194
[22:08:49] <urbanecm>	 Superpes: and, third patch live. 
[22:09:09] <Superpes>	 urbanecm Thanks for your time and for your support :D
[22:09:11] <wikibugs>	 (Merged) jenkins-bot: Add network.tcp_flags mapping and docs [software/ecs] - https://gerrit.wikimedia.org/r/886853 (https://phabricator.wikimedia.org/T325806) (owner: Cwhite)
[22:09:15] <urbanecm>	 happy to help!
[22:09:59] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "provision new Ganeti VM an-airflow1005 - bking@cumin1001 - T327970"
[22:10:34] <wikibugs>	 (CR) Cwhite: kafka-logging: don't page on individual broker down (1 comment) [puppet] - https://gerrit.wikimedia.org/r/887342 (owner: Herron)
[22:12:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[22:12:25] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: update ecs to 1.11.0-6 [puppet] - https://gerrit.wikimedia.org/r/886854 (https://phabricator.wikimedia.org/T325806) (owner: Cwhite)
[22:13:14] <wikibugs>	 (PS1) JHathaway: Add jaeger-es-index-cleaner [docker-images/production-images] - https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553)
[22:14:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new mw nodes in B6 - pt1979@cumin2002"
[22:14:49] <wikibugs>	 (CR) JHathaway: "kindly review" [docker-images/production-images] - https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: JHathaway)
[22:15:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new mw nodes in B6 - pt1979@cumin2002"
[22:15:41] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:15:51] <wikibugs>	 (CR) JHathaway: "Giuseppe would you please take another look at this updated patch when you have the time." [docker-images/production-images] - https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: JHathaway)
[22:16:02] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2428.mgmt.codfw.wmnet with reboot policy FORCED
[22:16:33] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2429.mgmt.codfw.wmnet with reboot policy FORCED
[22:20:12] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, and 2 others: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (taavi)
[22:26:37] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2428.mgmt.codfw.wmnet with reboot policy FORCED
[22:30:31] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2429.mgmt.codfw.wmnet with reboot policy FORCED
[22:31:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2430.mgmt.codfw.wmnet with reboot policy FORCED
[22:31:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2431.mgmt.codfw.wmnet with reboot policy FORCED
[22:33:21] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[22:41:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2431.mgmt.codfw.wmnet with reboot policy FORCED
[22:41:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[22:41:13] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2430.mgmt.codfw.wmnet with reboot policy FORCED
[22:42:15] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[22:43:23] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new mw nodes in B8 - pt1979@cumin2002"
[22:44:27] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new mw nodes in B8 - pt1979@cumin2002"
[22:44:27] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:45:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2432.mgmt.codfw.wmnet with reboot policy FORCED
[22:46:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2433.mgmt.codfw.wmnet with reboot policy FORCED
[22:53:21] <icinga-wm>	 RECOVERY - MegaRAID on db1155 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:56:20] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2433.mgmt.codfw.wmnet with reboot policy FORCED
[22:56:33] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2432.mgmt.codfw.wmnet with reboot policy FORCED
[22:59:32] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2434.mgmt.codfw.wmnet with reboot policy FORCED
[22:59:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2435.mgmt.codfw.wmnet with reboot policy FORCED
[23:03:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (Dwisehaupt) @Jclark-ctr Sorry for the delay on this. They weren't urgent and now the December fundraising is complete. You are clear to rack and cable...
[23:06:43] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2435.mgmt.codfw.wmnet with reboot policy FORCED
[23:06:44] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2434.mgmt.codfw.wmnet with reboot policy FORCED
[23:13:10] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[23:14:27] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:22:46] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2420']
[23:23:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2421']
[23:29:29] <wikibugs>	 (Abandoned) Aaron Schulz: Avoid udp2log for "objectcache" channel [mediawiki-config] - https://gerrit.wikimedia.org/r/712548 (https://phabricator.wikimedia.org/T288702) (owner: Aaron Schulz)
[23:30:05] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['mw2420']
[23:31:09] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['mw2421']
[23:32:35] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2422']
[23:32:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2423']
[23:45:54] <wikibugs>	 (PS1) Dzahn: phorge: separate install pathes for phorge and arcanist [puppet] - https://gerrit.wikimedia.org/r/887429 (https://phabricator.wikimedia.org/T328595)
[23:46:07] <wikibugs>	 (CR) CI reject: [V: -1] phorge: separate install pathes for phorge and arcanist [puppet] - https://gerrit.wikimedia.org/r/887429 (https://phabricator.wikimedia.org/T328595) (owner: Dzahn)
[23:47:50] <wikibugs>	 (PS2) Dzahn: phorge: separate install pathes for phorge and arcanist [puppet] - https://gerrit.wikimedia.org/r/887429 (https://phabricator.wikimedia.org/T328595)
[23:48:11] <wikibugs>	 (CR) CI reject: [V: -1] phorge: separate install pathes for phorge and arcanist [puppet] - https://gerrit.wikimedia.org/r/887429 (https://phabricator.wikimedia.org/T328595) (owner: Dzahn)
[23:49:50] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['mw2422']
[23:50:06] <wikibugs>	 (PS3) Dzahn: phorge: separate install pathes for phorge and arcanist [puppet] - https://gerrit.wikimedia.org/r/887429 (https://phabricator.wikimedia.org/T328595)
[23:51:16] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['mw2423']
[23:56:23] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2424']
[23:56:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2425']
[23:57:02] <wikibugs>	 (CR) Dzahn: [C: +2] phorge: separate install pathes for phorge and arcanist [puppet] - https://gerrit.wikimedia.org/r/887429 (https://phabricator.wikimedia.org/T328595) (owner: Dzahn)
[23:57:14] <wikibugs>	 (PS1) Dzahn: phorge: git clone arcanist also from we.phorge.it, not Phacility [puppet] - https://gerrit.wikimedia.org/r/887431 (https://phabricator.wikimedia.org/T328595)