[00:00:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:01] !refresh-topic [00:09:26] man I get baited by that every time [00:09:28] !oncall-now [00:09:28] Oncall now for team SRE, rotation batphone: [00:09:28] m.utante, j.hathaway, c.white, s.lyngs, l.mata, h.erron, r.zl, b.black, c.danis, s.ukhe, i.nflatador, v.olans, r.obh, u.random [00:09:47] cool, next auto update should fix the topic [00:10:40] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:12:12] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 0 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [00:15:38] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:19:16] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:06] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:04] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:55:55] (03CR) 10Cwhite: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/884111 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite) [01:00:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:03:35] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T328420 (10phaultfinder) [01:05:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:16] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:45:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:50:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:55:25] (03PS1) 10Sharvaniharan: Fix android session schema path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 [01:57:25] (03CR) 10Sharvaniharan: "Please review when you get a chance. Minor path change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 (owner: 10Sharvaniharan) [01:58:50] (03CR) 10Sharvaniharan: "Hi @Ottomata... Is it enough to just change the path and get it deployed again, or will I need to add a new entry?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 (owner: 10Sharvaniharan) [02:03:38] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:10:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:17] (03PS1) 10Bartosz Dziewoński: Add "Page Frame" to DiscussionTools beta feature on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886997 (https://phabricator.wikimedia.org/T327456) [02:20:45] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T0300) [03:07:49] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.22 [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/886849 (https://phabricator.wikimedia.org/T325585) [03:07:53] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.22 [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/886849 (https://phabricator.wikimedia.org/T325585) (owner: 10TrainBranchBot) [03:22:45] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.22 [core] (wmf/1.40.0-wmf.22) - 
10https://gerrit.wikimedia.org/r/886849 (https://phabricator.wikimedia.org/T325585) (owner: 10TrainBranchBot) [03:24:56] (03PS7) 10Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) [03:26:46] (03CR) 10CI reject: [V: 04-1] Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [03:30:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:33:09] (03PS8) 10Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) [03:35:00] (03CR) 10jenkins-bot: Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [03:36:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:40:21] (03PS9) 10Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) [03:42:09] (03CR) 10CI reject: [V: 04-1] Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [03:44:57] (03PS10) 10Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) [03:45:34] (03CR) 10Andrew Bogott: [C: 03+2] puppet-enc.py: remove a newline to make Black happy [puppet] - 10https://gerrit.wikimedia.org/r/886898 (owner: 10Andrew 
Bogott) [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T0400) [04:01:28] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887001 (https://phabricator.wikimedia.org/T325585) [04:01:30] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887001 (https://phabricator.wikimedia.org/T325585) (owner: 10TrainBranchBot) [04:02:08] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887001 (https://phabricator.wikimedia.org/T325585) (owner: 10TrainBranchBot) [04:02:37] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.22 refs T325585 [04:02:56] T325585: 1.40.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T325585 [04:08:52] PROBLEM - Disk space on deploy1002 is CRITICAL: DISK CRITICAL - free space: /srv 13208 MB (4% inode=77%): /srv/docker/overlay2/4e3cf33c6b5d21c9736e6bffc3ee5015324fa3d16a380e781af0d2df28ac71f8/merged 13208 MB (4% inode=77%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1002&var-datasource=eqiad+prometheus/ops [04:30:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:35:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:55:48] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.22 refs T325585 (duration: 53m 11s) [04:56:07] T325585: 1.40.0-wmf.22 deployment 
blockers - https://phabricator.wikimedia.org/T325585 [04:58:10] !log mwpresync@deploy1002 Pruned MediaWiki: 1.40.0-wmf.20 (duration: 02m 20s) [05:04:12] 10SRE, 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, and 3 others: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (10Sreeji... [05:15:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:20:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:34:54] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:34:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:38:06] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:39:46] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:40:14] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.303 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:40:16] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49566 bytes in 0.215 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:41:04] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:45:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:50:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:05:48] 10ops-codfw, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Marostegui) @Jhancock.wm can you confirm the server is meant to be off? I just tried to access it but I can't. 
[06:14:12] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1164 to m3 master [puppet] - 10https://gerrit.wikimedia.org/r/886904 (https://phabricator.wikimedia.org/T328404) (owner: 10Marostegui) [06:18:17] (03PS1) 10Marostegui: db1164: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/887009 (https://phabricator.wikimedia.org/T328404) [06:18:27] (03CR) 10Marostegui: [C: 04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/887009 (https://phabricator.wikimedia.org/T328404) (owner: 10Marostegui) [06:22:28] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:22:36] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:28:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1187', diff saved to https://phabricator.wikimedia.org/P43757 and previous config saved to /var/cache/conftool/dbconfig/20230207-062826-root.json [06:30:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2110 in API', diff saved to https://phabricator.wikimedia.org/P43758 and previous config saved to /var/cache/conftool/dbconfig/20230207-063147-root.json [06:35:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:23] (03PS1) 10Alexandros Kosiaris: Include wiktionary in page/html rerenders [deployment-charts] - 10https://gerrit.wikimedia.org/r/887082 (https://phabricator.wikimedia.org/T226931) [06:55:25] I am going to switch phabricator to read only for a minute in 5 minutes to complete 
https://phabricator.wikimedia.org/T328404 [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T0700) [07:00:05] kormat, marostegui, and Amir1: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T0700) [07:00:06] !log Failover m3 from db1159 to db1164 - T328404 [07:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:10] T328404: Switchover m3 master db1159 -> db1164 - https://phabricator.wikimedia.org/T328404 [07:00:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:02:19] This was done, read only time was 20 seconds [07:03:20] (03CR) 10Marostegui: [C: 03+2] db1164: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/887009 (https://phabricator.wikimedia.org/T328404) (owner: 10Marostegui) [07:05:44] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:05:50] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:06:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:13:37] (03CR) 10Ayounsi: Add BGP community to all k8s advertisments (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [07:15:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:18:37] (03PS6) 10Ayounsi: Add BGP 
community to all k8s advertisments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) [07:18:56] (03PS11) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [07:20:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:57:52] (03CR) 10JMeybohm: [C: 04-1] Add BGP community to all k8s advertisments (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [08:00:04] Amir1 and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T0800). [08:00:04] No Gerrit patches in the queue for this window AFAICS. [08:00:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:05:11] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) Again `00242: FAILED: internal_api_error_UploadChunkFileException: [3ef1b160-5844-46c6-9ddd-333c27... 
[08:05:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:08] hi, I have a config patch to deploy, will add to the calendar now [08:13:08] (03PS2) 10Kosta Harlan: GrowthExperiments: Disable leveling up features in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886343 (https://phabricator.wikimedia.org/T328757) [08:14:39] !log kharlan@deploy1002 backport aborted: (duration: 00m 07s) [08:15:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886343 (https://phabricator.wikimedia.org/T328757) (owner: 10Kosta Harlan) [08:15:54] (03Merged) 10jenkins-bot: GrowthExperiments: Disable leveling up features in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886343 (https://phabricator.wikimedia.org/T328757) (owner: 10Kosta Harlan) [08:16:39] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:886343|GrowthExperiments: Disable leveling up features in production (T328757)]] [08:16:42] T328757: Leveling up: Define feature flag for gating the functionality - https://phabricator.wikimedia.org/T328757 [08:18:30] !log kharlan@deploy1002 kharlan: Backport for [[gerrit:886343|GrowthExperiments: Disable leveling up features in production (T328757)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [08:18:36] (03PS8) 10Volans: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [08:19:17] (03PS1) 10Kosta Harlan: labs: Remove unneeded GEUseNewImpactModule feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887271 [08:20:31] (03PS7) 10Ayounsi: Add BGP community to all k8s advertisments [deployment-charts] - 
10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) [08:21:28] hmm, this is new `Check 'Check endpoints for mw1416.eqiad.wmnet' failed: /wiki/{title} (Special Version) is CRITICAL: Test Special Version returned the unexpected status 500 (expecting: 200)` [08:21:30] (03CR) 10CI reject: [V: 04-1] Add BGP community to all k8s advertisments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [08:21:45] ^ cc Amir1 urbanecm if you're around [08:22:09] otherwise, `sync-check-canaries` finished without incident [08:23:05] (03PS8) 10Ayounsi: Add BGP community to all k8s advertisments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) [08:24:24] kostajh: seems to be a one-time error; Special:Version appears to work fine on that server now. [08:26:34] ack [08:28:50] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:886343|GrowthExperiments: Disable leveling up features in production (T328757)]] (duration: 12m 11s) [08:28:54] T328757: Leveling up: Define feature flag for gating the functionality - https://phabricator.wikimedia.org/T328757 [08:30:04] (03PS9) 10Ayounsi: Add BGP community to all k8s advertisments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) [08:30:22] urbanecm: do you want me to sync https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/883153 and its child patch? [08:30:29] (03CR) 10Ayounsi: "Thanks for the help here and on IRC!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [08:30:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:30:51] kostajh: sure, that'd be great. 
[08:30:55] ack [08:30:55] !log installing imagemagick security updates on Thumbor T328901 [08:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:59] (03PS2) 10Kosta Harlan: Remove GEMentorProvider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883153 (https://phabricator.wikimedia.org/T321501) (owner: 10Urbanecm) [08:31:09] i've removed my -2 on it. thanks! [08:31:27] (03CR) 10Kosta Harlan: Remove GEMentorProvider (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883153 (https://phabricator.wikimedia.org/T321501) (owner: 10Urbanecm) [08:31:34] (03PS2) 10Kosta Harlan: [Growth] Remove mentor list variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883236 (https://phabricator.wikimedia.org/T321501) (owner: 10Urbanecm) [08:32:14] adding to the calendar [08:33:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883236 (https://phabricator.wikimedia.org/T321501) (owner: 10Urbanecm) [08:34:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883153 (https://phabricator.wikimedia.org/T321501) (owner: 10Urbanecm) [08:34:46] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ayounsi) Staging the new version on the switches: `asw-a-codfw> request system software add force-host set [ /var/tmp/jinstall-ex-4300-21.4R3-S1.5-signed... 
[08:35:03] (03Merged) 10jenkins-bot: Remove GEMentorProvider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883153 (https://phabricator.wikimedia.org/T321501) (owner: 10Urbanecm) [08:35:06] (03Merged) 10jenkins-bot: [Growth] Remove mentor list variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/883236 (https://phabricator.wikimedia.org/T321501) (owner: 10Urbanecm) [08:35:30] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:883236|[Growth] Remove mentor list variables (T321501)]], [[gerrit:883153|Remove GEMentorProvider (T321501)]] [08:35:33] T321501: Post-structured mentor list cleanup - https://phabricator.wikimedia.org/T321501 [08:35:39] (03CR) 10JMeybohm: [C: 03+1] Allow AS loops in eqiad staging k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/886328 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [08:37:20] !log kharlan@deploy1002 urbanecm and kharlan: Backport for [[gerrit:883236|[Growth] Remove mentor list variables (T321501)]], [[gerrit:883153|Remove GEMentorProvider (T321501)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [08:37:50] (03CR) 10Volans: [C: 03+1] "LGTM, couple of minor things inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [08:39:33] (03CR) 10Elukey: "Hi folks! I have another cookbook that uses this to upgrade a k8s cluster, do you have a target date to have it merged? No hurry just to s" [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [08:39:55] kostajh: do you want me to test those? 
[08:40:09] (03CR) 10Elukey: "This is currently blocked by https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/865057" [cookbooks] - 10https://gerrit.wikimedia.org/r/886317 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [08:40:28] urbanecm: sure, they are ready for verification [08:41:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] Include wiktionary in page/html rerenders [deployment-charts] - 10https://gerrit.wikimedia.org/r/887082 (https://phabricator.wikimedia.org/T226931) (owner: 10Alexandros Kosiaris) [08:42:04] kostajh: cswiki continues to be on structured mentor list, so those should be good to go [08:42:10] urbanecm: nothing seems broken [08:42:12] yup [08:45:02] (syncing) [08:45:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:45:35] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) FWIW, I can't find any trace of that filename in the swift proxy logs ` cumin -x O:swift:... 
[08:47:00] (03Merged) 10jenkins-bot: Include wiktionary in page/html rerenders [deployment-charts] - 10https://gerrit.wikimedia.org/r/887082 (https://phabricator.wikimedia.org/T226931) (owner: 10Alexandros Kosiaris) [08:48:18] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:883236|[Growth] Remove mentor list variables (T321501)]], [[gerrit:883153|Remove GEMentorProvider (T321501)]] (duration: 12m 48s) [08:48:24] done [08:48:35] T321501: Post-structured mentor list cleanup - https://phabricator.wikimedia.org/T321501 [08:48:44] !log UTC morning deploys done [08:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:30] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [08:49:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:12] !log rolling upgrade to HAProxy 2.4.21 in cp nodes [08:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:46] (03PS12) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [08:50:49] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:52:53] (03CR) 10Slyngshede: sre.ganeti.reimage: add new cookbook (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [08:56:29] (03CR) 10JMeybohm: [C: 03+1] "I was quite happy with this version, so +1 from me" [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [08:58:13] (03PS1) 10Filippo Giunchedi: admin: 
add santosh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/887272 (https://phabricator.wikimedia.org/T328517) [08:59:56] (03PS1) 10Filippo Giunchedi: admin: add 'anil' to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/887273 (https://phabricator.wikimedia.org/T328805) [09:02:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast2002.wikimedia.org with OS bullseye [09:02:56] 10SRE, 10Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast2002.wikimedia.org with OS bullseye [09:03:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "Neat!" [puppet] - 10https://gerrit.wikimedia.org/r/884111 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite) [09:03:19] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/887273 (https://phabricator.wikimedia.org/T328805) (owner: 10Filippo Giunchedi) [09:04:15] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add 'anil' to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/887273 (https://phabricator.wikimedia.org/T328805) (owner: 10Filippo Giunchedi) [09:05:18] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/887272 (https://phabricator.wikimedia.org/T328517) (owner: 10Filippo Giunchedi) [09:05:44] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add santosh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/887272 (https://phabricator.wikimedia.org/T328517) (owner: 10Filippo Giunchedi) [09:05:49] (03PS2) 10Filippo Giunchedi: admin: add santosh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/887272 (https://phabricator.wikimedia.org/T328517) [09:06:11] (03CR) 10JMeybohm: Add sre.discovery.datacenter-route (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [09:06:46] (03CR) 
10Filippo Giunchedi: [V: 03+2] admin: add santosh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/887272 (https://phabricator.wikimedia.org/T328517) (owner: 10Filippo Giunchedi) [09:07:51] (03PS1) 10Marostegui: mariadb: Enable notifications db1164 [puppet] - 10https://gerrit.wikimedia.org/r/887274 [09:08:08] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-logging-codfw cluster: Roll restart of jvm daemons. [09:08:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T328517 (10fgiunchedi) [09:08:41] (03CR) 10Marostegui: [C: 03+2] mariadb: Enable notifications db1164 [puppet] - 10https://gerrit.wikimedia.org/r/887274 (owner: 10Marostegui) [09:08:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T328517 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi No worries @Ottomata -- thanks for following up! Access will be fully live in 30 min, resolving. Though pleas... [09:15:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:19:20] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [09:19:31] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [09:19:35] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast2002.wikimedia.org with reason: host reimage [09:20:01] (03PS9) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 [09:20:22] !log add wiktionary to mobile-sections rerenders. 
T226931 [09:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:25] T226931: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 [09:20:28] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync [09:20:46] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [09:20:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:21:38] 10SRE, 10Cloud-VPS, 10cloud-services-team, 10observability, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10taavi) [09:21:45] (03CR) 10CI reject: [V: 04-1] Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [09:21:47] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:22:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast2002.wikimedia.org with reason: host reimage [09:22:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:23:18] 10Puppet, 10Cloud-VPS, 10Data-Persistence, 10Infrastructure-Foundations, and 2 others: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10taavi) [09:23:54] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [09:24:07] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [09:24:47] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.813 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:24:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:25:44] 10SRE, 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (10taavi) [09:27:34] 10SRE, 10Cloud-VPS, 10DNS, 10Traffic: PDNS in cloud can return inconsistent answers - https://phabricator.wikimedia.org/T281700 (10taavi) [09:30:27] 10SRE, 10Cloud-VPS, 10Traffic-Icebox, 10cloud-services-team: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10taavi) 05Open→03Resolved Per T249035#6122874 and lack of further updates. [09:31:39] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team: Consider ways to make puppetmaster CA changes smoother on the puppet client end - https://phabricator.wikimedia.org/T220268 (10taavi) [09:34:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of inlines comments. Nice approach!" [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [09:35:53] (03PS1) 10Nicolas Fraison: Add information and command rights on icinga to Nicolas Fraison [puppet] - 10https://gerrit.wikimedia.org/r/887276 [09:37:51] (03PS1) 10Filippo Giunchedi: clinic-duty: update message text fetching [software] - 10https://gerrit.wikimedia.org/r/887279 [09:39:07] 10Puppet, 10SRE, 10Cloud-Services, 10Infrastructure-Foundations: The Rack Puppet master server is deprecated and will be removed in a future release. Please use Puppet Server instead. - https://phabricator.wikimedia.org/T185815 (10taavi) 05Open→03Invalid I suspect this will be fixed by the Puppet 7 upg... 
[09:40:06] (03CR) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [09:40:28] (03CR) 10Clément Goubert: Add sre.discovery.datacenter-route (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [09:40:39] (03CR) 10Filippo Giunchedi: [C: 03+2] clinic-duty: update message text fetching [software] - 10https://gerrit.wikimedia.org/r/887279 (owner: 10Filippo Giunchedi) [09:42:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast2002.wikimedia.org with OS bullseye [09:42:12] 10SRE, 10Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast2002.wikimedia.org with OS bullseye completed: - bast2002 (**PASS**) - Downtimed on Icinga/Alertm... [09:42:49] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:44:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast1003.wikimedia.org with OS bullseye [09:44:22] 10SRE, 10Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast1003.wikimedia.org with OS bullseye [09:45:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:51] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: cloudgw: make VIPs use /32 netmask [puppet] - 10https://gerrit.wikimedia.org/r/887282 (https://phabricator.wikimedia.org/T295774) [09:49:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:05] (03PS5) 10EoghanGaffney: Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642) [09:51:53] (03PS1) 10Vgutierrez: admin_state: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/887284 (https://phabricator.wikimedia.org/T327925) [09:53:07] (03PS1) 10Elukey: profile::statistics::explorer::ml: add opencl packages [puppet] - 10https://gerrit.wikimedia.org/r/887285 [09:54:39] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39417/console" [puppet] - 10https://gerrit.wikimedia.org/r/887285 (owner: 10Elukey) [09:56:48] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast1003.wikimedia.org with reason: host reimage [09:57:45] (03CR) 10Klausman: [C: 03+1] ml-services: add autoscaling to en/wikidata for goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [09:58:41] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): Remove outdated invert-redis-sessions cookbook - https://phabricator.wikimedia.org/T329020 (10Clement_Goubert) [10:00:21] (03PS1) 10Clément Goubert: sre.switchdc.mediawiki: Remove 05-invert-redis-sessions [cookbooks] - 10https://gerrit.wikimedia.org/r/887286 (https://phabricator.wikimedia.org/T329020) [10:00:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:06] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert) [10:01:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on bast1003.wikimedia.org with reason: host reimage [10:02:02] (03PS10) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 [10:02:26] (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.switchdc.mediawiki: Remove 05-invert-redis-sessions [cookbooks] - 10https://gerrit.wikimedia.org/r/887286 (https://phabricator.wikimedia.org/T329020) (owner: 10Clément Goubert) [10:02:32] (03CR) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [10:02:54] (03CR) 10Clément Goubert: [C: 03+2] sre.switchdc.mediawiki: Remove 05-invert-redis-sessions [cookbooks] - 10https://gerrit.wikimedia.org/r/887286 (https://phabricator.wikimedia.org/T329020) (owner: 10Clément Goubert) [10:03:03] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations: Expose public hostname as Fact in puppet - https://phabricator.wikimedia.org/T101903 (10taavi) [10:03:09] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/887276 (owner: 10Nicolas Fraison) [10:03:26] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add BGP community to all k8s advertisments [deployment-charts] - 10https://gerrit.wikimedia.org/r/886329 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [10:03:31] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations: Expose public hostname as Fact in puppet - https://phabricator.wikimedia.org/T101903 (10taavi) 05Open→03Declined Per above. 
[10:04:00] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [10:04:37] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:04:44] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Remove 05-invert-redis-sessions [cookbooks] - 10https://gerrit.wikimedia.org/r/887286 (https://phabricator.wikimedia.org/T329020) (owner: 10Clément Goubert) [10:04:57] (03CR) 10Clément Goubert: [C: 03+1] "LGTM :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [10:05:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. The stat hosts are still on Buster, which doesn't have s3cmd, but it's present in buster-backports and will get installed from" [puppet] - 10https://gerrit.wikimedia.org/r/887285 (owner: 10Elukey) [10:08:46] (03PS1) 10Arturo Borrero Gonzalez: eqiad1: cloudgw: make VIPs use /32 netmask [puppet] - 10https://gerrit.wikimedia.org/r/887288 (https://phabricator.wikimedia.org/T295774) [10:12:04] !log oblivian@cumin2002 START - Cookbook sre.discovery.datacenter-route pool all active/active services in eqiad: Pooling eqiad for codfw depool today [10:12:09] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: Remove outdated invert-redis-sessions cookbook - https://phabricator.wikimedia.org/T329020 (10Clement_Goubert) 05In progress→03Resolved Wikitech page updated https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacente... 
[10:12:19] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert) [10:13:52] !log oblivian@cumin2002 END (FAIL) - Cookbook sre.discovery.datacenter-route (exit_code=93) pool all active/active services in eqiad: Pooling eqiad for codfw depool today [10:15:21] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:16:32] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team (Seen), 10User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (10taavi) [10:17:04] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10taavi) [10:17:13] (03PS11) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 [10:17:23] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Technical-Debt: Uniform cluster nomenclature across puppet - https://phabricator.wikimedia.org/T159411 (10taavi) [10:17:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast1003.wikimedia.org with OS bullseye [10:17:43] 10SRE, 10Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast1003.wikimedia.org with OS bullseye completed: - bast1003 (**PASS**) - Downtimed on Icinga/Alertm... 
[10:18:43] (03PS12) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 [10:19:29] !log oblivian@cumin2002 START - Cookbook sre.discovery.datacenter-route pool all active/active services in eqiad: Pooling eqiad for codfw depool today [10:19:39] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::statistics::explorer::ml: add opencl packages [puppet] - 10https://gerrit.wikimedia.org/r/887285 (owner: 10Elukey) [10:19:51] !log oblivian@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter-route (exit_code=0) pool all active/active services in eqiad: Pooling eqiad for codfw depool today [10:22:20] (03CR) 10Jbond: [C: 03+1] admin: add user santhosh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/885842 (https://phabricator.wikimedia.org/T328517) (owner: 10Herron) [10:23:31] (03PS13) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 [10:25:36] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [10:25:49] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10jijiki) [10:26:15] 10SRE, 10Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10MoritzMuehlenhoff) [10:26:30] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/884111 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite) [10:26:36] 10SRE, 10Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete [10:26:50] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10taavi) 
[10:34:58] (03PS1) 10Jbond: haproxy: force a full restart when /etc/defaults/haproxy is changed [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) [10:35:06] (03PS14) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 [10:35:55] 10Puppet, 10Cloud-VPS, 10Data-Persistence, 10Infrastructure-Foundations, and 3 others: haproxy::site doesn't work as expected on the first puppet run - https://phabricator.wikimedia.org/T321684 (10jbond) I have created a [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/887292 | CR ]] to force full... [10:38:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [10:39:53] (03Merged) 10jenkins-bot: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [10:45:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:43] (03PS5) 10Nicolas Fraison: Add nfraison to ops group [puppet] - 10https://gerrit.wikimedia.org/r/886891 (https://phabricator.wikimedia.org/T328915) [10:47:33] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:47:48] (03CR) 10Volans: haproxy: force a full restart when /etc/defaults/haproxy is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond) [10:48:36] (03CR) 10Btullis: [C: 03+2] Add nfraison to ops group [puppet] - 
10https://gerrit.wikimedia.org/r/886891 (https://phabricator.wikimedia.org/T328915) (owner: 10Nicolas Fraison) [10:48:53] (03CR) 10Vgutierrez: [C: 04-1] haproxy: force a full restart when /etc/defaults/haproxy is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond) [10:49:01] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-logging-codfw cluster: Roll restart of jvm daemons. [10:50:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:41] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-logging-eqiad cluster: Roll restart of jvm daemons. [10:55:21] (03CR) 10Jbond: [C: 03+1] "lgtm but seems like there is a lot of repetition between the two reimage cookbooks would be nice to have the shared code in one place. 
ho" [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [10:56:10] (03PS1) 10Majavah: openstack: nova: restrict rebuilds to admins [puppet] - 10https://gerrit.wikimedia.org/r/887299 (https://phabricator.wikimedia.org/T302404) [10:57:54] (03CR) 10Volans: [C: 03+1] "LGTM" [software/httpbb] - 10https://gerrit.wikimedia.org/r/886203 (owner: 10RLazarus) [10:58:47] (03CR) 10Jbond: haproxy: force a full restart when /etc/defaults/haproxy is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond) [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1100) [11:02:47] (03CR) 10Volans: "reply to comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [11:04:35] (03PS1) 10Muehlenhoff: Assign installserver role to install6002 [puppet] - 10https://gerrit.wikimedia.org/r/887302 (https://phabricator.wikimedia.org/T327867) [11:05:13] (03PS2) 10Jbond: haproxy: force a full restart when /etc/defaults/haproxy is changed [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) [11:06:01] (03CR) 10Jbond: haproxy: force a full restart when /etc/defaults/haproxy is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond) [11:06:35] (03CR) 10Jbond: haproxy: force a full restart when /etc/defaults/haproxy is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond) [11:07:22] (03CR) 10Muehlenhoff: [C: 03+2] Assign installserver role to install6002 [puppet] - 10https://gerrit.wikimedia.org/r/887302 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [11:07:45] (03CR) 10Jbond: [C: 03+1] 
sre.ganeti.reimage: add new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [11:10:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev: cloudgw: make VIPs use /32 netmask [puppet] - 10https://gerrit.wikimedia.org/r/887282 (https://phabricator.wikimedia.org/T295774) (owner: 10Arturo Borrero Gonzalez) [11:15:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:07] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "DONT MERGE YET. Still testing the effects in codfw1dev." [puppet] - 10https://gerrit.wikimedia.org/r/887288 (https://phabricator.wikimedia.org/T295774) (owner: 10Arturo Borrero Gonzalez) [11:17:34] (03PS1) 10Muehlenhoff: Point webproxy in drmrs to install6002 [dns] - 10https://gerrit.wikimedia.org/r/887304 (https://phabricator.wikimedia.org/T327867) [11:19:25] (03PS1) 10Muehlenhoff: Update tftp server settings for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/887305 (https://phabricator.wikimedia.org/T327867) [11:20:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:55] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2044.codfw.wmnet with OS bullseye [11:29:50] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1041.eqiad.wmnet with OS bullseye [11:30:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:41] (03CR) 10Muehlenhoff: [C: 03+2] Point webproxy in drmrs to install6002 [dns] - 10https://gerrit.wikimedia.org/r/887304 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [11:31:11] (03CR) 
10Volans: [C: 03+1] "LGTM, just a minor nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [11:31:41] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10jbond) [11:33:14] !log installing imagemagick security updates on buster [11:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:55] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2044.codfw.wmnet with reason: host reimage [11:40:58] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2044.codfw.wmnet with reason: host reimage [11:41:32] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1041.eqiad.wmnet with reason: host reimage [11:41:39] (03CR) 10Jbond: [C: 03+2] postgrers::user::hba: drop hba_label and use title instead [puppet] - 10https://gerrit.wikimedia.org/r/886912 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [11:44:35] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1041.eqiad.wmnet with reason: host reimage [11:52:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-logging-eqiad cluster: Roll restart of jvm daemons. 
[11:52:25] (03PS3) 10Aklapper: mediawiki: Better error page layout on mobile devices [puppet] - 10https://gerrit.wikimedia.org/r/405058 (https://phabricator.wikimedia.org/T182247) (owner: 10Phantom42) [11:56:36] (03PS10) 10Jbond: postgres::user: add hostname support to postgres user define [puppet] - 10https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783) [11:56:46] !log Install 10.4.28 on db1152 T329011 [11:56:47] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2044.codfw.wmnet with OS bullseye [11:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:50] T329011: Compile and package MariaDB 10.4.28 and 10.6.12 - https://phabricator.wikimedia.org/T329011 [11:58:31] 10SRE, 10Infrastructure-Foundations, 10netops: Allow managing drmrs DHCP settings with Homer - https://phabricator.wikimedia.org/T328737 (10cmooney) [11:58:37] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635 (10cmooney) [11:59:18] 10SRE, 10Infrastructure-Foundations, 10netops: Allow managing drmrs DHCP settings with Homer - https://phabricator.wikimedia.org/T328737 (10cmooney) Thanks @MoritzMuehlenhoff, I can roll this into the work to unify the asw configs across the board. We have it automated for similar switches (lsw, cloudsw) el... 
[12:00:07] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1041.eqiad.wmnet with OS bullseye [12:04:04] (03CR) 10Muehlenhoff: [C: 03+2] Update tftp server settings for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/887305 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [12:04:27] (03PS11) 10Jbond: postgres::user: add hostname support to postgres user define [puppet] - 10https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783) [12:07:55] (03CR) 10Jbond: [C: 03+2] postgres::user: add hostname support to postgres user define [puppet] - 10https://gerrit.wikimedia.org/r/886897 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [12:08:48] (03PS8) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [12:13:02] (03CR) 10Ladsgroup: "I would really prefer to get T326147 done before this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886905 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [12:17:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm6001.drmrs.wmnet [12:17:50] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:17:56] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10herron) [12:19:55] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm6001.drmrs.wmnet - jmm@cumin2002" [12:20:35] (03CR) 10Btullis: [C: 03+1] "Looks good to me too." 
[puppet] - 10https://gerrit.wikimedia.org/r/887276 (owner: 10Nicolas Fraison) [12:20:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm6001.drmrs.wmnet - jmm@cumin2002" [12:20:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:20:52] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache testvm6001.drmrs.wmnet on all recursors [12:20:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm6001.drmrs.wmnet on all recursors [12:21:12] (03PS9) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [12:24:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on doh2001.wikimedia.org with reason: depooled; T327925 [12:24:46] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [12:24:58] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on doh2001.wikimedia.org with reason: depooled; T327925 [12:25:35] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:25:52] (03CR) 10Ssingh: [C: 03+1] admin_state: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/887284 (https://phabricator.wikimedia.org/T327925) (owner: 10Vgutierrez) [12:26:22] (03PS1) 10Ladsgroup: Migrate Babel config into its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887307 (https://phabricator.wikimedia.org/T308932) [12:26:34] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 
others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ssingh) [12:28:11] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:28:19] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:28:33] ^ expected, doh2001 [12:28:45] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:28:55] !log depooling authdns2001 - T327925 [12:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:03] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert) [12:29:21] PROBLEM - Check systemd state on puppetdb2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:23] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:29:29] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [12:29:37] (03CR) 10Stevemunene: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/887276 (owner: 10Nicolas Fraison) [12:29:47] 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert) [12:30:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm6001.drmrs.wmnet [12:31:05] RECOVERY - Check systemd state on puppetdb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:17] (03PS10) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [12:31:57] 10SRE, 10SRE-swift-storage, 10Commons: `Manifestation pour la défense des retraites du 31 janvier 2023 - Flickr - Jeanne Menjoulet.jpg` not found in Commons - https://phabricator.wikimedia.org/T328889 (10Shizhao) 05Open→03Invalid >>! In T328889#8591463, @Dzahn wrote: > Maybe it was fixed since this ticket was... [12:33:06] 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert) [12:35:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:57] (03PS1) 10Bartosz Dziewoński: Don't add custom attributes in unwrapParsoidSections() [extensions/DiscussionTools] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886980 (https://phabricator.wikimedia.org/T328268) [12:38:11] (03CR) 10Nicolas Fraison: [C: 03+2] Add information and command rights on icinga to Nicolas Fraison [puppet] - 10https://gerrit.wikimedia.org/r/887276 (owner: 10Nicolas Fraison) [12:38:41] (03PS1) 10Muehlenhoff: Remove Puppet references to theemin [puppet] - 10https://gerrit.wikimedia.org/r/887309 [12:39:18] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Vgutierrez) [12:39:54] 
(03PS11) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [12:40:52] (03PS13) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [12:41:04] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Joe) To depool all services in codfw we will just need to run: ` sudo cookbook sre.discovery.datacenter-route --reason 'T327925' depool codfw ` from on... [12:41:43] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Marostegui) @Joe @akosiaris I assume we'll depool codfw for this one too? [12:42:57] (03CR) 10Vgutierrez: haproxy: force a full restart when /etc/defaults/haproxy is changed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond) [12:43:33] (03CR) 10CI reject: [V: 04-1] Don't add custom attributes in unwrapParsoidSections() [extensions/DiscussionTools] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886980 (https://phabricator.wikimedia.org/T328268) (owner: 10Bartosz Dziewoński) [12:43:43] (03PS1) 10Muehlenhoff: Add testvm6001 [puppet] - 10https://gerrit.wikimedia.org/r/887310 (https://phabricator.wikimedia.org/T327867) [12:43:53] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:44:47] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) See 
https://commons.wikimedia.org/w/index.php?title=Commons%3AVillage_pump&diff=prev&oldid=7306822... [12:46:11] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Joe) Please note: this won't depool `docker-registry`, which will still be active in codfw for the duration of the maintenance. [12:48:13] 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - https://phabricator.wikimedia.org/T329042 (10Clement_Goubert) [12:50:31] (03CR) 10Muehlenhoff: [C: 03+2] Remove Puppet references to theemin [puppet] - 10https://gerrit.wikimedia.org/r/887309 (owner: 10Muehlenhoff) [12:50:41] (03PS12) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [12:51:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39426/console" [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [12:53:55] (03PS2) 10Muehlenhoff: Add testvm6001 [puppet] - 10https://gerrit.wikimedia.org/r/887310 (https://phabricator.wikimedia.org/T327867) [12:54:35] (03PS1) 10Bartosz Dziewoński: Pin PHPUnit to 9.5.x [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886981 (https://phabricator.wikimedia.org/T328741) [12:55:08] (03PS14) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [12:55:10] (03PS2) 10Bartosz Dziewoński: Don't add custom attributes in unwrapParsoidSections() [extensions/DiscussionTools] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886980 (https://phabricator.wikimedia.org/T328268) [12:55:28] 10SRE, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover SRE-side Communication - 
https://phabricator.wikimedia.org/T329042 (10Clement_Goubert) [12:56:58] (03PS13) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [12:58:09] (03CR) 10Muehlenhoff: [C: 03+2] Add testvm6001 [puppet] - 10https://gerrit.wikimedia.org/r/887310 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [12:58:17] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Yeah, makes sense to make other wmf.21 patches work I guess." [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886981 (https://phabricator.wikimedia.org/T328741) (owner: 10Bartosz Dziewoński) [12:58:27] (03PS15) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [12:58:33] (03CR) 10Slyngshede: sre.ganeti.reimage: add new cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [13:00:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:46] (03CR) 10EoghanGaffney: "Hey Filippo, wondering if you have an opinion on this -- this option should work, but I don't know if we'd be better exploring another app" [puppet] - 10https://gerrit.wikimedia.org/r/886361 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [13:02:50] (03PS1) 10Muehlenhoff: Drop installserver role from install6001 [puppet] - 10https://gerrit.wikimedia.org/r/887311 (https://phabricator.wikimedia.org/T327867) [13:03:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:48] !log disable puppet in codfw, ulsfo and esams for switch upgrade 
T327925 [13:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:52] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [13:06:19] (03PS2) 10Muehlenhoff: Drop installserver role from install6001 [puppet] - 10https://gerrit.wikimedia.org/r/887311 (https://phabricator.wikimedia.org/T327867) [13:08:25] (03PS3) 10Muehlenhoff: Drop installserver role from install6001 [puppet] - 10https://gerrit.wikimedia.org/r/887311 (https://phabricator.wikimedia.org/T327867) [13:10:54] (03CR) 10Muehlenhoff: [C: 03+2] Drop installserver role from install6001 [puppet] - 10https://gerrit.wikimedia.org/r/887311 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [13:11:14] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: add autoscaling to en/wikidata for goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [13:11:17] !log oblivian@cumin2002 START - Cookbook sre.discovery.datacenter-route depool all active/active services in codfw: T327925 [13:11:21] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [13:12:59] !log enable puppet in codfw, ulsfo and esams to allow depools post switch upgrade T327925 [13:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:55] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:15:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:38] (03PS1) 10Jbond: wmnet: swap esams and esqin for the puppet 
CNAME [dns] - 10https://gerrit.wikimedia.org/r/887314 [13:17:50] (03CR) 10Ayounsi: [C: 03+1] "lgtm!" [dns] - 10https://gerrit.wikimedia.org/r/887314 (owner: 10Jbond) [13:19:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:28] (03PS14) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [13:23:30] (03PS1) 10Jbond: postgresql::user: fix filter statement [puppet] - 10https://gerrit.wikimedia.org/r/887316 [13:24:25] PROBLEM - TFTP service on install6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [13:24:57] PROBLEM - HTTP on install6001 is CRITICAL: connect to address 185.15.58.7 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Install_servers [13:25:23] PROBLEM - Squid on install6001 is CRITICAL: connect to address 185.15.58.7 and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/HTTP_proxy [13:26:10] ^ expected, replaced by install6002 [13:26:50] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10jcrespo) [13:27:02] (03CR) 10Jbond: [C: 03+2] postgresql::user: fix filter statement [puppet] - 10https://gerrit.wikimedia.org/r/887316 (owner: 10Jbond) [13:27:14] (03PS1) 10Jelto: sre.gitlab.upgrade: Add rollback() method and improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) [13:28:47] (03CR) 10CI reject: [V: 04-1] sre.gitlab.upgrade: Add rollback() method and improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) 
[13:29:38] (03PS15) 10Jbond: puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) [13:30:18] (03PS2) 10Vgutierrez: admin_state: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/887284 (https://phabricator.wikimedia.org/T327925) [13:30:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39431/console" [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [13:31:10] (03CR) 10Vgutierrez: [C: 03+2] admin_state: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/887284 (https://phabricator.wikimedia.org/T327925) (owner: 10Vgutierrez) [13:31:50] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 199 hosts with reason: codfw row A upgrade [13:31:50] !log depool codfw edge site - T327925 [13:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:55] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [13:32:43] !log oblivian@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter-route (exit_code=0) depool all active/active services in codfw: T327925 [13:33:07] (03CR) 10Filippo Giunchedi: "Thank you for" [puppet] - 10https://gerrit.wikimedia.org/r/886361 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [13:33:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:53] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - 
https://phabricator.wikimedia.org/T327925 (10ayounsi) For the record, full row hosts downtime done with: `sudo cookbook sre.hosts.downtime --hours 2 -r "codfw row A upgrade" -t T327925 'P{P:netbox::... [13:33:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 199 hosts with reason: codfw row A upgrade [13:34:20] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=295bf4d5-8856-488b-9ca9-06a0ff06db18) set by ayounsi@cumin1001 for 2:00:00 on 199 host(s... [13:36:32] (03CR) 10Muehlenhoff: [C: 03+1] "Sounds good" [dns] - 10https://gerrit.wikimedia.org/r/887314 (owner: 10Jbond) [13:37:41] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb: use new allowed_hosts paramater to postgresql:user [puppet] - 10https://gerrit.wikimedia.org/r/886900 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [13:37:43] (03PS2) 10Jelto: sre.gitlab.upgrade: Add rollback() method and improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) [13:41:39] (03CR) 10Ottomata: [C: 03+1] "I think in this case it will be fine to just deploy this. 
eventgate-analytics-external should have a short lived cache this stream config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 (owner: 10Sharvaniharan) [13:45:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:06] (03CR) 10Jbond: "updated" [puppet] - 10https://gerrit.wikimedia.org/r/887292 (https://phabricator.wikimedia.org/T321684) (owner: 10Jbond) [13:49:01] jouncebot: nowandnext [13:49:01] No deployments scheduled for the next 0 hour(s) and 10 minute(s) [13:49:01] In 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1400) [13:49:01] In 0 hour(s) and 10 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1400) [13:49:25] I’ll already backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/886981 to save a bit of time during the window [13:49:28] should be a no-op [13:49:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886981 (https://phabricator.wikimedia.org/T328741) (owner: 10Bartosz Dziewoński) [13:49:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:17] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [13:53:41] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:54:24] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) @Jhancock.wm can you please check the switch port for mw2423, it looks like i have already a server connected to port 41. Thanks [13:54:25] RECOVERY - Check systemd state on puppetdb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:37] (03PS11) 10Andrew Bogott: Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) [13:54:39] (03PS1) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) [13:54:50] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mw2422 and 24 DNS - pt1979@cumin2002" [13:55:04] (03CR) 10CI reject: [V: 04-1] Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [13:55:17] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:55:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mw2422 and 24 DNS - pt1979@cumin2002" [13:55:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:56:15] !log depool ms-fe2009 T327925 [13:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:19] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [13:56:23] RECOVERY - MegaRAID on an-worker1091 
is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:56:29] Lucas_WMDE: We have a large maintenance starting in 5min, it shouldn't be more than 30min downtime, is it ok to postpone any deployment? [13:56:55] uh [13:57:07] does that mean the UTC afternoon backport window won’t be happening? [13:57:40] my backport on its own isn’t important, but MatmaRex wanted to backport a DiscussionTools fix that depends on it [13:57:45] hi [13:57:53] :o [13:58:03] is something down? [13:58:16] it’s about to be, apparently [13:58:37] I assume this is the row switch stuff, but I didn’t know that was going to overlap with the backport window [13:58:40] that's the main task for the maintenance https://phabricator.wikimedia.org/T327925 [13:58:47] yeah I didn't know neither [13:59:04] i will be sad, but i can backport later too [13:59:05] (03PS2) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) [13:59:08] there wasn't anything on the calendar :/ [13:59:24] I’ll remove my +2 then [13:59:26] (03CR) 10CI reject: [V: 04-1] Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [13:59:26] and cancel the deploy [13:59:57] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Rescinding deployment, T327925 is about to happen." [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886981 (https://phabricator.wikimedia.org/T328741) (owner: 10Bartosz Dziewoński) [13:59:57] !log disable puppet in ulsfo/esams/codfw for codfw row A switch upgrade - T327925 [13:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. 
May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1400). [14:00:04] ottomata and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1400) [14:00:09] MatmaRex: which calendar? I can try to update it there as well [14:00:11] * urbanecm waves [14:00:17] (03PS5) 10Ottomata: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017) [14:00:23] XioNoX: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1400 [14:00:30] o/ [14:00:52] but it looks like the window's postponed, and that Lucas_WMDE is around to staff it [14:00:57] I’m not around, actually [14:00:58] XioNoX: https://wikitech.wikimedia.org/wiki/Deployments [14:01:01] I’m in a meeting now [14:01:10] but I think the whole window isn’t happening [14:01:11] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:01:27] ah [14:01:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:01:46] and whoever scheduled that switch maintenance, next time please put it in the deployment calendar? [14:02:17] oh, backport window not happening? 
[14:02:30] probably not, due to https://phabricator.wikimedia.org/T327925 being scheduled for the same time [14:02:36] without, apparently, anybody realising it until five minutes ago [14:02:44] ottomata: XioNoX said something about a switch maintenance [14:02:59] (03CR) 10Volans: [C: 03+1] "LGTM, wording nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:03:03] Basically, all of row A in codfw will be unreachable [14:03:26] To facilitate the maintenance, we decided to depool codfw completely [14:03:51] if the switch maintenance only takes 30 minutes, someone™ could in theory deploy at least part of the changes afterwards [14:03:54] hm, okay. does that mean we can't deploy? [14:03:57] (but not me, I’ll be in said meeting until the end of the hour) [14:04:03] someone special hopefully :) [14:04:16] iiuc depooling shouldn't affect scap deployments? [14:04:26] depends on how it's done [14:04:32] and none of the servers listed in that ticket are mw app servers [14:04:41] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:04:42] anyway, prob best to wait [14:04:42] okay [14:04:53] would love it if a deployer could help when the switch thing is done [14:05:08] There's mw[2291-2309,2377-2411] and parse[2001-2005] [14:05:21] OH, oops [14:05:21] yeah in theory the deployment could happen, but it's mostly to minimize the number of changes and moving parts in the same time window [14:05:23] okay. missed that [14:05:26] yeah [14:05:28] gotcha [14:05:29] let's wait [14:05:44] could someone ping me after said maintenance finishes? 
i'll try to deploy some patches of the backport window at least [14:06:23] yes thanks i'll do that urbanecm [14:06:29] ty [14:06:29] (03PS3) 10Jelto: sre.gitlab.upgrade: Add rollback() method and improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) [14:06:48] thanks [14:06:51] !log asw-a-codfw> request system reboot all-members - T327925 [14:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:54] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [14:07:25] !log lucaswerkmeister-wmde@deploy1002 backport aborted: (duration: 17m 46s) [14:07:49] (`scap backport` didn’t finish on its own after the core gate-and-submit finished but didn’t merge, so I Ctrl+Ced it) [14:07:54] * Lucas_WMDE done [14:08:04] ack Lucas_WMDE, thanks [14:08:45] !log disable puppet in codfw, ulsfo, esams for switch upgrade T327925 [14:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:57] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:09:02] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:09:03] PROBLEM - Host asw-a-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:09:23] PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:09:23] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:01] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is 
CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:10:01] XioNoX: expected ^^^^? [14:10:06] the asw-X [14:10:18] (ProbeDown) firing: Service thanos-web:443 has failed probes (http_thanos-web_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-web:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:10:21] volans: it's their mgmt interface [14:10:25] restbase i saw an alert for earlier so may be unrelated [14:10:26] I guess it's just the mgmt and that the check was quicker than the dependency in icinga [14:10:38] page acked [14:10:48] volans: I downtimed mr1 but looks like icinga considered them as down before seeing mr1 as down, so parent/child didn't kick in [14:11:03] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:05] what about thanos? [14:11:07] PROBLEM - Host ripe-atlas-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:11:35] PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:43] godog: should we worry about thanos? 
[14:12:09] volans: no, I'll pool another host [14:12:22] thanos-web is the web interface, not a huge deal and it is active/active [14:12:29] ack [14:12:32] (virtual-chassis crash) firing: Alert for device asw-a-codfw.mgmt.codfw.wmnet - virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [14:13:05] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2383.codfw.wmnet, mw2301.codfw.wmnet, mw2391.codfw.wmnet, mw2378.codfw.wmnet, mw2387.codfw.wmnet are marked down but pooled: thanos-web_443: Servers thanos-fe2001.codfw.wmnet are marked down but pooled: api-https_443: Servers mw2295.codfw.wmnet, mw2302.codfw.wmnet, mw2293.codfw.wmnet, mw2298.codfw.wmnet, mw2372.codfw.w [14:13:05] 2299.codfw.wmnet, mw2400.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:13:17] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:13:17] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:13:20] errr [14:13:23] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - 
https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:13:25] PROBLEM - VRRP status on cr1-codfw is CRITICAL: VRRP CRITICAL - 3 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [14:13:31] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2407.codfw.wmnet, mw2391.codfw.wmnet, mw2408.codfw.wmnet, mw2389.codfw.wmnet, mw2384.codfw.wmnet are marked down but pooled: thanos-web_443: Servers thanos-fe2001.codfw.wmnet are marked down but pooled: api-https_443: Servers mw2396.codfw.wmnet, mw2295.codfw.wmnet, mw2298.codfw.wmnet, mw2356.codfw.wmnet, mw2402.codfw.w [14:13:31] 2299.codfw.wmnet, mw2294.codfw.wmnet, mw2405.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:13:44] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=thanos-fe2001.codfw.wmnet,service=thanos-web [14:13:51] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thanos-fe2002.codfw.wmnet,service=thanos-web [14:13:55] claime: weren't the mediawiki hosts depooled? 
[14:13:58] (KubernetesCalicoDown) firing: ml-staging2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:13:58] (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:14:03] (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:14:03] (KubernetesCalicoDown) firing: (4) kubernetes2007.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:14:05] Well they were supposed to be [14:14:07] PROBLEM - MariaDB Replica IO: backup1-codfw on db2184 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2183.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2183.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:14:07] (ProbeDown) firing: (3) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:14:35] jynus: ^ want me to take care of that? 
[14:14:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [14:14:49] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [14:14:54] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [14:15:04] marostegui: I thought you had handled it [14:15:06] I can [14:15:16] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10akosiaris) >>! In T327991#8593396, @Marostegui wrote: > @Joe @akosiaris I assume we'll depool codfw for this one too? Yeah, as a team we are similarly a... 
[14:15:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:19] the thanos compact alert is fine [14:15:23] (ProbeDown) resolved: Service thanos-web:443 has failed probes (http_thanos-web_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-web:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:15:40] jynus: I didn't see that one on the list of things [14:15:41] thanks godog [14:15:45] (JobUnavailable) firing: (25) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:09] godog: should it page if it's not a huge deal? (thanos-web) [14:16:10] marostegui: maybe another case of desync between setup and maintenance [14:16:25] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:05] jynus: yeah, i didn't see that one at https://phabricator.wikimedia.org/T327925 I saw db2183 but I didn't know it had that replica [14:17:19] XioNoX: yeah that's fair, probably we don't need to page, I'll send a review after the maint [14:17:43] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:58] (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:18:27] Depooling mw appservers 
[14:18:30] no prob, I have downtimed it and will check it afterwards [14:18:36] claime: thx [14:18:37] thanks [14:18:46] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=api_appserver [14:18:58] (KubernetesCalicoDown) firing: (2) ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:18:58] (KubernetesCalicoDown) firing: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:18:58] (KubernetesCalicoDown) firing: (5) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:19:04] (ProbeDown) resolved: (3) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:35] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=jobrunner [14:19:45] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=appserver [14:19:48] (03PS1) 10EoghanGaffney: Sends the otrs.Daemon.pl log messages to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/887321 [14:19:59] (03CR) 10Elukey: "Thanks a lot for the quick round of reviews folks! I think that we are ready to merge?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [14:19:59] RECOVERY - Host asw-a-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.23 ms [14:20:03] RECOVERY - Host asw-c-codfw is UP: PING WARNING - Packet loss = 33%, RTA = 33.76 ms [14:20:03] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.57 ms [14:20:07] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01001 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:20:17] RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [14:20:35] (KafkaUnderReplicatedPartitions) firing: (2) Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:20:39] RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.50 ms [14:20:43] RECOVERY - VRRP status on cr1-codfw is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [14:20:45] (JobUnavailable) firing: (25) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:20:47] (03CR) 10EoghanGaffney: "Massively simplified approach to add the otrs logs to kafka!" 
[puppet] - 10https://gerrit.wikimedia.org/r/887321 (owner: 10EoghanGaffney) [14:20:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:03] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:21:23] RECOVERY - MariaDB Replica IO: backup1-codfw on db2184 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:21:30] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=parsoid [14:21:46] (03Abandoned) 10EoghanGaffney: Separate log messages from otrs.Daemon.pl to its own log file [puppet] - 10https://gerrit.wikimedia.org/r/886361 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [14:23:01] RECOVERY - Host ripe-atlas-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 31.92 ms [14:23:19] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:23:23] (ProbeDown) firing: (4) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:23:34] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:23:40] (AppserversUnreachable) resolved: 
Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:23:45] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [14:23:47] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:57] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:24:03] (KubernetesCalicoDown) resolved: ml-staging2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:24:07] (KubernetesCalicoDown) resolved: (2) ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:24:12] (KubernetesCalicoDown) resolved: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:24:16] (KubernetesCalicoDown) resolved: (5) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:24:19] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.28:443, 10.2.1.22:443, 10.2.1.26:443, 10.2.1.1:443]) https://wikitech.wikimedia.org/wiki/PyBal [14:24:19] (ProbeDown) firing: (6) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:24:22] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: add autoscaling to en/wikidata for goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [14:24:24] Lucas_WMDE, MatmaRex, the upgrade itself is successful, we're doing customary checks [14:24:39] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Check console cable for asw-a2-codfw - https://phabricator.wikimedia.org/T329055 (10cmooney) p:05Triage→03Low [14:24:43] nice, thanks for the note [14:24:45] (03CR) 10Elukey: [C: 03+2] ml-services: add autoscaling to en/wikidata for goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/886917 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [14:24:49] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [14:24:54] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 
minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [14:24:59] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [14:25:00] (03PS4) 10Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) [14:25:04] (ThanosCompactIsDown) resolved: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown [14:25:18] (03PS2) 10EoghanGaffney: Sends the otrs.Daemon.pl log messages to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/887321 (https://phabricator.wikimedia.org/T321759) [14:25:25] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:25:34] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [14:25:35] (KafkaUnderReplicatedPartitions) firing: (2) Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:25:45] (JobUnavailable) resolved: (25) Reduced 
availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:26:37] !log depooled appserver, api_appserver, jobrunner, parsoid - T327925 [14:26:39] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.28:443, 10.2.1.22:443, 10.2.1.1:443, 10.2.1.26:443]) https://wikitech.wikimedia.org/wiki/PyBal [14:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:40] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [14:27:07] (03CR) 10CI reject: [V: 04-1] Sends the otrs.Daemon.pl log messages to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/887321 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [14:27:54] (03PS3) 10EoghanGaffney: Sends the otrs.Daemon.pl log messages to kafka/logstash [puppet] - 10https://gerrit.wikimedia.org/r/887321 (https://phabricator.wikimedia.org/T321759) [14:27:59] !log enable puppet in codfw, ulsfo, esams post switch upgrade T327925 [14:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:28] (ProbeDown) firing: (5) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:28:34] (virtual-chassis crash) resolved: Device asw-a-codfw.mgmt.codfw.wmnet recovered from virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [14:28:38] (KubernetesRsyslogDown) resolved: (2) rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown 
[14:28:41] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:28:41] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:45] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:28:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:29:04] (ProbeDown) firing: (7) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [14:30:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:20] !log pool ms-fe2009 (codfw as a whole still depooled) T327925 [14:32:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:23] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [14:34:12] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/885864 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond) [14:34:33] (03CR) 10Filippo Giunchedi: "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/887321 (https://phabricator.wikimedia.org/T321759) (owner: 10EoghanGaffney) [14:35:01] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid [14:35:04] (03CR) 10Jbond: [C: 03+1] sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [14:35:15] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=appserver [14:35:28] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=jobrunner [14:35:43] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002003 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:35:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:02] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=api_appserver [14:36:20] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:codfw and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [14:36:25] !log repooled appserver, api_appserver, jobrunner, parsoid - T327925 [14:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:48] (ProbeDown) resolved: (4) Service 
api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:51] (03PS1) 10Vgutierrez: Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/886984 [14:39:04] (03PS2) 10Vgutierrez: Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/886984 (https://phabricator.wikimedia.org/T327925) [14:39:04] (ProbeDown) resolved: (4) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:40:13] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:40:13] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:40:21] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thanos-fe2001.codfw.wmnet,service=thanos-web [14:40:33] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=thanos-fe2002.codfw.wmnet,service=thanos-web [14:41:31] claime: following at 10%, please lemme know when okay to deploy :) [14:41:43] ottomata: will do [14:42:41] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:42:41] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 177, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:42:43] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:42:43] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:42:46] wb doh2001 [14:43:32] (03PS3) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) [14:46:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:28] o/, noted T329056 just as whatever all that ^ was, related? [14:46:29] T329056: beta-code-update-eqiad: FATAL: java.io.IOException: Unexpected termination of the channel - https://phabricator.wikimedia.org/T329056 [14:46:49] !log volans@cumin2002 START - Cookbook sre.discovery.datacenter-route pool all active/active services in codfw: T327925 [14:46:52] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [14:49:22] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T328825 (10Jclark-ctr) @Marostegui can drive be swapped as soon as it arrives? eta is today but unsure what time it will arrive [14:49:46] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T328825 (10Marostegui) @Jclark-ctr yes, you can do it whenever you want. 
[14:51:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:15] !log adding nfraison to pwstore T328915 [14:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:20] 10ops-codfw: codfw: one of cr1-cr2 link down - https://phabricator.wikimedia.org/T329059 (10ayounsi) p:05Triage→03High [14:55:47] (03PS4) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) [14:57:11] (03PS1) 10Marostegui: cuc_user_cuc_user_text_T328817.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887324 (https://phabricator.wikimedia.org/T328817) [14:57:21] (03PS1) 10Filippo Giunchedi: hieradata: don't page for thanos-web [puppet] - 10https://gerrit.wikimedia.org/r/887325 [14:58:10] (03CR) 10AikoChou: [C: 03+1] services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [14:59:04] (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886983 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [14:59:05] !log dbmaint deploy schema change on s6 T328828 [14:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:08] T328828: Remove default on cul_reason_id and cul_reason_plaintext_id in the cu_log table on wmf wikis - https://phabricator.wikimedia.org/T328828 [15:00:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:43] hi, what's the status of the maintenance please? 
[15:00:52] !log restart pybal in lvs2009 - T327925 [15:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:56] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [15:00:57] we're repooling things in codfw [15:01:05] !log dbmaint deploy schema change on s6 T328807 [15:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:08] T328807: Drop cul_reason from cu_log on wmf wikis - https://phabricator.wikimedia.org/T328807 [15:01:17] it should not take too much longer urbanecm [15:01:21] ack [15:02:41] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:02:59] (03PS5) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) [15:03:28] 10ops-codfw: codfw: one of cr1-cr2 link down - https://phabricator.wikimedia.org/T329059 (10cmooney) Seems to have gone down last Monday week (Jan 30th) ` Jan 30 17:32:20 re0.cr1-codfw mib2d[31964]: SNMP_TRAP_LINK_DOWN: ifIndex 647, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-1/0/1:1 ` Perhaps some ca... [15:04:37] !log restart pybal in lvs2010 - T327925 [15:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:08] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: Cathal Mooney Known issue - see T329059 - The acknowledgement expires at: 2023-02-13 10:04:30. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:05:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:18] (03CR) 10Ladsgroup: "generally looks good. 
One note." [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887324 (https://phabricator.wikimedia.org/T328817) (owner: 10Marostegui) [15:05:26] !log dbmaint deploy schema change on s8 T328807 T328828 [15:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:30] T328828: Remove default on cul_reason_id and cul_reason_plaintext_id in the cu_log table on wmf wikis - https://phabricator.wikimedia.org/T328828 [15:05:38] ACKNOWLEDGEMENT - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: Cathal Mooney Known issue - see T329059 - The acknowledgement expires at: 2023-02-13 10:05:11. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:05:43] (03PS6) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) [15:06:07] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:06:10] (03CR) 10Marostegui: cuc_user_cuc_user_text_T328817.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887324 (https://phabricator.wikimedia.org/T328817) (owner: 10Marostegui) [15:06:36] (03PS2) 10Marostegui: cuc_user_cuc_user_text_T328817.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887324 (https://phabricator.wikimedia.org/T328817) [15:06:46] (03CR) 10Marostegui: cuc_user_cuc_user_text_T328817.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887324 (https://phabricator.wikimedia.org/T328817) (owner: 10Marostegui) [15:07:56] !log volans@cumin2002 END (PASS) - Cookbook sre.discovery.datacenter-route (exit_code=0) pool all active/active services in codfw: T327925 [15:07:59] T327925: codfw row A switches upgrade - 
https://phabricator.wikimedia.org/T327925 [15:08:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [15:09:02] !log volans@cumin2002 START - Cookbook sre.discovery.service-route depool restbase-async in eqiad: T327925 [15:09:03] !log volans@cumin2002 START - Cookbook sre.dns.wipe-cache restbase-async.discovery.wmnet on all recursors [15:09:07] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase-async.discovery.wmnet on all recursors [15:10:26] (03CR) 10Ladsgroup: [C: 03+1] "Thanks!" [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887324 (https://phabricator.wikimedia.org/T328817) (owner: 10Marostegui) [15:10:40] (03CR) 10Marostegui: [C: 03+2] cuc_user_cuc_user_text_T328817.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/887324 (https://phabricator.wikimedia.org/T328817) (owner: 10Marostegui) [15:11:14] (03CR) 10Ssingh: [C: 03+1] Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/886984 (https://phabricator.wikimedia.org/T327925) (owner: 10Vgutierrez) [15:11:20] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/886984 (https://phabricator.wikimedia.org/T327925) (owner: 10Vgutierrez) [15:11:27] (03CR) 10Clément Goubert: [C: 03+1] Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/886984 (https://phabricator.wikimedia.org/T327925) (owner: 10Vgutierrez) [15:11:40] (03CR) 10Vgutierrez: [C: 03+2] Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/886984 (https://phabricator.wikimedia.org/T327925) (owner: 10Vgutierrez) [15:11:50] (03CR) 10Ayounsi: [C: 03+1] Revert "admin_state: depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/886984 (https://phabricator.wikimedia.org/T327925) (owner: 10Vgutierrez) [15:11:53] (03PS7) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 
(https://phabricator.wikimedia.org/T324729) [15:12:07] ottomata: urbanecm: You should be ok to go [15:12:18] ty [15:12:21] !log repool codfw edge site - T327925 [15:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:45] MatmaRex: if you're still around, do you want to go ahead with the deployment? [15:13:06] oh [15:13:08] (03CR) 10Urbanecm: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017) (owner: 10Ottomata) [15:13:12] sure. i actually have some time [15:13:18] ottomata: see my comment on the config patch please [15:13:30] i was about to reschedule it. thanks :) [15:13:37] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2045.codfw.wmnet with OS bullseye [15:13:53] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1043.eqiad.wmnet with OS bullseye [15:13:55] there's nothing officially scheduled in the calendar, so it should be fine to go ahead [15:14:00] (03CR) 10Urbanecm: [C: 03+2] Don't add custom attributes in unwrapParsoidSections() [extensions/DiscussionTools] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886980 (https://phabricator.wikimedia.org/T328268) (owner: 10Bartosz Dziewoński) [15:14:02] (03CR) 10Urbanecm: [C: 03+2] Pin PHPUnit to 9.5.x [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886981 (https://phabricator.wikimedia.org/T328741) (owner: 10Bartosz Dziewoński) [15:14:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [15:14:06] !log volans@cumin2002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in eqiad: T327925 [15:14:09] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [15:14:18] (03PS2) 10Urbanecm: Add "Page Frame" to DiscussionTools beta feature on enwiki [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/886997 (https://phabricator.wikimedia.org/T327456) (owner: 10Bartosz Dziewoński) [15:14:23] (03CR) 10Urbanecm: [C: 03+2] Add "Page Frame" to DiscussionTools beta feature on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886997 (https://phabricator.wikimedia.org/T327456) (owner: 10Bartosz Dziewoński) [15:14:37] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff) [15:15:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886997 (https://phabricator.wikimedia.org/T327456) (owner: 10Bartosz Dziewoński) [15:15:06] (03Merged) 10jenkins-bot: Add "Page Frame" to DiscussionTools beta feature on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886997 (https://phabricator.wikimedia.org/T327456) (owner: 10Bartosz Dziewoński) [15:15:37] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:886997|Add "Page Frame" to DiscussionTools beta feature on enwiki (T327456)]] [15:15:40] T327456: [Config Change] Add Page Frame to beta feature at Phase 2 wikis (desktop) - https://phabricator.wikimedia.org/T327456 [15:15:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:49] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1003.eqiad.wmnet [15:17:32] !log urbanecm@deploy1002 matmarex and urbanecm: Backport for [[gerrit:886997|Add "Page Frame" to DiscussionTools beta feature on enwiki (T327456)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, 
mwdebug1001.eqiad.wmnet [15:17:44] MatmaRex: please test at mwdebug1001 and let me know how it goes :) [15:18:39] looking [15:19:45] urbanecm: works as expected [15:20:11] ty, syncing [15:20:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1003.eqiad.wmnet [15:22:19] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:22:26] urbanecm: looking [15:22:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:58] (03PS8) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) [15:23:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P43760 and previous config saved to /var/cache/conftool/dbconfig/20230207-152337-root.json [15:25:35] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1043.eqiad.wmnet with reason: host reimage [15:25:42] (03PS9) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) [15:25:49] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [15:26:12] (03CR) 10Jforrester: [C: 03+1] "Oy." 
[core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886981 (https://phabricator.wikimedia.org/T328741) (owner: 10Bartosz Dziewoński) [15:26:16] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:886997|Add "Page Frame" to DiscussionTools beta feature on enwiki (T327456)]] (duration: 10m 39s) [15:26:19] T327456: [Config Change] Add Page Frame to beta feature at Phase 2 wikis (desktop) - https://phabricator.wikimedia.org/T327456 [15:26:26] MatmaRex: patch should be live (backport waiting on CI) [15:27:35] (03PS10) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) [15:28:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1043.eqiad.wmnet with reason: host reimage [15:28:30] (03CR) 10Btullis: [C: 03+1] profile::statistics::explorer::ml: add opencl packages [puppet] - 10https://gerrit.wikimedia.org/r/887285 (owner: 10Elukey) [15:28:54] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Clement_Goubert) [15:29:07] (03Abandoned) 10Jdrewniak: Account for temporary row in grid template row [skins/Vector] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/883618 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak) [15:29:15] (03Abandoned) 10Jdrewniak: Account for temporary row in grid template row [skins/Vector] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/883619 (https://phabricator.wikimedia.org/T327714) (owner: 10Jdrewniak) [15:29:23] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/885864 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond) [15:29:27] (03PS1) 10Jelto: install_server: add custom partman config for gitlab-runner [puppet] - 10https://gerrit.wikimedia.org/r/887330 (https://phabricator.wikimedia.org/T329035) [15:29:42] 
(03CR) 10CI reject: [V: 04-1] install_server: add custom partman config for gitlab-runner [puppet] - 10https://gerrit.wikimedia.org/r/887330 (https://phabricator.wikimedia.org/T329035) (owner: 10Jelto) [15:29:45] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2045.codfw.wmnet with reason: host reimage [15:29:51] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Clement_Goubert) [15:30:16] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Clement_Goubert) [15:30:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:40] (03Merged) 10jenkins-bot: Pin PHPUnit to 9.5.x [core] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886981 (https://phabricator.wikimedia.org/T328741) (owner: 10Bartosz Dziewoński) [15:30:42] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: move version check to earlier [cookbooks] - 10https://gerrit.wikimedia.org/r/885864 (https://phabricator.wikimedia.org/T328593) (owner: 10Jbond) [15:30:44] (03Merged) 10jenkins-bot: Don't add custom attributes in unwrapParsoidSections() [extensions/DiscussionTools] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/886980 (https://phabricator.wikimedia.org/T328268) (owner: 10Bartosz Dziewoński) [15:30:59] (03PS6) 10Ottomata: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017) [15:31:19] (03CR) 10Ottomata: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017) (owner: 
10Ottomata) [15:32:00] (03PS2) 10Jelto: install_server: add custom partman config for gitlab-runner [puppet] - 10https://gerrit.wikimedia.org/r/887330 (https://phabricator.wikimedia.org/T329035) [15:32:50] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2045.codfw.wmnet with reason: host reimage [15:33:04] PROBLEM - MariaDB Replica IO: es5 on es2024 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1593, Errmsg: Fatal error: Failed to run after_read_event hook https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:33:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:47] ^ marostegui [15:33:48] MatmaRex: your backport is at mwdebug1001 now, can you check please? [15:33:58] not using scap backport this time because of T323277 [15:33:59] T323277: scap backport: Multiple changes found for Ifb0316256bdec5008acc48544ddd3e2bf71b6d41 - https://phabricator.wikimedia.org/T323277 [15:34:17] checking [15:34:27] looking [15:34:35] same thing that happened to es2020 last time [15:36:08] RECOVERY - MariaDB Replica IO: es5 on es2024 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:36:18] 10SRE, 10ops-codfw: codfw: one of cr1-cr2 link down - https://phabricator.wikimedia.org/T329059 (10Papaul) Any reason why it went down last Monday and we are just seeing it now today after a week? [15:36:51] Looks related to semi sync from what I can see [15:36:54] It is fixed now though [15:37:38] urbanecm: looks good. 
sorry about the delay [15:37:55] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [15:38:18] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert) [15:38:22] urbanecm: urbanecm fixed my config patch. also...i added one more eventbus patch to deploy, if that is okay. [15:38:28] was just reported to me and i think it's unbreak now. [15:38:32] unrelated to the other two patches. [15:38:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P43761 and previous config saved to /var/cache/conftool/dbconfig/20230207-153842-root.json [15:38:55] ottomata: ack, thanks [15:38:59] MatmaRex: no problem, syncing [15:39:29] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ayounsi) 05Open→03Resolved a:03ayounsi The upgrade was smooth, ~15min hard downtime. No user impact, all the depools did their job. There was some... [15:39:38] !log urbanecm@deploy1002 Started scap: 20a79c55b7073e791e297a5389fa66819f596178: Don't add custom attributes in unwrapParsoidSections() (T328268) [15:39:41] T328268: Dirty diffs in headings in edits made with reply tool - https://phabricator.wikimedia.org/T328268 [15:40:06] 10SRE, 10ops-codfw: codfw: one of cr1-cr2 link down - https://phabricator.wikimedia.org/T329059 (10ayounsi) We're not looking at Icinga often enough :) [15:40:37] 10SRE, 10ops-codfw: codfw: one of cr1-cr2 link down - https://phabricator.wikimedia.org/T329059 (10Papaul) makes sense [15:40:40] ottomata: eh, i accidentally merged the other patch in master :D. 
hopefully we don't break anything with it [15:41:04] (03PS1) 10Urbanecm: Restore mediawiki.page-undelete hook [extensions/EventBus] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/886985 (https://phabricator.wikimedia.org/T329064) [15:41:17] (03PS1) 10Urbanecm: Restore mediawiki.page-undelete hook [extensions/EventBus] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887346 (https://phabricator.wikimedia.org/T329064) [15:41:27] (03CR) 10Urbanecm: [C: 03+2] Restore mediawiki.page-undelete hook [extensions/EventBus] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/886985 (https://phabricator.wikimedia.org/T329064) (owner: 10Urbanecm) [15:41:31] (03CR) 10Urbanecm: [C: 03+2] Restore mediawiki.page-undelete hook [extensions/EventBus] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887346 (https://phabricator.wikimedia.org/T329064) (owner: 10Urbanecm) [15:41:53] urbanecm: none of these patches should break anything [15:42:05] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: Add rollback() method and improve logging (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [15:42:06] the latest one i added will fix something that I accidentally broke back in october. :( [15:42:11] (03CR) 10Superpes15: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/886983 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [15:42:12] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (10Clement_Goubert) [15:42:20] :( [15:42:34] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Performance-Team (Radar): Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (10Clement_Goubert) p:05Triage→03High [15:42:56] (03PS1) 10Urbanecm: Finalize mediawiki/page/change schema at 1.0.0 [extensions/EventBus] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887347 (https://phabricator.wikimedia.org/T308017) [15:43:06] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1043.eqiad.wmnet with OS bullseye [15:43:08] (03PS1) 10Urbanecm: Finalize mediawiki/page/change schema at 1.0.0 [extensions/EventBus] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887348 (https://phabricator.wikimedia.org/T308017) [15:45:22] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: Add rollback() method and improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887317 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [15:46:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:13] !log urbanecm@deploy1002 Finished scap: 20a79c55b7073e791e297a5389fa66819f596178: Don't add custom attributes in unwrapParsoidSections() (T328268) (duration: 07m 34s) [15:47:17] T328268: Dirty diffs in headings in edits made with reply tool - https://phabricator.wikimedia.org/T328268 [15:47:35] MatmaRex: backport's live [15:47:43] thanks [15:47:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using 
scap backport" [extensions/EventBus] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/886985 (https://phabricator.wikimedia.org/T329064) (owner: 10Urbanecm) [15:47:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/EventBus] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887346 (https://phabricator.wikimedia.org/T329064) (owner: 10Urbanecm) [15:47:59] np [15:48:33] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2045.codfw.wmnet with OS bullseye [15:50:01] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 12 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10colewhite) [15:51:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:44] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform Value Stream, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10colewhite) [15:52:49] (03PS6) 10EoghanGaffney: Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642) [15:53:42] !log installing tiff security updates [15:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P43762 and previous config saved to /var/cache/conftool/dbconfig/20230207-155347-root.json [15:54:03] (03PS2) 10Jcrespo: Revert "dbbackups: Delay codfw es (db content) backups by one day" [puppet] - 10https://gerrit.wikimedia.org/r/886812 (https://phabricator.wikimedia.org/T327925) [15:54:33] (03CR) 10Jcrespo: [C: 03+2] Revert "dbbackups: Delay codfw es (db content) 
backups by one day" [puppet] - 10https://gerrit.wikimedia.org/r/886812 (https://phabricator.wikimedia.org/T327925) (owner: 10Jcrespo) [15:55:12] (03CR) 10EoghanGaffney: [C: 03+2] Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642) (owner: 10EoghanGaffney) [15:55:41] (03PS3) 10Jelto: install_server: add custom partman config for gitlab-runner [puppet] - 10https://gerrit.wikimedia.org/r/887330 (https://phabricator.wikimedia.org/T329035) [15:56:43] (03CR) 10Andrew Bogott: [C: 03+2] Split cinder-volume.conf out from cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/886934 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [15:57:20] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:57:40] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:57:45] 10SRE, 10ops-codfw: codfw: one of cr1-cr2 link down - https://phabricator.wikimedia.org/T329059 (10Papaul) 05Open→03Resolved cable unplugged `` papaul@re0.cr2-codfw> show interfaces terse xe-1/0/1:2 Interface Admin Link Proto Local Remote xe-1/0/1:2 up up xe... 
[15:59:50] (03Merged) 10jenkins-bot: Restore mediawiki.page-undelete hook [extensions/EventBus] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/886985 (https://phabricator.wikimedia.org/T329064) (owner: 10Urbanecm) [15:59:52] (03Merged) 10jenkins-bot: Restore mediawiki.page-undelete hook [extensions/EventBus] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887346 (https://phabricator.wikimedia.org/T329064) (owner: 10Urbanecm) [16:00:22] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:886985|Restore mediawiki.page-undelete hook (T329064)]], [[gerrit:887346|Restore mediawiki.page-undelete hook (T329064)]] [16:00:26] T329064: mediawiki.page-undelete stream is empty - https://phabricator.wikimedia.org/T329064 [16:02:13] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:886985|Restore mediawiki.page-undelete hook (T329064)]], [[gerrit:887346|Restore mediawiki.page-undelete hook (T329064)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [16:02:31] ottomata: can you please check the page-undelete hook at mwdebug, if possible? [16:03:30] 10SRE, 10DNS, 10Infrastructure-Foundations, 10Mail, and 4 others: Add SPF records for gitlab.wikimedia.org - https://phabricator.wikimedia.org/T328642 (10eoghan) 05Open→03Resolved I've deployed the softfail records and checked that they're in place: ` ❯ for i in 0 1 2; do ns=ns${i}.wikimedia.org... 
[16:03:44] (03PS34) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [16:04:04] (03PS53) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [16:04:06] (03PS14) 10Raymond Ndibe: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 (https://phabricator.wikimedia.org/T304040) (owner: 10David Caro) [16:07:12] urbanecm: i can try... i need to be able to undelete a page to do that, attempting... [16:07:14] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) [16:07:42] (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [16:08:25] urbanecm: i think i don't have permissions on testwiki to delete and undelete, do you? [16:08:29] PROBLEM - Kafka Broker Server #page on kafka-logging2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [16:08:41] * volans here [16:08:44] ottomata: yes. 
i can also give you sysop permissions at test.wikipedia if you tell me your username [16:08:48] PROBLEM - Kafka broker TLS certificate validity on kafka-logging2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [16:08:50] Ottomata [16:08:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P43763 and previous config saved to /var/cache/conftool/dbconfig/20230207-160852-root.json [16:09:00] acked page [16:09:12] PROBLEM - Check systemd state on kafka-logging2001 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:21] * jhathaway here as well [16:09:27] volans: need any assistance? [16:09:27] here [16:09:43] kafka.service: Main process exited, code=exited, status=143/n/a [16:09:49] ottomata: done [16:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:10:37] it works urbanecm thank you. [16:10:39] something stopped kafka? [16:10:39] Feb 7 12:14:03 kafka-logging2001 systemd[1]: Stopping Kafka Broker... [16:10:44] proceed with deploy [16:10:47] thanks [16:10:58] before i proceed, i see something broke :-/ [16:11:03] oh [16:11:04] ? [16:11:05] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [16:11:24] volans: it has puppet stopped with a message by Keith [16:11:25] volans: yeah looks like it [16:11:29] ah! [16:11:33] this is just expired downtime I expect [16:11:41] herron: ^^^^ [16:11:48] is that WIP on your side? 
[16:11:49] (03CR) 10David Caro: [C: 03+2] replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 (https://phabricator.wikimedia.org/T304040) (owner: 10David Caro) [16:11:51] moritzm: good catch, thanks [16:11:52] "switch maintenance, kafka stopped --herron" [16:12:03] (03CR) 10David Caro: replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 (https://phabricator.wikimedia.org/T304040) (owner: 10David Caro) [16:12:14] to SREs: i've a mediawiki deploy in progress, is it okay to let it finish? [16:12:19] urbanecm: yes [16:12:21] ok [16:12:23] thanks [16:12:26] ottomata: proceeding [16:13:00] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2003 is CRITICAL: 78 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2003 [16:13:24] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2002 is CRITICAL: 80 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2002 [16:14:02] the procedure in the task says [16:14:02] start kafka service, confirm kafka logging dashboard returns green [16:14:05] asking in o11y [16:14:50] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1045.eqiad.wmnet with OS bullseye [16:14:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:15:06] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2046.codfw.wmnet with OS bullseye 
[16:15:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:04] (03CR) 10Urbanecm: [C: 03+2] Finalize mediawiki/page/change schema at 1.0.0 [extensions/EventBus] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887347 (https://phabricator.wikimedia.org/T308017) (owner: 10Urbanecm) [16:16:06] (03CR) 10Urbanecm: [C: 03+2] Finalize mediawiki/page/change schema at 1.0.0 [extensions/EventBus] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887348 (https://phabricator.wikimedia.org/T308017) (owner: 10Urbanecm) [16:17:14] (03CR) 10Superpes15: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886983 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [16:17:22] (03PS9) 10Superpes15: Change the trwiki logo with a temporary one (old vector) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886983 (https://phabricator.wikimedia.org/T329047) [16:17:51] (03CR) 10David Caro: [C: 03+2] replica_cnf_web: add functional tests [puppet] - 10https://gerrit.wikimedia.org/r/867566 (https://phabricator.wikimedia.org/T304040) (owner: 10David Caro) [16:18:07] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:886985|Restore mediawiki.page-undelete hook (T329064)]], [[gerrit:887346|Restore mediawiki.page-undelete hook (T329064)]] (duration: 17m 44s) [16:18:10] T329064: mediawiki.page-undelete stream is empty - https://phabricator.wikimedia.org/T329064 [16:18:26] ottomata: first patch's live everywhere now [16:18:32] waiting on CI for the other one [16:18:37] 10ops-codfw, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Jhancock.wm) We did some more troubleshooting and it looks like the slot for DIMM_B4 is bad. This may need a MB replacement to fully fix. 
[16:18:55] awesome thank you, i just saw another undelete come through on sr.wikipedia.org, so its working everywhere [16:20:06] (03PS11) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) [16:20:08] (03PS1) 10Andrew Bogott: cinder-volume init module: move SPDX header [puppet] - 10https://gerrit.wikimedia.org/r/887341 [16:20:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:27] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [16:22:06] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [16:23:46] (03CR) 10Andrew Bogott: [C: 03+2] cinder-volume init module: move SPDX header [puppet] - 10https://gerrit.wikimedia.org/r/887341 (owner: 10Andrew Bogott) [16:23:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P43764 and previous config saved to /var/cache/conftool/dbconfig/20230207-162357-root.json [16:24:06] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10akosiaris) We 've discussed this internally within the team. **We realize that it's not possible to exclude wikitech from the s... 
[16:24:17] RECOVERY - Kafka Broker Server #page on kafka-logging2001 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [16:24:36] RECOVERY - Kafka broker TLS certificate validity on kafka-logging2001 is OK: SSL OK - Certificate kafka-logging2001.codfw.wmnet valid until 2023-09-12 07:55:00 +0000 (expires in 216 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [16:24:58] RECOVERY - Check systemd state on kafka-logging2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:25:07] now to make single broker down stop paging 🤨 [16:25:16] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2003 [16:25:35] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster logging-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [16:25:42] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2002 [16:26:31] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1045.eqiad.wmnet with reason: host reimage [16:26:39] (03PS7) 10Urbanecm: Finalize mediawiki/page/change schema, 
produce at rc1.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017) (owner: 10Ottomata) [16:27:28] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Check console cable for asw-a2-codfw - https://phabricator.wikimedia.org/T329055 (10Papaul) 05Open→03Resolved a:03Papaul The port was moved on the console server from port 18 to port 41 some days back when we did have some issues but I never... [16:28:40] (03CR) 10Urbanecm: [C: 03+2] Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885880 (https://phabricator.wikimedia.org/T308017) (owner: 10Ottomata) [16:30:05] 10SRE, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 8 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10colewhite) [16:30:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:39] (03PS1) 10Herron: kafka-logging: don't page on individual broker down [puppet] - 10https://gerrit.wikimedia.org/r/887342 [16:31:10] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1045.eqiad.wmnet with reason: host reimage [16:31:20] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2046.codfw.wmnet with reason: host reimage [16:31:52] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10akosiaris) @nskaggs, @bd808 (feel free to add others), let me know what you think. 
[16:32:47] (03Merged) 10jenkins-bot: Finalize mediawiki/page/change schema at 1.0.0 [extensions/EventBus] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887347 (https://phabricator.wikimedia.org/T308017) (owner: 10Urbanecm) [16:32:49] (03Merged) 10jenkins-bot: Finalize mediawiki/page/change schema at 1.0.0 [extensions/EventBus] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/887348 (https://phabricator.wikimedia.org/T308017) (owner: 10Urbanecm) [16:32:53] finally [16:33:15] (03CR) 10Herron: [C: 03+1] hieradata: don't page for thanos-web [puppet] - 10https://gerrit.wikimedia.org/r/887325 (owner: 10Filippo Giunchedi) [16:33:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:09] ottomata: pulled to mwdebug1001, can you test it there please? [16:34:19] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2046.codfw.wmnet with reason: host reimage [16:34:19] (03PS3) 10Jforrester: Move non-variant wgMFNearby to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833770 [16:34:23] (03CR) 10Herron: [C: 03+1] logstash: extract tcp flags from ulogd logs [puppet] - 10https://gerrit.wikimedia.org/r/884111 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite) [16:34:44] urbanecm: both the config change and the eventbus change? [16:34:50] that is correct [16:35:03] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Jclark-ctr) ms-fe1013 D2 U35 PORT13 4902 ms-fe1014 F1 U38. PORT 20220049 thanos-fe1004 F1 U3... 
[16:35:12] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Jclark-ctr) [16:35:42] i think...i can't test this unless the config change is on meta. hm. [16:36:39] ottomata: wdym? the config change is at mwdebug1001 meta [16:36:50] does it need to be at production meta for some reason? [16:38:36] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10bd808) >>! In T328768#8594394, @akosiaris wrote: > @nskaggs, @bd808 (feel free to add others), let me know what you think. Opt... [16:38:50] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10jcrespo) Apologies if I am wrong, which I probably am, but... > Wikitech read requests will flow to eqiad, and write... 
[16:39:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1187 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P43765 and previous config saved to /var/cache/conftool/dbconfig/20230207-163902-root.json [16:39:03] i can trigger the event production from testwiki [16:39:19] but, it gets produced to an eventgate instance, which looks up global stream config from metawiki [16:40:39] ah [16:40:45] i could maybe tail eventgate logs and see it attempt to receive the new eventstream and error [16:40:47] so then i think we need to sync, and hope [16:40:51] or that [16:40:59] your call :) [16:41:00] let's just sync, this stream is non production, nothing will break [16:41:03] okay [16:41:05] !log urbanecm@deploy1002 Started scap: 58f4d877: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change (T308017), 854ff4ac: Finalize mediawiki/page/change schema at 1.0.0 (T308017) [16:41:07] sync started [16:41:09] T308017: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 [16:41:11] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/887325 (owner: 10Filippo Giunchedi) [16:46:38] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1045.eqiad.wmnet with OS bullseye [16:48:37] !log urbanecm@deploy1002 Finished scap: 58f4d877: Finalize mediawiki/page/change schema, produce at rc1.mediawiki.page_change (T308017), 854ff4ac: Finalize mediawiki/page/change schema at 1.0.0 (T308017) (duration: 07m 32s) [16:48:40] T308017: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 [16:48:46] ottomata: and live [16:48:50] and i think we're done now [16:49:05] ok checking! [16:49:46] it works! 
Thank you urbanecm [16:49:58] awesome [16:50:04] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2046.codfw.wmnet with OS bullseye [16:50:18] 10SRE, 10Observability-Alerting, 10observability: alertmanager silence confirmation page links to localhost - https://phabricator.wikimedia.org/T328869 (10colewhite) Hmm... It appears there is a silence management UI in AlertManager, but the supporting UI code is not deployed with the deb package. In additi... [16:51:21] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/884111 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite) [16:52:50] (03PS12) 10Andrew Bogott: Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) [16:52:52] (03PS1) 10Andrew Bogott: cinder-volume.conf: include common oslo-messaging-rabbit section [puppet] - 10https://gerrit.wikimedia.org/r/887368 (https://phabricator.wikimedia.org/T324729) [16:53:44] (03CR) 10Andrew Bogott: [C: 03+2] cinder-volume.conf: include common oslo-messaging-rabbit section [puppet] - 10https://gerrit.wikimedia.org/r/887368 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [16:53:46] urbanecm: thank you so much for doing that outside of the window. really appreciate it! [16:53:52] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/887342 (owner: 10Herron) [16:58:40] glad i could be helpful [16:59:42] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10akosiaris) > How is this possible, if there are no codfw app servers serving wikitech? As I understand (and hopefully I am the... [17:00:05] jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:33] jouncebot: you're too late, I'm already taking the moon [17:00:43] 10SRE, 10ops-drmrs, 10Infrastructure-Foundations, 10netops: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (10RobH) 05In progress→03Resolved remote hands successfully removed the optic this AM and placed it in our racks, we'll just have it thrown away next remote hands work... [17:00:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:19] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10jcrespo) >>! In T328768#8594700, @akosiaris wrote: >> How is this possible, if there are no codfw app servers serving wikitech?... [17:06:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:13] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10BTullis) Hi @jbond and @MoritzMuehlenhoff - how are things looking with regard to this OIDC support? We would still like to be able to {T305874} using idp because the LDAP... 
[17:06:35] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10BTullis) [17:06:49] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Upgrade IDPs to CAS 6.6/Bullseye and enable webauthn - https://phabricator.wikimedia.org/T305518 (10BTullis) [17:07:57] (03PS1) 10Raymond Ndibe: puppet: Add ::profile::wmcs::services::toolsdb_replica_cnf to role::wmcs::nfs::primary [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) [17:08:13] (03CR) 10CI reject: [V: 04-1] puppet: Add ::profile::wmcs::services::toolsdb_replica_cnf to role::wmcs::nfs::primary [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [17:09:37] (03PS2) 10Raymond Ndibe: puppet: Add ::profile::wmcs::services::toolsdb_replica_cnf to role::wmcs::nfs::primary [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) [17:09:57] (03CR) 10CI reject: [V: 04-1] puppet: Add ::profile::wmcs::services::toolsdb_replica_cnf to role::wmcs::nfs::primary [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [17:12:30] (03PS3) 10Raymond Ndibe: puppet: modify role::wmcs::nfs::primary for replica_cnf api [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) [17:13:57] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10jbond) @BTullis OIDC support is now possible and is being tried out by the new IDM. It should be to a state where you can start using it and happy to help out/provide more... 
[17:15:47] (03CR) 10Majavah: [C: 04-1] "I see a couple of issues here:" [puppet] - 10https://gerrit.wikimedia.org/r/887370 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [17:15:57] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: move version check to earlier [cookbooks] - 10https://gerrit.wikimedia.org/r/885864 (https://phabricator.wikimedia.org/T328593) [17:17:54] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:19:24] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: consolidate extra floating IP routes [puppet] - 10https://gerrit.wikimedia.org/r/887372 (https://phabricator.wikimedia.org/T329041) [17:19:26] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: introduce additional route for VIPs [puppet] - 10https://gerrit.wikimedia.org/r/887373 (https://phabricator.wikimedia.org/T295774) [17:21:34] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2047.codfw.wmnet with OS bullseye [17:22:13] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1046.eqiad.wmnet with OS bullseye [17:22:48] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10jcrespo) No blocker on my side, then. Supporting path 5 (security worried me more than performance). [17:28:00] (03PS1) 10Jdlrobson: [followup] mediawiki.feedlink: Atom's link icon overlaps the link [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887350 (https://phabricator.wikimedia.org/T327717) [17:31:10] [a3e0ac52-9983-4e72-b47e-d455e43bd181] 2023-02-07 17:30:13: Fatal exception of type "Wikimedia\RequestTimeout\RequestTimeoutException" on frwiki, potentially because of template vandalism? 
[17:31:31] rzl, jhathaway^ [17:31:40] jynus: see -security [17:31:49] jynus: thanks [17:34:01] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1046.eqiad.wmnet with reason: host reimage [17:35:13] (03CR) 10RLazarus: [C: 03+2] Cleanup: Drop pre-python3.7 support [software/httpbb] - 10https://gerrit.wikimedia.org/r/886203 (owner: 10RLazarus) [17:36:52] (03Merged) 10jenkins-bot: Cleanup: Drop pre-python3.7 support [software/httpbb] - 10https://gerrit.wikimedia.org/r/886203 (owner: 10RLazarus) [17:37:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1046.eqiad.wmnet with reason: host reimage [17:37:44] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2047.codfw.wmnet with reason: host reimage [17:38:06] (03CR) 10Mabualruz: "is this a duplicate of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/886852 or is it for the train?" [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887350 (https://phabricator.wikimedia.org/T327717) (owner: 10Jdlrobson) [17:40:42] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2047.codfw.wmnet with reason: host reimage [17:45:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:24] (03CR) 10Krinkle: [C: 03+1] "This is now ready to go I think? 
The mediawiki-config patch has gone out meanwhile, which removes these old schemas from the EventLogging " [puppet] - 10https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T281103) (owner: 10Phedenskog) [17:51:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:56] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1046.eqiad.wmnet with OS bullseye [17:53:28] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2047.codfw.wmnet with OS bullseye [17:55:48] !log bking@cumin1001 repooling elastic and wdqs hosts post-maintenance T327925 [17:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:51] T327925: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1800) [18:00:08] (03PS1) 10Jdlrobson: Remove button styling from log in link [skins/Vector] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887351 (https://phabricator.wikimedia.org/T289212) [18:00:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:24] !log bking@cumin2002 START - Cookbook sre.hosts.remove-downtime for 13 hosts [18:02:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 13 hosts [18:05:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:22] (03CR) 10Ottomata: [C: 03+2] eventlogging: Remove obsoleted navtiming schemas [puppet] - 
10https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T281103) (owner: 10Phedenskog) [18:09:45] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:17:25] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1047.eqiad.wmnet with OS bullseye [18:18:04] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2048.codfw.wmnet with OS bullseye [18:23:06] (03CR) 10Dzahn: [C: 03+1] "yes, the old recipes for ganeti VMs should be removed and the idea is very reasonable. I won't pretend I can actually review (in a testin" [puppet] - 10https://gerrit.wikimedia.org/r/887330 (https://phabricator.wikimedia.org/T329035) (owner: 10Jelto) [18:25:55] 10SRE, 10ops-codfw, 10cloud-services-team, 10decommission-hardware: decommission clouddb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T328079 (10Papaul) [18:26:58] 10SRE, 10ops-codfw, 10cloud-services-team, 10decommission-hardware: decommission clouddb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T328079 (10Papaul) 05Open→03Resolved This is complete [18:28:17] (03CR) 10Andrew Bogott: [C: 03+2] Move cinder-volume into its own class and profile [puppet] - 10https://gerrit.wikimedia.org/r/887320 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [18:29:03] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1047.eqiad.wmnet with reason: host reimage [18:32:02] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1047.eqiad.wmnet with reason: host reimage [18:34:26] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2048.codfw.wmnet with reason: host reimage 
[18:34:55] (03PS1) 10Andrew Bogott: openstack::cinder::volume: Pass $version down to the config module [puppet] - 10https://gerrit.wikimedia.org/r/887378 [18:37:01] (03CR) 10Andrew Bogott: [C: 03+2] openstack::cinder::volume: Pass $version down to the config module [puppet] - 10https://gerrit.wikimedia.org/r/887378 (owner: 10Andrew Bogott) [18:37:23] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10Papaul) p:05Triage→03Medium a:03cmooney [18:37:28] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2048.codfw.wmnet with reason: host reimage [18:40:08] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) fyi i tested connecting temporary the xe-0/0/47 to cr2 xe-5/0/0 link was okay ` papaul@re0.cr2-... [18:42:38] (03PS1) 10Dzahn: phorge: create role/profile to install LAMP and git clone phorge [puppet] - 10https://gerrit.wikimedia.org/r/887379 (https://phabricator.wikimedia.org/T328595) [18:43:32] (03PS2) 10Dzahn: phorge: create role/profile to install LAMP and git clone phorge [puppet] - 10https://gerrit.wikimedia.org/r/887379 (https://phabricator.wikimedia.org/T328595) [18:47:05] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1047.eqiad.wmnet with OS bullseye [18:50:51] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:51:44] (03PS1) 10Dzahn: add SPDX license headers to various roles I was involved in writing [puppet] - 10https://gerrit.wikimedia.org/r/887382 [18:52:53] (03PS2) 10Dzahn: add SPDX license headers to various roles I was involved in writing [puppet] - 10https://gerrit.wikimedia.org/r/887382 [18:53:35] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
mc2048.codfw.wmnet with OS bullseye [18:55:02] (03PS3) 10Dzahn: phorge: create role/profile to install LAMP and git clone phorge [puppet] - 10https://gerrit.wikimedia.org/r/887379 (https://phabricator.wikimedia.org/T328595) [18:57:47] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:59:42] 10SRE, 10SRE-swift-storage, 10Commons: `Manifestation pour la défense des retraites du 31 janvier 2023 - Flickr - Jeanne Menjoulet.jpg` not found in Commons - https://phabricator.wikimedia.org/T328889 (10Dzahn) @Shizhao Great! Thanks. In case it happens again feel free to just reopen this or make a new ticke... [19:00:01] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mw2423,25,26,27 DNS - pt1979@cumin2002" [19:00:05] ^demon and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T1900). 
[19:00:26] o/ [19:00:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mw2423,25,26,27 DNS - pt1979@cumin2002" [19:00:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:01:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2422.mgmt.codfw.wmnet with reboot policy FORCED [19:03:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2423.mgmt.codfw.wmnet with reboot policy FORCED [19:03:53] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2049.codfw.wmnet with OS bullseye [19:04:09] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:04:13] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1049.eqiad.wmnet with OS bullseye [19:04:32] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10Dzahn) As far as I can tell nowadays there is no more node that uses multiple roles. Only one role at a time, s... [19:05:27] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10Dzahn) https://gerrit.wikimedia.org/r/q/topic:%22role-profile%22+(status:open%20OR%20status:merged) [19:07:44] (03CR) 10Dzahn: [C: 03+2] phorge: create role/profile to install LAMP and git clone phorge [puppet] - 10https://gerrit.wikimedia.org/r/887379 (https://phabricator.wikimedia.org/T328595) (owner: 10Dzahn) [19:11:46] (03PS1) 10Andrew Bogott: Add a new role for cloudvirt nodes with a cinder/lvm client. 
[puppet] - 10https://gerrit.wikimedia.org/r/887390 (https://phabricator.wikimedia.org/T324729) [19:12:07] (03CR) 10CI reject: [V: 04-1] Add a new role for cloudvirt nodes with a cinder/lvm client. [puppet] - 10https://gerrit.wikimedia.org/r/887390 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [19:13:33] (03PS2) 10Andrew Bogott: Add a new role for cloudvirt nodes with a cinder/lvm client. [puppet] - 10https://gerrit.wikimedia.org/r/887390 (https://phabricator.wikimedia.org/T324729) [19:15:31] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:15:52] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1049.eqiad.wmnet with reason: host reimage [19:16:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:20] (03PS3) 10Andrew Bogott: Add a new role for cloudvirt nodes with a cinder/lvm client. 
[puppet] - 10https://gerrit.wikimedia.org/r/887390 (https://phabricator.wikimedia.org/T324729) [19:18:40] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1049.eqiad.wmnet with reason: host reimage [19:20:03] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2049.codfw.wmnet with reason: host reimage [19:21:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:22:16] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:23:02] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2049.codfw.wmnet with reason: host reimage [19:25:51] RECOVERY - Check systemd state on mw2350 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:01] (03CR) 10Andrew Bogott: [C: 03+2] Add a new role for cloudvirt nodes with a cinder/lvm client. 
[puppet] - 10https://gerrit.wikimedia.org/r/887390 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [19:28:36] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T328825 (10Jclark-ctr) @Marostegui replaced failed drive [19:33:42] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1049.eqiad.wmnet with OS bullseye [19:39:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2049.codfw.wmnet with OS bullseye [19:40:49] (03CR) 10Bking: [C: 03+2] Create scap deployment source for search airflow v2 [puppet] - 10https://gerrit.wikimedia.org/r/883678 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson) [19:41:19] (03PS2) 10Bking: Create scap deployment source for search airflow v2 [puppet] - 10https://gerrit.wikimedia.org/r/883678 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson) [19:41:38] (03PS4) 10Bking: Configure search platform airflow 2 instance [puppet] - 10https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson) [19:43:18] (03CR) 10Sergio Gimeno: [C: 03+1] labs: Remove unneeded GEUseNewImpactModule feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887271 (owner: 10Kosta Harlan) [19:44:33] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2422.mgmt.codfw.wmnet with reboot policy FORCED [19:44:36] (03CR) 10Bking: [V: 03+1] Configure search platform airflow 2 instance [puppet] - 10https://gerrit.wikimedia.org/r/883680 (https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson) [19:44:37] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2423.mgmt.codfw.wmnet with reboot policy FORCED [19:44:53] (03CR) 10Bking: [V: 03+1 C: 03+2] Configure search platform airflow 2 instance [puppet] - 10https://gerrit.wikimedia.org/r/883680 
(https://phabricator.wikimedia.org/T327970) (owner: 10Ebernhardson) [19:45:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2422.mgmt.codfw.wmnet with reboot policy FORCED [19:46:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2423.mgmt.codfw.wmnet with reboot policy FORCED [19:46:17] (03PS1) 10Dzahn: phorge: list of apache modules needs to be an array [puppet] - 10https://gerrit.wikimedia.org/r/887392 (https://phabricator.wikimedia.org/T328595) [19:47:13] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887393 (https://phabricator.wikimedia.org/T325585) [19:47:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2422.mgmt.codfw.wmnet with reboot policy FORCED [19:47:15] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887393 (https://phabricator.wikimedia.org/T325585) (owner: 10TrainBranchBot) [19:47:17] (03CR) 10Dzahn: [C: 03+2] phorge: list of apache modules needs to be an array [puppet] - 10https://gerrit.wikimedia.org/r/887392 (https://phabricator.wikimedia.org/T328595) (owner: 10Dzahn) [19:47:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2423.mgmt.codfw.wmnet with reboot policy FORCED [19:47:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2424.mgmt.codfw.wmnet with reboot policy FORCED [19:47:52] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887393 (https://phabricator.wikimedia.org/T325585) (owner: 10TrainBranchBot) [19:48:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2425.mgmt.codfw.wmnet with reboot policy FORCED [19:52:10] (03CR) 10Kosta Harlan: [C: 03+2] labs: Remove unneeded GEUseNewImpactModule feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/887271 
(owner: 10Kosta Harlan) [19:53:37] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:53:53] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host an-airflow1005.eqiad.wmnet [19:53:54] !log bking@cumin1001 START - Cookbook sre.dns.netbox [19:54:07] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2424.mgmt.codfw.wmnet with reboot policy FORCED [19:55:15] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.22 refs T325585 [19:55:18] T325585: 1.40.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T325585 [19:55:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2424.mgmt.codfw.wmnet with reboot policy FORCED [19:56:38] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM an-airflow1005.eqiad.wmnet - bking@cumin1001" [19:57:40] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM an-airflow1005.eqiad.wmnet - bking@cumin1001" [19:57:40] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:57:40] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache an-airflow1005.eqiad.wmnet on all recursors [19:57:44] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-airflow1005.eqiad.wmnet on all recursors [19:58:49] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2425.mgmt.codfw.wmnet with reboot policy FORCED [19:59:04] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2424.mgmt.codfw.wmnet with reboot policy FORCED [20:00:22] (03PS1) 10Jdlrobson: Disable languages on history page 
[skins/Vector] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887353 (https://phabricator.wikimedia.org/T328996) [20:04:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2424.mgmt.codfw.wmnet with reboot policy FORCED [20:08:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2425.mgmt.codfw.wmnet with reboot policy FORCED [20:09:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2424.mgmt.codfw.wmnet with reboot policy FORCED [20:13:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2425.mgmt.codfw.wmnet with reboot policy FORCED [20:13:41] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) [20:15:05] (03PS1) 10Andrew Bogott: Add cinder-volume nodes to cinder grant and fw rules. [puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729) [20:15:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:24] (03CR) 10CI reject: [V: 04-1] Add cinder-volume nodes to cinder grant and fw rules. [puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [20:17:39] (03PS2) 10Andrew Bogott: Add cinder-volume nodes to cinder grant and fw rules. [puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729) [20:17:59] (03CR) 10CI reject: [V: 04-1] Add cinder-volume nodes to cinder grant and fw rules. 
[puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [20:20:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:48] (03PS3) 10Andrew Bogott: Add cinder-volume nodes to cinder grant and fw rules. [puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729) [20:21:43] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1051.eqiad.wmnet with OS bullseye [20:21:54] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2050.codfw.wmnet with OS bullseye [20:22:36] (03CR) 10CI reject: [V: 04-1] Add cinder-volume nodes to cinder grant and fw rules. [puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [20:23:45] (03PS1) 10Dzahn: phorge: add minimal apache site, add class parameters for docroot et al [puppet] - 10https://gerrit.wikimedia.org/r/887397 (https://phabricator.wikimedia.org/T328595) [20:24:50] (03PS2) 10Dzahn: phorge: add minimal apache site, add class parameters for docroot et al [puppet] - 10https://gerrit.wikimedia.org/r/887397 (https://phabricator.wikimedia.org/T328595) [20:25:45] PROBLEM - MegaRAID on an-worker1091 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:26:19] (03PS4) 10Andrew Bogott: Add cinder-volume nodes to cinder grant and fw rules. 
[puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729) [20:27:21] (03PS2) 10Ottomata: Fix android session schema path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 (owner: 10Sharvaniharan) [20:27:45] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:27:47] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:29:55] (03CR) 10Andrew Bogott: [C: 03+2] Add cinder-volume nodes to cinder grant and fw rules. [puppet] - 10https://gerrit.wikimedia.org/r/887395 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [20:29:57] (03CR) 10Dzahn: [C: 03+2] phorge: add minimal apache site, add class parameters for docroot et al [puppet] - 10https://gerrit.wikimedia.org/r/887397 (https://phabricator.wikimedia.org/T328595) (owner: 10Dzahn) [20:30:38] (03CR) 10Sharvaniharan: Fix android session schema path (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 (owner: 10Sharvaniharan) [20:33:25] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1051.eqiad.wmnet with reason: host reimage [20:33:35] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:36:35] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1051.eqiad.wmnet with reason: host reimage [20:37:47] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10BTullis) @jbond - Many thanks. 
That's excellent. I think I'd be keen to look at doing that and helping find out the issues. I've asked the #data-engineering team so I'll get back to you in a coup... [20:38:04] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2050.codfw.wmnet with reason: host reimage [20:39:51] (03CR) 10Ottomata: Fix android session schema path (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 (owner: 10Sharvaniharan) [20:40:24] (03CR) 10Ottomata: [C: 03+2] Fix android session schema path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 (owner: 10Sharvaniharan) [20:41:10] (03Merged) 10jenkins-bot: Fix android session schema path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886995 (owner: 10Sharvaniharan) [20:41:13] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2050.codfw.wmnet with reason: host reimage [20:41:22] dancy: am i clear to deploy a mw config change? [20:42:14] I think so. Train was rolled forward about an hour ago [20:44:39] !log bking@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-airflow1005.eqiad.wmnet [20:45:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:22] (03PS1) 10Andrew Bogott: cinder-volume: make lvm volume group configurable via hiera [puppet] - 10https://gerrit.wikimedia.org/r/887403 (https://phabricator.wikimedia.org/T324729) [20:48:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:49:09] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[20:50:01] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1051.eqiad.wmnet with OS bullseye [20:52:14] (03PS1) 10Bking: search: add MAC entry for new an-airflow VM [puppet] - 10https://gerrit.wikimedia.org/r/887408 (https://phabricator.wikimedia.org/T327970) [20:53:33] (03PS1) 10Andrea Denisse: netmon: Add netmon1003 to the ganeti rapi nodes list [puppet] - 10https://gerrit.wikimedia.org/r/887409 (https://phabricator.wikimedia.org/T309074) [20:54:23] (03CR) 10Ryan Kemper: [C: 03+1] "1 excess newline but otherwise looks good" [puppet] - 10https://gerrit.wikimedia.org/r/887408 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [20:55:11] (03PS2) 10Bking: search: add MAC entry for new an-airflow VM [puppet] - 10https://gerrit.wikimedia.org/r/887408 (https://phabricator.wikimedia.org/T327970) [20:55:55] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:57:19] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2050.codfw.wmnet with OS bullseye [20:57:23] (03PS1) 10Dzahn: phorge: add httpd Directory snippet, git clone arcanist [puppet] - 10https://gerrit.wikimedia.org/r/887410 (https://phabricator.wikimedia.org/T328595) [20:57:45] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:58:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:58:14] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2051.codfw.wmnet with OS bullseye [20:59:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49567 bytes in 1.107 second 
response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:00:02] (03PS1) 10Andrew Bogott: Cinder-volume lvm: clarify backend type/name confusion [puppet] - 10https://gerrit.wikimedia.org/r/887411 (https://phabricator.wikimedia.org/T324729) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230207T2100). nyaa~ [21:00:04] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:23] I can deploy today [21:00:33] Jdlrobson: hi, around? [21:01:01] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1053.eqiad.wmnet with OS bullseye [21:02:20] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: wgEventSreams - Fix android session schema path (duration: 07m 26s) [21:03:01] ottomata: pro-tip, use `scap backport 886995` next time. it does everything for you (from merging to gerrit to deployment) [21:03:04] very convenient :) [21:03:06] urbanecm present [21:03:18] hi! 
[21:03:49] (03CR) 10Urbanecm: [C: 03+2] Disable languages on history page [skins/Vector] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887353 (https://phabricator.wikimedia.org/T328996) (owner: 10Jdlrobson) [21:03:55] (03CR) 10Urbanecm: [C: 03+2] Remove button styling from log in link [skins/Vector] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887351 (https://phabricator.wikimedia.org/T289212) (owner: 10Jdlrobson) [21:04:03] (03CR) 10Urbanecm: [C: 03+2] [followup] mediawiki.feedlink: Atom's link icon overlaps the link [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887350 (https://phabricator.wikimedia.org/T327717) (owner: 10Jdlrobson) [21:04:52] :) [21:05:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:44] (03CR) 10Cwhite: "This seems reasonable. Do we have alarms in place with paging for majority up? (e.g. 
2 of 3 down issues a page)" [puppet] - 10https://gerrit.wikimedia.org/r/887342 (owner: 10Herron) [21:07:09] (03CR) 10Cwhite: [C: 03+1] hieradata: don't page for thanos-web [puppet] - 10https://gerrit.wikimedia.org/r/887325 (owner: 10Filippo Giunchedi) [21:07:26] (03CR) 10Urbanecm: [C: 03+2] [followup] mediawiki.feedlink: Atom's link icon overlaps the link (031 comment) [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887350 (https://phabricator.wikimedia.org/T327717) (owner: 10Jdlrobson) [21:07:46] (03PS2) 10Andrew Bogott: cinder-volume: make lvm volume group configurable via hiera [puppet] - 10https://gerrit.wikimedia.org/r/887403 (https://phabricator.wikimedia.org/T324729) [21:07:48] (03PS2) 10Andrew Bogott: Cinder-volume lvm: clarify backend type/name confusion [puppet] - 10https://gerrit.wikimedia.org/r/887411 (https://phabricator.wikimedia.org/T324729) [21:08:00] 10SRE, 10Traffic, 10Data Pipelines (Sprint 08): Document Impact of Jan 8&9 Traffic Data Loss - https://phabricator.wikimedia.org/T326658 (10EChetty) [21:08:27] (03CR) 10Cwhite: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/887409 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [21:08:33] (03PS2) 10Dzahn: phorge: add httpd Directory snippet, git clone arcanist [puppet] - 10https://gerrit.wikimedia.org/r/887410 (https://phabricator.wikimedia.org/T328595) [21:08:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887353 (https://phabricator.wikimedia.org/T328996) (owner: 10Jdlrobson) [21:08:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887351 (https://phabricator.wikimedia.org/T289212) (owner: 10Jdlrobson) [21:09:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887350 (https://phabricator.wikimedia.org/T327717) (owner: 10Jdlrobson) [21:09:07] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:09:53] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:10:16] (03CR) 10Cwhite: [C: 03+2] logstash: extract tcp flags from ulogd logs [puppet] - 10https://gerrit.wikimedia.org/r/884111 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite) [21:10:19] (03CR) 10Andrew Bogott: [C: 03+2] cinder-volume: make lvm volume group configurable via hiera [puppet] - 10https://gerrit.wikimedia.org/r/887403 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [21:11:41] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 
208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:11:57] (03CR) 10Dzahn: [C: 03+2] phorge: add httpd Directory snippet, git clone arcanist [puppet] - 10https://gerrit.wikimedia.org/r/887410 (https://phabricator.wikimedia.org/T328595) (owner: 10Dzahn) [21:12:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2426.mgmt.codfw.wmnet with reboot policy FORCED [21:12:47] 10ops-eqiad, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q2): Decommission netmon1002 - https://phabricator.wikimedia.org/T322321 (10andrea.denisse) a:05andrea.denisse→03None [21:12:52] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1053.eqiad.wmnet with reason: host reimage [21:13:12] 10ops-codfw, 10SRE Observability (FY2022/2023-Q3): Decommission netmon2001 - https://phabricator.wikimedia.org/T322695 (10andrea.denisse) [21:13:25] 10ops-codfw, 10SRE Observability (FY2022/2023-Q3): Decommission netmon2001 - https://phabricator.wikimedia.org/T322695 (10andrea.denisse) a:05andrea.denisse→03None [21:14:04] 10ops-codfw, 10SRE Observability (FY2022/2023-Q3): Decommission netmon2001 - https://phabricator.wikimedia.org/T322695 (10andrea.denisse) 05Resolved→03Open [21:14:20] 10ops-eqiad, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q2): Decommission netmon1002 - https://phabricator.wikimedia.org/T322321 (10andrea.denisse) 05Resolved→03Open [21:14:34] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2051.codfw.wmnet with reason: host reimage [21:15:30] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1053.eqiad.wmnet with reason: host reimage [21:16:31] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:17:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2427.mgmt.codfw.wmnet with reboot policy FORCED [21:18:35] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2051.codfw.wmnet with reason: host reimage [21:18:47] (03Merged) 10jenkins-bot: Disable languages on history page [skins/Vector] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887353 (https://phabricator.wikimedia.org/T328996) (owner: 10Jdlrobson) [21:19:39] (03PS3) 10Andrew Bogott: Cinder-volume lvm: clarify backend type/name confusion [puppet] - 10https://gerrit.wikimedia.org/r/887411 (https://phabricator.wikimedia.org/T324729) [21:19:55] (03Merged) 10jenkins-bot: Remove button styling from log in link [skins/Vector] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887351 (https://phabricator.wikimedia.org/T289212) (owner: 10Jdlrobson) [21:20:00] (03Merged) 10jenkins-bot: [followup] mediawiki.feedlink: Atom's link icon overlaps the link [core] (wmf/1.40.0-wmf.22) - 10https://gerrit.wikimedia.org/r/887350 (https://phabricator.wikimedia.org/T327717) (owner: 10Jdlrobson) [21:20:28] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:887353|Disable languages on history page (T328996)]], [[gerrit:887351|Remove button styling from log in link (T289212)]], [[gerrit:887350|[followup] mediawiki.feedlink: Atom's link icon overlaps the link (T327717)]] [21:20:34] T327717: Page tools: Support icon for Atom link (was Atom's link icon overlaps the link) - https://phabricator.wikimedia.org/T327717 [21:20:35] T289212: Feature request: Login button doesn't appear besides the create account text with new compact user menu - https://phabricator.wikimedia.org/T289212 [21:20:35] T328996: [Regression] History page help icon moved out of the header - https://phabricator.wikimedia.org/T328996 [21:21:03] !log pt1979@cumin2002 END (FAIL) - Cookbook 
sre.hosts.provision (exit_code=99) for host mw2426.mgmt.codfw.wmnet with reboot policy FORCED [21:21:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2426.mgmt.codfw.wmnet with reboot policy FORCED [21:22:20] !log urbanecm@deploy1002 urbanecm and jdlrobson: Backport for [[gerrit:887353|Disable languages on history page (T328996)]], [[gerrit:887351|Remove button styling from log in link (T289212)]], [[gerrit:887350|[followup] mediawiki.feedlink: Atom's link icon overlaps the link (T327717)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [21:22:33] (03PS4) 10Andrew Bogott: Cinder-volume lvm: clarify backend type/name confusion [puppet] - 10https://gerrit.wikimedia.org/r/887411 (https://phabricator.wikimedia.org/T324729) [21:22:34] Jdlrobson: all three backports are at debug servers. can you check them please? [21:22:43] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2427.mgmt.codfw.wmnet with reboot policy FORCED [21:22:47] looking now [21:22:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:24:38] (03CR) 10Andrew Bogott: [C: 03+2] Cinder-volume lvm: clarify backend type/name confusion [puppet] - 10https://gerrit.wikimedia.org/r/887411 (https://phabricator.wikimedia.org/T324729) (owner: 10Andrew Bogott) [21:24:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2427.mgmt.codfw.wmnet with reboot policy FORCED [21:24:51] (03CR) 10Bking: [C: 03+2] search: add MAC entry for new an-airflow VM [puppet] - 10https://gerrit.wikimedia.org/r/887408 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [21:25:46] urbanecm: all 3 look good to me [21:25:50] syncing [21:26:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK 
- running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:26:20] Superpes: hi! let me know once you added your patch to the calendar too :) [21:26:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2427.mgmt.codfw.wmnet with reboot policy FORCED [21:29:06] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1053.eqiad.wmnet with OS bullseye [21:29:35] (03PS3) 10Superpes15: Install WikiLove extension on bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886416 (https://phabricator.wikimedia.org/T328834) [21:31:38] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:887353|Disable languages on history page (T328996)]], [[gerrit:887351|Remove button styling from log in link (T289212)]], [[gerrit:887350|[followup] mediawiki.feedlink: Atom's link icon overlaps the link (T327717)]] (duration: 11m 10s) [21:31:43] Jdlrobson: all three live :) [21:31:44] T327717: Page tools: Support icon for Atom link (was Atom's link icon overlaps the link) - https://phabricator.wikimedia.org/T327717 [21:31:44] T289212: Feature request: Login button doesn't appear besides the create account text with new compact user menu - https://phabricator.wikimedia.org/T289212 [21:31:44] T328996: [Regression] History page help icon moved out of the header - https://phabricator.wikimedia.org/T328996 [21:32:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886416 (https://phabricator.wikimedia.org/T328834) (owner: 10Superpes15) [21:32:14] Superpes: let's get started :) [21:32:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2426.mgmt.codfw.wmnet with reboot policy FORCED [21:32:27] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, 
unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:32:34] urbanecm Yep! Thanks :) [21:32:48] (03Merged) 10jenkins-bot: Install WikiLove extension on bnwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886416 (https://phabricator.wikimedia.org/T328834) (owner: 10Superpes15) [21:32:49] thanks urbanecm double checking as we speak [21:33:14] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:886416|Install WikiLove extension on bnwikiquote (T328834)]] [21:33:17] T328834: Enable WikiLove extension on bnwikiquote - https://phabricator.wikimedia.org/T328834 [21:33:20] !log Create extension tables for Wikilove on bnwikiquote (T328834) [21:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:49] (03PS10) 10Urbanecm: Change the trwiki logo with a temporary one (old vector) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886983 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [21:34:29] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2051.codfw.wmnet with OS bullseye [21:35:03] !log urbanecm@deploy1002 superpes and urbanecm: Backport for [[gerrit:886416|Install WikiLove extension on bnwikiquote (T328834)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:35:12] Superpes: can you check it at mwdebug1001 please? :) [21:35:18] (it=the wikilove patch) [21:36:30] Oh yep [21:38:18] let me know how it goes :) [21:38:28] urbanecm Lol I cannot login there [21:39:29] Superpes: you need x-wikimedia-debug [21:39:30] Superpes: what do you mean, please? 
You can load a wiki via mwdebug1001 (or other debug servers) by installing https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_extensions and enabling it [21:39:42] (03CR) 10Herron: kafka-logging: don't page on individual broker down (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887342 (owner: 10Herron) [21:39:51] Looking good urbanecm thanks for your help as always! [21:39:57] then, you can load bnwikiquote using a staging environment at the debug server, and verify the change works as intended. [21:39:59] Jdlrobson: happy to help! [21:40:06] Yep I'm using WikimediaDebug urbanecm but I can't see if the WikiLove extension actually works :/ [21:40:50] can you clarify why please? :) [21:41:35] urbanecm: maybe I can help with testing if needed? [21:42:07] i can test it as well (it appears to work for me), I'm trying to understand Superpes's issues, so they can test future patches too. [21:42:11] thanks for the offer though :) [21:42:33] no problem, happy to help [21:42:41] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:42:46] I'm proceeding with the sync, as the extension appears to work. still happy to help with the testing issue though :) [21:42:48] urbanecm Uhm the problem is that I don't see any wikilove tool via IP (and can't login) [21:42:56] ah! 
you meant login on the wiki [21:42:59] Thanks urbanecm let's talk later about it [21:43:02] gotcha [21:43:02] Yep [21:43:38] (03CR) 10Urbanecm: [C: 03+2] Change the trwiki logo with a temporary one (old vector) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886983 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [21:44:09] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:44:22] (03Merged) 10jenkins-bot: Change the trwiki logo with a temporary one (old vector) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886983 (https://phabricator.wikimedia.org/T329047) (owner: 10Superpes15) [21:48:47] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:886416|Install WikiLove extension on bnwikiquote (T328834)]] (duration: 15m 32s) [21:48:51] T328834: Enable WikiLove extension on bnwikiquote - https://phabricator.wikimedia.org/T328834 [21:48:56] Superpes: the patch's live now [21:49:21] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:49:27] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:886983|Change the trwiki logo with a temporary one (old vector) (T329047)]] [21:49:31] T329047: Temporary logo change on the Turkish Wikipedia - https://phabricator.wikimedia.org/T329047 [21:49:37] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:49:42] Ok thanks :) [21:51:14] !log urbanecm@deploy1002 superpes and urbanecm: Backport for [[gerrit:886983|Change the trwiki logo with a temporary one (old vector) (T329047)]] 
synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:51:34] Superpes: trwiki patch's at mwdebug1001. can you check this one please? or do you want me to help too? [21:51:37] (03PS1) 10Cwhite: Add network.tcp_flags mapping and docs [software/ecs] - 10https://gerrit.wikimedia.org/r/886853 (https://phabricator.wikimedia.org/T325806) [21:51:39] (03PS1) 10Volans: sre.hosts.provision: add sleep for race condition [cookbooks] - 10https://gerrit.wikimedia.org/r/887416 [21:51:50] urbanecm The logo is without a wordmark... So only the image remains without text :/ I don't know if they expected this! The rest seems ok [21:53:03] Superpes: the wordmark's not present in the png file, so it's not included. i guess that's an error in the file, is that right? [21:53:32] Yep! They probably didn't think about inserting a logo with text, so I don't think it will be a problem, but objectively it is not the best choice... But I added the logo they posted ;) [21:53:51] so, should we go ahead? or revert? [21:53:53] So nothing wrong from my side! [21:54:00] okay, proceeding [21:54:01] (03CR) 10Volans: [C: 03+2] "trivial, self-merging to check if this solves the problem or not" [cookbooks] - 10https://gerrit.wikimedia.org/r/887416 (owner: 10Volans) [21:54:32] (03PS8) 10Urbanecm: Allow AbuseFilter to block IPs and users on itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884333 (https://phabricator.wikimedia.org/T328194) (owner: 10Superpes15) [21:54:44] Yep thanks ;) I'll ask them in the task if they want another image with text! 
[21:54:55] sounds good [21:55:54] (03Merged) 10jenkins-bot: sre.hosts.provision: add sleep for race condition [cookbooks] - 10https://gerrit.wikimedia.org/r/887416 (owner: 10Volans) [21:56:16] (03CR) 10Urbanecm: [C: 03+2] Allow AbuseFilter to block IPs and users on itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884333 (https://phabricator.wikimedia.org/T328194) (owner: 10Superpes15) [21:56:56] (03Merged) 10jenkins-bot: Allow AbuseFilter to block IPs and users on itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884333 (https://phabricator.wikimedia.org/T328194) (owner: 10Superpes15) [21:57:04] (03PS2) 10Cwhite: Add network.tcp_flags mapping and docs [software/ecs] - 10https://gerrit.wikimedia.org/r/886853 (https://phabricator.wikimedia.org/T325806) [21:59:15] (03PS1) 10Cwhite: logstash: update ecs to 1.11.0-6 [puppet] - 10https://gerrit.wikimedia.org/r/886854 (https://phabricator.wikimedia.org/T325806) [21:59:47] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:886983|Change the trwiki logo with a temporary one (old vector) (T329047)]] (duration: 10m 20s) [21:59:50] T329047: Temporary logo change on the Turkish Wikipedia - https://phabricator.wikimedia.org/T329047 [21:59:59] second patch's live [22:00:16] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:884333|Allow AbuseFilter to block IPs and users on itwikiversity (T328194)]] [22:00:19] T328194: Enable AbuseFilter blocks on itwikiversity - https://phabricator.wikimedia.org/T328194 [22:01:58] !log urbanecm@deploy1002 urbanecm and superpes: Backport for [[gerrit:884333|Allow AbuseFilter to block IPs and users on itwikiversity (T328194)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [22:02:28] Superpes: i guess this is hard for you to test, right? 
[22:02:31] @urbanec It works properly :) [22:02:43] * urbanecm [22:02:48] okay, great [22:04:43] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) [22:06:59] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "provision new Ganeti VM an-airflow1005 - bking@cumin1001 - T327970" [22:07:02] T327970: Create airflow v2 instance and supporting repos for search platform - https://phabricator.wikimedia.org/T327970 [22:07:40] (03PS2) 10Cwhite: logstash: update ecs to 1.11.0-6 [puppet] - 10https://gerrit.wikimedia.org/r/886854 (https://phabricator.wikimedia.org/T325806) [22:08:39] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:884333|Allow AbuseFilter to block IPs and users on itwikiversity (T328194)]] (duration: 08m 23s) [22:08:41] (03CR) 10Cwhite: [C: 03+2] Add network.tcp_flags mapping and docs [software/ecs] - 10https://gerrit.wikimedia.org/r/886853 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite) [22:08:42] T328194: Enable AbuseFilter blocks on itwikiversity - https://phabricator.wikimedia.org/T328194 [22:08:49] Superpes: and, third patch live. [22:09:09] urbanecm Thanks for your time and for support :D [22:09:11] (03Merged) 10jenkins-bot: Add network.tcp_flags mapping and docs [software/ecs] - 10https://gerrit.wikimedia.org/r/886853 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite) [22:09:15] happy to help! 
[22:09:59] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "provision new Ganeti VM an-airflow1005 - bking@cumin1001 - T327970" [22:10:34] (03CR) 10Cwhite: kafka-logging: don't page on individual broker down (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887342 (owner: 10Herron) [22:12:03] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:12:25] (03CR) 10Cwhite: [C: 03+2] logstash: update ecs to 1.11.0-6 [puppet] - 10https://gerrit.wikimedia.org/r/886854 (https://phabricator.wikimedia.org/T325806) (owner: 10Cwhite) [22:13:14] (03PS1) 10JHathaway: Add jaeger-es-index-cleaner [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) [22:14:48] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new mw nodes in B6 - pt1979@cumin2002" [22:14:49] (03CR) 10JHathaway: "kindly review" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/887417 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [22:15:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new mw nodes in B6 - pt1979@cumin2002" [22:15:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:15:51] (03CR) 10JHathaway: "Giuseppe would you please take another look at this updated patch when you have the time." 
[docker-images/production-images] - https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: JHathaway)
[22:16:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2428.mgmt.codfw.wmnet with reboot policy FORCED
[22:16:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2429.mgmt.codfw.wmnet with reboot policy FORCED
[22:20:12] SRE, Infrastructure-Foundations, netops, Patch-For-Review, and 2 others: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (taavi)
[22:26:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2428.mgmt.codfw.wmnet with reboot policy FORCED
[22:30:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2429.mgmt.codfw.wmnet with reboot policy FORCED
[22:31:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2430.mgmt.codfw.wmnet with reboot policy FORCED
[22:31:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2431.mgmt.codfw.wmnet with reboot policy FORCED
[22:33:21] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[22:41:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2431.mgmt.codfw.wmnet with reboot policy FORCED
[22:41:13] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[22:41:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2430.mgmt.codfw.wmnet with reboot policy FORCED
[22:42:15] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[22:43:23] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new mw nodes in B8 - pt1979@cumin2002"
[22:44:27] !log
pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new mw nodes in B8 - pt1979@cumin2002"
[22:44:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:45:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2432.mgmt.codfw.wmnet with reboot policy FORCED
[22:46:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2433.mgmt.codfw.wmnet with reboot policy FORCED
[22:53:21] RECOVERY - MegaRAID on db1155 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:56:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2433.mgmt.codfw.wmnet with reboot policy FORCED
[22:56:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2432.mgmt.codfw.wmnet with reboot policy FORCED
[22:59:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2434.mgmt.codfw.wmnet with reboot policy FORCED
[22:59:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2435.mgmt.codfw.wmnet with reboot policy FORCED
[23:03:26] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (Dwisehaupt) @Jclark-ctr Sorry for the delay on this. They weren't urgent and now the December fundraising is complete. You are clear to rack and cable...
[23:06:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2435.mgmt.codfw.wmnet with reboot policy FORCED
[23:06:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2434.mgmt.codfw.wmnet with reboot policy FORCED
[23:13:10] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[23:14:27] RECOVERY - MegaRAID on an-worker1091 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:22:46] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2420']
[23:23:58] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2421']
[23:29:29] (Abandoned) Aaron Schulz: Avoid udp2log for "objectcache" channel [mediawiki-config] - https://gerrit.wikimedia.org/r/712548 (https://phabricator.wikimedia.org/T288702) (owner: Aaron Schulz)
[23:30:05] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['mw2420']
[23:31:09] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['mw2421']
[23:32:35] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2422']
[23:32:52] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2423']
[23:45:54] (PS1) Dzahn: phorge: separate install pathes for phorge and arcanist [puppet] - https://gerrit.wikimedia.org/r/887429 (https://phabricator.wikimedia.org/T328595)
[23:46:07] (CR) CI reject: [V: -1] phorge: separate install pathes for phorge and arcanist [puppet] - https://gerrit.wikimedia.org/r/887429 (https://phabricator.wikimedia.org/T328595) (owner: Dzahn)
[23:47:50] (PS2) Dzahn: phorge: separate
install pathes for phorge and arcanist [puppet] - https://gerrit.wikimedia.org/r/887429 (https://phabricator.wikimedia.org/T328595)
[23:48:11] (CR) CI reject: [V: -1] phorge: separate install pathes for phorge and arcanist [puppet] - https://gerrit.wikimedia.org/r/887429 (https://phabricator.wikimedia.org/T328595) (owner: Dzahn)
[23:49:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['mw2422']
[23:50:06] (PS3) Dzahn: phorge: separate install pathes for phorge and arcanist [puppet] - https://gerrit.wikimedia.org/r/887429 (https://phabricator.wikimedia.org/T328595)
[23:51:16] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['mw2423']
[23:56:23] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2424']
[23:56:39] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2425']
[23:57:02] (CR) Dzahn: [C: +2] phorge: separate install pathes for phorge and arcanist [puppet] - https://gerrit.wikimedia.org/r/887429 (https://phabricator.wikimedia.org/T328595) (owner: Dzahn)
[23:57:14] (PS1) Dzahn: phorge: git clone arcanist also from we.phorge.it, not Phacility [puppet] - https://gerrit.wikimedia.org/r/887431 (https://phabricator.wikimedia.org/T328595)