[00:54:52] (CR) Eric Gardner: [C: +2] Revert "Prepare for MediaWiki UI version 2" [extensions/MultimediaViewer] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/705755 (owner: Jdlrobson)
[00:59:48] (Merged) jenkins-bot: Revert "Prepare for MediaWiki UI version 2" [extensions/MultimediaViewer] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/705755 (owner: Jdlrobson)
[01:10:27] (PS1) Razzi: netboot: make an-masters reimage without confirmation [puppet] - https://gerrit.wikimedia.org/r/705782 (https://phabricator.wikimedia.org/T278423)
[01:14:01] (PS2) Razzi: netboot: make an-masters reimage without confirmation [puppet] - https://gerrit.wikimedia.org/r/705782 (https://phabricator.wikimedia.org/T278423)
[01:38:31] RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[01:39:33] PROBLEM - Check systemd state on dumpsdata1003 is CRITICAL: CRITICAL - degraded: The following units failed: dumps-rsyncer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:59:19] (PS1) Marostegui: wmnet: Switch m3-master [dns] - https://gerrit.wikimedia.org/r/705789 (https://phabricator.wikimedia.org/T286065)
[05:00:43] (CR) Marostegui: [C: +2] wmnet: Switch m3-master [dns] - https://gerrit.wikimedia.org/r/705789 (https://phabricator.wikimedia.org/T286065) (owner: Marostegui)
[05:01:36] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui) I have switched m3-master from dbproxy1020 to dbproxy1016: https://gerrit.wikimedia.org/r/705789
[05:02:08] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui)
[05:17:29]
SRE, ops-eqiad, DBA: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (Marostegui) @Jclark-ctr did this disk arrive?
[05:40:52] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:27] SRE, serviceops, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (Marostegui) That would work for me too @wkandek - thanks!
[06:35:45] !log disable puppet on mc1* hosts and icinga - T271967
[06:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:53] T271967: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967
[06:41:21] (CR) Effie Mouzeli: [C: +2] hieradata: enable TLS on memcached eqiad hosts [puppet] - https://gerrit.wikimedia.org/r/702590 (https://phabricator.wikimedia.org/T271967) (owner: Effie Mouzeli)
[06:43:38] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:50:27] (CR) Filippo Giunchedi: [C: +2] pontoon: update puppetdb::microsite [puppet] - https://gerrit.wikimedia.org/r/705703 (owner: Filippo Giunchedi)
[06:50:32] (PS2) Filippo Giunchedi: pontoon: update puppetdb::microsite [puppet] - https://gerrit.wikimedia.org/r/705703
[06:51:14] !log enable puppet on mc* hosts
[06:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:36] !log restart memcached on eqiad mc* hosts
[06:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:44] (CR) Filippo Giunchedi: [C: +2] puppetdb: rename stockpile mount var [puppet] - https://gerrit.wikimedia.org/r/705702 (owner: Filippo Giunchedi)
[07:03:14] !log installing
systemd security updates on stretch
[07:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:38] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (MoritzMuehlenhoff)
[07:09:16] PROBLEM - very high load average likely xfs on ms-be2048 is CRITICAL: CRITICAL - load average: 269.44, 227.08, 135.22 https://wikitech.wikimedia.org/wiki/Swift
[07:12:10] PROBLEM - SSH on ms-be2048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:14:41] that's not an happy host
[07:16:04] !log powercycle ms-be2048
[07:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:06] PROBLEM - Host ms-be2048 is DOWN: PING CRITICAL - Packet loss = 100%
[07:19:40] RECOVERY - SSH on ms-be2048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:19:42] RECOVERY - Host ms-be2048 is UP: PING OK - Packet loss = 0%, RTA = 34.83 ms
[07:20:46] RECOVERY - very high load average likely xfs on ms-be2048 is OK: OK - load average: 21.22, 7.66, 2.75 https://wikitech.wikimedia.org/wiki/Swift
[07:21:04] PROBLEM - MD RAID on ms-be2048 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[07:21:05] ACKNOWLEDGEMENT - MD RAID on ms-be2048 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T287064 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[07:21:09] SRE, ops-codfw: Degraded RAID on ms-be2048 - https://phabricator.wikimedia.org/T287064 (ops-monitoring-bot)
[07:23:35] SRE, ops-codfw:
Degraded RAID on ms-be2048 - https://phabricator.wikimedia.org/T287064 (fgiunchedi) Open→Invalid Host rebooted from a lockup, I added the disk back
[07:38:03] Good morning. So I screwed up yesterday and forgot to run the train
[07:38:07] !log update RIS peer IP on cr2-codfw
[07:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:16] I am thus doing it this morning
[07:44:13] !log push extra sampling on cr1-eqiad - T286038
[07:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:20] T286038: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038
[07:44:24] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:44:36] * RhinosF1 gives hashar a coffee first
[07:45:08] (CR) Gergő Tisza: [C: +1] mediawiki/maintenance/growthexperiments.pp: Run updateMenteeData every day [puppet] - https://gerrit.wikimedia.org/r/704506 (https://phabricator.wikimedia.org/T285811) (owner: Urbanecm)
[07:47:30] RhinosF1: thank you!
[07:49:37] hashar: it's ok!
[07:52:55] (PS1) Hashar: Group0 to 1.37.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/705832 (https://phabricator.wikimedia.org/T281156)
[07:56:25] !log push extra sampling on cr2-eqiad - T286038
[07:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:32] T286038: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038
[07:58:05] (PS1) Ayounsi: Fix typo, xe-3/2/3 doesn't exist [homer/public] - https://gerrit.wikimedia.org/r/705833
[07:58:30] SRE, Infrastructure-Foundations, netops, Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (ayounsi) `lang=diff re0.cr2-eqiad# show | compare [edit interfaces xe-3/2/2 unit 0 family inet filter] + output sample-ac...
[07:59:31] (CR) Ayounsi: [C: +2] Fix typo, xe-3/2/3 doesn't exist [homer/public] - https://gerrit.wikimedia.org/r/705833 (owner: Ayounsi)
[07:59:36] RECOVERY - Ensure local MW versions match expected deployment on mw2384 is OK: OKAY: Not alerting due to fresh production wikiversions: 973 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[08:00:08] (Merged) jenkins-bot: Fix typo, xe-3/2/3 doesn't exist [homer/public] - https://gerrit.wikimedia.org/r/705833 (owner: Ayounsi)
[08:02:50] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[08:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:38] SRE, Infrastructure-Foundations, netops, Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (ayounsi) Talked to @fgiunchedi on IRC, let us know when to rollback. Ideally before the end of the week so we don't keep "hacks"...
[08:05:44] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-openstack-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:05:57] !log hashar@deploy1002 Pruned MediaWiki: 1.37.0-wmf.11 (duration: 11m 51s)
[08:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:12] !log hashar@deploy1002 Pruned MediaWiki: 1.37.0-wmf.12 (duration: 01m 35s)
[08:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:57] SRE, Infrastructure-Foundations, netops, Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (fgiunchedi) Thank you @ayounsi @cmooney ! Could we keep the sampling for a week straight ? I understand if you are not comforta...
[08:13:51] (PS1) Hashar: Convert BlockUtils::parseBlockTarget to UserIdentity [core] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/705758 (https://phabricator.wikimedia.org/T286490)
[08:15:00] !log hashar@deploy1002 Started scap: testwiki to php-1.37.0-wmf.15 and rebuild l10n cache # T281156
[08:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:09] T281156: 1.37.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T281156
[08:15:40] RECOVERY - MD RAID on ms-be2048 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:17:29] !log enable puppet on alert*
[08:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:48] SRE, serviceops, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Jelto)
[08:22:14] SRE, serviceops, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production -
https://phabricator.wikimedia.org/T279309 (Jelto)
[08:23:29] (PS1) Jgiannelos: tegola-vector-tiles: Configure region for s3/swift requests [deployment-charts] - https://gerrit.wikimedia.org/r/705835
[08:25:01] (CR) Effie Mouzeli: [C: +1] tegola-vector-tiles: Configure region for s3/swift requests [deployment-charts] - https://gerrit.wikimedia.org/r/705835 (owner: Jgiannelos)
[08:25:47] SRE, Infrastructure-Foundations, netops, Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (fgiunchedi)
[08:26:13] (CR) Jgiannelos: [C: +2] tegola-vector-tiles: Configure region for s3/swift requests [deployment-charts] - https://gerrit.wikimedia.org/r/705835 (owner: Jgiannelos)
[08:27:23] SRE, Infrastructure-Foundations, netops, Datacenter-Switchover, User-fgiunchedi: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (fgiunchedi)
[08:28:39] (Merged) jenkins-bot: tegola-vector-tiles: Configure region for s3/swift requests [deployment-charts] - https://gerrit.wikimedia.org/r/705835 (owner: Jgiannelos)
[08:31:30] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 18 hosts with reason: Deploying schema change to s1 T281058
[08:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:37] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 18 hosts with reason: Deploying schema change to s1 T281058
[08:31:37] T281058: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058
[08:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:56] !log upgrade karma on alert hosts - T284213
[08:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:02] T284213: Improve AlertManager dashboard -
https://phabricator.wikimedia.org/T284213
[08:34:03] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (MoritzMuehlenhoff)
[08:35:23] (PS1) Jgiannelos: tegola-vector-tiles: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/705836
[08:35:41] (PS1) Filippo Giunchedi: alertmanager: hide 'severity' label name in grid [puppet] - https://gerrit.wikimedia.org/r/705837 (https://phabricator.wikimedia.org/T284213)
[08:36:01] SRE, Infrastructure-Foundations, netops: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[08:36:18] SRE, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney) Open→Resolved
[08:36:24] (CR) Effie Mouzeli: [C: +1] tegola-vector-tiles: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/705836 (owner: Jgiannelos)
[08:36:36] RECOVERY - Check systemd state on db2116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:48] (CR) Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/compiler1001/30279/" [puppet] - https://gerrit.wikimedia.org/r/702592 (https://phabricator.wikimedia.org/T271967) (owner: Effie Mouzeli)
[08:38:32] (CR) Jgiannelos: [C: +2] tegola-vector-tiles: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/705836 (owner: Jgiannelos)
[08:40:58] (Merged) jenkins-bot: tegola-vector-tiles: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/705836 (owner: Jgiannelos)
[08:44:21] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[08:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:27] !log disble puppet on codfw mw hosts to deploy 702592
[08:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:30] SRE, MW-on-K8s, serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (Joe) p:Triage→Medium a:Joe Access logs and other logs are easy to collect as well. See: - https://istio.io/latest/docs/tasks/observability/logs/ - https://istio.io/l...
[08:49:44] (CR) Giuseppe Lavagetto: [C: +1] switchdc: remove thanos from excluded services [cookbooks] - https://gerrit.wikimedia.org/r/705349 (https://phabricator.wikimedia.org/T285273) (owner: Filippo Giunchedi)
[08:50:28] (PS1) Muehlenhoff: Disable unprivileged user namespaces on Bullseye and Buster hosts with 5.10 [puppet] - https://gerrit.wikimedia.org/r/705838
[08:50:38] (CR) Filippo Giunchedi: [C: +2] switchdc: remove thanos from excluded services [cookbooks] - https://gerrit.wikimedia.org/r/705349 (https://phabricator.wikimedia.org/T285273) (owner: Filippo Giunchedi)
[08:50:43] (PS3) Filippo Giunchedi: switchdc: remove thanos from excluded services [cookbooks] - https://gerrit.wikimedia.org/r/705349 (https://phabricator.wikimedia.org/T285273)
[08:51:40] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[08:52:05] (CR) jerkins-bot: [V: -1] Disable unprivileged user namespaces on Bullseye and Buster hosts with 5.10 [puppet] - https://gerrit.wikimedia.org/r/705838 (owner: Muehlenhoff)
[08:52:49] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[08:53:55] (PS2) Filippo Giunchedi: profile: restart postgres on first install / bootstrap
[puppet] - https://gerrit.wikimedia.org/r/705704
[08:53:58] (PS2) Filippo Giunchedi: puppetdb: wait for stockpile initialization [puppet] - https://gerrit.wikimedia.org/r/705705
[08:53:59] (PS2) Filippo Giunchedi: puppetdb: set permissions post-mount [puppet] - https://gerrit.wikimedia.org/r/705706
[08:56:48] (CR) Ayounsi: [C: +1] alertmanager: hide 'severity' label name in grid [puppet] - https://gerrit.wikimedia.org/r/705837 (https://phabricator.wikimedia.org/T284213) (owner: Filippo Giunchedi)
[08:59:44] (PS2) Muehlenhoff: Disable unprivileged user namespaces on Bullseye and Buster hosts with 5.10 [puppet] - https://gerrit.wikimedia.org/r/705838
[09:00:27] (PS1) Jbond: changelog: add entry related to delegated authenentication [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/705839
[09:00:52] !log hashar@deploy1002 Finished scap: testwiki to php-1.37.0-wmf.15 and rebuild l10n cache # T281156 (duration: 45m 51s)
[09:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:59] T281156: 1.37.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T281156
[09:01:13] SRE, DBA, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[09:01:36] SRE, DBA, Infrastructure-Foundations, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[09:01:38] (CR) Jbond: [V: +2 C: +2] changelog: add entry related to delegated authenentication [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/705839 (owner: Jbond)
[09:01:50] SRE, Thumbor, serviceops: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (Jelto) During the refresh of old mw app servers in eqiad we noticed that thumbor machines `thumbor1001` and `thumbor1002` are renamed/reimaged mw hosts.
As mentioned in T280203 and T233196 these machi...
[09:02:21] SRE, DBA, Infrastructure-Foundations, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[09:05:11] (PS1) Hashar: group0 wikis to 1.37.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/705840
[09:05:13] (CR) Hashar: [C: +2] group0 wikis to 1.37.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/705840 (owner: Hashar)
[09:05:52] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[09:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:03] (Merged) jenkins-bot: group0 wikis to 1.37.0-wmf.15 [mediawiki-config] - https://gerrit.wikimedia.org/r/705840 (owner: Hashar)
[09:07:14] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.15
[09:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:37] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/705838 (owner: Muehlenhoff)
[09:11:03] SRE, Datacenter-Switchover, SRE Observability (FY2021/2022-Q1), User-fgiunchedi: Switchover thanos-query and thanos-swift services as part of DC switchover - https://phabricator.wikimedia.org/T285273 (fgiunchedi) {{done}}
[09:11:19] SRE, Datacenter-Switchover, SRE Observability (FY2021/2022-Q1), User-fgiunchedi: Switchover thanos-query and thanos-swift services as part of DC switchover - https://phabricator.wikimedia.org/T285273 (fgiunchedi) Open→Resolved
[09:14:27] (CR) Daniel Kinzler: [C: -2] "CR-2 per Petr'S comment on the ticket: "However, this is a deprecation warning with a fix that is more risky then the problem.
Definitely " [core] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/705758 (https://phabricator.wikimedia.org/T286490) (owner: Hashar)
[09:15:15] (CR) Filippo Giunchedi: [C: +2] alertmanager: hide 'severity' label name in grid [puppet] - https://gerrit.wikimedia.org/r/705837 (https://phabricator.wikimedia.org/T284213) (owner: Filippo Giunchedi)
[09:16:21] (CR) Arturo Borrero Gonzalez: [C: +1] "as far as I know, neither cloudgw or cloudnet servers use unprivileged netns. They are privileged (as in, created by root)." [puppet] - https://gerrit.wikimedia.org/r/705838 (owner: Muehlenhoff)
[09:23:37] (CR) Filippo Giunchedi: "LGTM, see inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: Jelto)
[09:31:40] SRE, Performance-Team, SRE-swift-storage, Traffic, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (fgiunchedi) In light of longer than expected lead time for new ms-be hardware (T284953) I'd like to explore again the object expi...
[09:33:55] (CR) Elukey: "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/705838 (owner: Muehlenhoff)
[09:34:02] (CR) Elukey: [C: +1] Disable unprivileged user namespaces on Bullseye and Buster hosts with 5.10 [puppet] - https://gerrit.wikimedia.org/r/705838 (owner: Muehlenhoff)
[09:34:48] !log restart db2097 T287072
[09:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:55] T287072: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072
[09:35:27] SRE, Thumbor, serviceops: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (jijiki) @Jelto this work is currently stalled, but T285477 is created to accommodate the olderst 2 thumbor hosts.
[09:36:51] (Abandoned) Hashar: Convert BlockUtils::parseBlockTarget to UserIdentity [core] (wmf/1.37.0-wmf.15) - https://gerrit.wikimedia.org/r/705758 (https://phabricator.wikimedia.org/T286490) (owner: Hashar)
[09:47:19] (PS11) Jelto: prometheus::ops add jobs and ferm rule to scrape gitlab metrics [puppet] - https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170)
[09:50:07] Push FW rules to pfw3-codfw - T287038
[09:50:30] (CR) Filippo Giunchedi: [C: +1] "Ship it!" [puppet] - https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: Jelto)
[09:53:16] Push FW rules to pfw3-eqiad - T287038
[09:53:42] ops-codfw, DBA, Data-Persistence-Backup, database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo) a:jcrespo→Papaul As expected, the fauly memory module is only properly detected on reboot. ` free -g tota...
[09:56:50] XioNoX: FYI no !log
[09:57:10] er
[09:57:20] !log Pushed FW rules to pfw3-eqiad/codfw - T287038
[09:57:23] thx
[09:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:24] (CR) Effie Mouzeli: [C: +2] hieradata: replace mcrouter proxies in with eqiad hosts [puppet] - https://gerrit.wikimedia.org/r/702592 (https://phabricator.wikimedia.org/T271967) (owner: Effie Mouzeli)
[09:59:38] sure
[10:04:45] (PS1) Muehlenhoff: Fix the autostart logic for the systemd preset config file [puppet] - https://gerrit.wikimedia.org/r/705847
[10:06:57] (PS1) Arturo Borrero Gonzalez: toolforge: jobs_framework_cli: introduce configuration file [puppet] - https://gerrit.wikimedia.org/r/705848
[10:07:44] (PS1) Jgiannelos: Revert "Temporary log all s3 SDK requests/responses" [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/705759
[10:10:27] (PS2) Arturo Borrero Gonzalez: toolforge: jobs_framework_cli: introduce configuration file [puppet]
- https://gerrit.wikimedia.org/r/705848
[10:12:48] ccccccnuubgneldrkvfkflgkdngcelkddvjvlijjeulk
[10:12:56] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: jobs_framework_cli: introduce configuration file [puppet] - https://gerrit.wikimedia.org/r/705848 (owner: Arturo Borrero Gonzalez)
[10:14:34] !log enable puppet on mw* servers
[10:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:56] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/30283/" [puppet] - https://gerrit.wikimedia.org/r/705847 (owner: Muehlenhoff)
[10:16:22] jelto: I fully agree with the sentence
[10:17:35] PROBLEM - Ensure local MW versions match expected deployment on mw2384 is CRITICAL: CRITICAL: 976 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[10:17:46] ^ fixing
[10:19:44] (CR) Jbond: Fix the autostart logic for the systemd preset config file (2 comments) [puppet] - https://gerrit.wikimedia.org/r/705847 (owner: Muehlenhoff)
[10:22:06] (PS2) Muehlenhoff: Fix the autostart logic for the systemd preset config file [puppet] - https://gerrit.wikimedia.org/r/705847
[10:22:08] (CR) Muehlenhoff: Fix the autostart logic for the systemd preset config file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/705847 (owner: Muehlenhoff)
[10:22:33] SRE, DBA, Infrastructure-Foundations, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (hnowlan)
[10:22:43] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/705847 (owner: Muehlenhoff)
[10:23:27] RECOVERY - Ensure local MW versions match expected deployment on mw2384 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[10:27:49] SRE, DBA, Infrastructure-Foundations, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032
(MoritzMuehlenhoff)
[10:28:35] SRE, DBA, Infrastructure-Foundations, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[10:29:38] (CR) Muehlenhoff: "Updated PCC: https://puppet-compiler.wmflabs.org/compiler1001/30284/" [puppet] - https://gerrit.wikimedia.org/r/705847 (owner: Muehlenhoff)
[10:29:41] (CR) Muehlenhoff: [C: +2] Fix the autostart logic for the systemd preset config file [puppet] - https://gerrit.wikimedia.org/r/705847 (owner: Muehlenhoff)
[10:33:43] (PS1) Effie Mouzeli: mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - https://gerrit.wikimedia.org/r/705852
[10:33:51] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2010.codfw.wmnet
[10:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:26] (PS4) Jbond: O:puppetmaster::puppetdb: rename role to puppetdb [puppet] - https://gerrit.wikimedia.org/r/701931 (https://phabricator.wikimedia.org/T285666)
[10:37:12] SRE, Data-Persistence-Backup, Goal, Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (jcrespo) We need to fix the craziness of current partitioning on all servers: ` # cumin 'P:mediabackup::storage' 'lsblk -b /dev/sdc' 8 hosts will be targeted:...
[10:37:54] (PS1) Muehlenhoff: Complement autostart test setup with a native systemd service [puppet] - https://gerrit.wikimedia.org/r/705853
[10:39:56] (PS3) Filippo Giunchedi: puppetdb: wait for stockpile initialization [puppet] - https://gerrit.wikimedia.org/r/705705
[10:40:37] (CR) Muehlenhoff: [C: +2] Complement autostart test setup with a native systemd service [puppet] - https://gerrit.wikimedia.org/r/705853 (owner: Muehlenhoff)
[10:40:48] (CR) Filippo Giunchedi: puppetdb: set permissions post-mount (1 comment) [puppet] - https://gerrit.wikimedia.org/r/705706 (owner: Filippo Giunchedi)
[10:40:56] (Abandoned) Filippo Giunchedi: puppetdb: set permissions post-mount [puppet] - https://gerrit.wikimedia.org/r/705706 (owner: Filippo Giunchedi)
[10:41:02] (PS1) Jbond: C:puppetdb::app: make stockpile dir and vardir local variables [puppet] - https://gerrit.wikimedia.org/r/705855
[10:41:13] SRE, Data-Persistence-Backup, Goal, Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (jcrespo) There is an initial grafana dashboard, but will need a lot of work, it is almost unusuable for now (not sure if because of the lack of activity, the m...
[10:41:46] (CR) jerkins-bot: [V: -1] C:puppetdb::app: make stockpile dir and vardir local variables [puppet] - https://gerrit.wikimedia.org/r/705855 (owner: Jbond)
[10:42:36] (CR) Filippo Giunchedi: "Thank you for your help!" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/705705 (owner: Filippo Giunchedi)
[10:42:51] (PS2) Arturo Borrero Gonzalez: kubeadm: enable TTLAfterFinished feature gate [puppet] - https://gerrit.wikimedia.org/r/704958 (https://phabricator.wikimedia.org/T286108)
[10:50:50] !log installing systemd security updates on bullseye
[10:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:13] (CR) Filippo Giunchedi: "Thank you for the review!"
(4 comments) [puppet] - https://gerrit.wikimedia.org/r/705704 (owner: Filippo Giunchedi)
[10:52:24] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/705705 (owner: Filippo Giunchedi)
[10:53:18] (PS2) Jbond: C:puppetdb::app: make stockpile dir and vardir local variables [puppet] - https://gerrit.wikimedia.org/r/705855
[10:54:08] (CR) Jelto: [C: +2] prometheus::ops add jobs and ferm rule to scrape gitlab metrics [puppet] - https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170) (owner: Jelto)
[10:54:26] (PS12) Jelto: prometheus::ops add jobs and ferm rule to scrape gitlab metrics [puppet] - https://gerrit.wikimedia.org/r/704503 (https://phabricator.wikimedia.org/T275170)
[10:54:52] (PS3) Effie Mouzeli: add mwdebug service to LVS 2 [puppet] - https://gerrit.wikimedia.org/r/704964 (https://phabricator.wikimedia.org/T283056)
[10:56:32] (PS3) Arturo Borrero Gonzalez: kubeadm: enable TTLAfterFinished feature gate [puppet] - https://gerrit.wikimedia.org/r/704958 (https://phabricator.wikimedia.org/T286108)
[10:57:10] (PS1) Muehlenhoff: autostart: Fix option for disabling a unit if used with update-rc.d [puppet] - https://gerrit.wikimedia.org/r/705856
[10:59:19] (CR) Arturo Borrero Gonzalez: [C: +2] kubeadm: enable TTLAfterFinished feature gate [puppet] - https://gerrit.wikimedia.org/r/704958 (https://phabricator.wikimedia.org/T286108) (owner: Arturo Borrero Gonzalez)
[11:00:00] (CR) MSantos: [C: +1] Revert "Temporary log all s3 SDK requests/responses" [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/705759 (owner: Jgiannelos)
[11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window (Your patch may or may not be deployed at the sole discretion of the deployer) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210721T1100).
[11:00:05] Urbanecm: A patch you scheduled for European mid-day backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:13] I'll self-service [11:00:27] ok :) [11:00:51] (03PS4) 10Urbanecm: GrowthExperiments: Add more wikis to linkrecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703179 (https://phabricator.wikimedia.org/T284481) (owner: 10Kosta Harlan) [11:00:55] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Add more wikis to linkrecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703179 (https://phabricator.wikimedia.org/T284481) (owner: 10Kosta Harlan) [11:01:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/705838 (owner: 10Muehlenhoff) [11:01:23] (03CR) 10Jbond: [C: 03+2] C:puppetdb::app: make stockpile dir and vardir local variables [puppet] - 10https://gerrit.wikimedia.org/r/705855 (owner: 10Jbond) [11:01:35] (03Merged) 10jenkins-bot: GrowthExperiments: Add more wikis to linkrecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703179 (https://phabricator.wikimedia.org/T284481) (owner: 10Kosta Harlan) [11:02:44] (03CR) 10Jgiannelos: [C: 03+2] Revert "Temporary log all s3 SDK requests/responses" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/705759 (owner: 10Jgiannelos) [11:03:19] (03CR) 10Effie Mouzeli: [C: 03+2] Add entries for mwdebug service [dns] - 10https://gerrit.wikimedia.org/r/704948 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [11:03:32] (03PS4) 10Effie Mouzeli: Add entries for mwdebug service [dns] - 10https://gerrit.wikimedia.org/r/704948 (https://phabricator.wikimedia.org/T283056) [11:03:50] (03Merged) 10jenkins-bot: Revert "Temporary log all s3 SDK requests/responses" [software/tegola] (wmf/v0.14.x) - 
10https://gerrit.wikimedia.org/r/705759 (owner: 10Jgiannelos) [11:05:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitaly,gitlab,nginx,rails,redis_gitlab,sidekiq,workhorse} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:05:22] (03CR) 10Muehlenhoff: [C: 03+2] autostart: Fix option for disabling a unit if used with update-rc.d [puppet] - 10https://gerrit.wikimedia.org/r/705856 (owner: 10Muehlenhoff) [11:07:24] (03CR) 10Volans: [C: 03+1] "LGTM! We might improve a user-facing message, see inline." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [11:08:55] (03PS1) 10Lucas Werkmeister (WMDE): Stop setting $wgWBClientSettings['repoNamespaces'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705857 (https://phabricator.wikimedia.org/T257260) [11:08:57] (03PS1) 10Lucas Werkmeister (WMDE): Remove wmgWikibaseClientRepoNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705858 (https://phabricator.wikimedia.org/T257260) [11:09:03] * jelto is working on metrics for gitlab and subcomponents (Prometheus jobs reduced availability on alert1001 is CRITI.CAL) [11:09:20] (03PS2) 10Effie Mouzeli: Add entries for tegola-vector-tiles service [dns] - 10https://gerrit.wikimedia.org/r/704955 (https://phabricator.wikimedia.org/T283159) [11:10:01] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "DNM before wmf.16 is safely rolled out to all wikis and won’t be rolled back again." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/705857 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [11:10:20] (03PS1) 10Jgiannelos: tegola-vector-tiles: Enable prometheus metrics endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/705859 [11:10:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:11:13] (03CR) 10Effie Mouzeli: [C: 03+2] Add entries for tegola-vector-tiles service [dns] - 10https://gerrit.wikimedia.org/r/704955 (https://phabricator.wikimedia.org/T283159) (owner: 10Effie Mouzeli) [11:11:21] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: d6699dae1e96b38b4fae7e8b9817d84b56d2be6c: GrowthExperiments: Add more wikis to linkrecommendation experiment (T284481) (duration: 01m 31s) [11:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:29] T284481: Deploy Add a link to the second set of wikis - https://phabricator.wikimedia.org/T284481 [11:12:32] 10SRE, 10Wikimedia-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Aklapper) >>! In T275437#6850975, @Pchelolo wrote: > Ok, restarting jobqueue change propagation service might have resolved the problem. Will investigate the root cause of this.... 
[11:12:41] (03PS2) 10Jgiannelos: tegola-vector-tiles: Enable prometheus metrics endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/705859 [11:14:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitaly,gitlab,nginx,rails,redis_gitlab,sidekiq,workhorse} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:15:22] * urbanecm is done with deployments [11:16:48] (03PS1) 10Effie Mouzeli: Add discovery for mwdebug [dns] - 10https://gerrit.wikimedia.org/r/705860 (https://phabricator.wikimedia.org/T283056) [11:17:33] (03CR) 10jerkins-bot: [V: 04-1] Add discovery for mwdebug [dns] - 10https://gerrit.wikimedia.org/r/705860 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [11:17:46] ACKNOWLEDGEMENT - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitaly,gitlab,nginx,rails,redis_gitlab,sidekiq,workhorse} site={codfw,eqiad} Jelto setting up gitlab metrics https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:21:32] (03CR) 10Effie Mouzeli: [C: 03+2] conftool-data: add mwdebug discovery 1 [puppet] - 10https://gerrit.wikimedia.org/r/704799 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [11:26:18] !log hashar@deploy1002 Started deploy [integration/docroot@0515d9c]: Support linking to individual doc.wikimedia.org tiles [11:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:28] !log hashar@deploy1002 Finished deploy [integration/docroot@0515d9c]: Support linking to individual doc.wikimedia.org tiles (duration: 00m 09s) [11:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:00] (03CR) 10Urbanecm: [C: 03+1] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/704506 
(https://phabricator.wikimedia.org/T285811) (owner: 10Urbanecm) [11:28:14] (03CR) 10Muehlenhoff: [C: 03+2] Disable unprivileged user namespaces on Bullseye and Buster hosts with 5.10 [puppet] - 10https://gerrit.wikimedia.org/r/705838 (owner: 10Muehlenhoff) [11:30:14] (03Abandoned) 10Effie Mouzeli: Add discovery for mwdebug [dns] - 10https://gerrit.wikimedia.org/r/705860 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [11:30:51] (03PS3) 10Jgiannelos: tegola-vector-tiles: Enable prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/705859 [11:35:56] (03PS1) 10Effie Mouzeli: Add discovery for mwdebug [dns] - 10https://gerrit.wikimedia.org/r/705862 (https://phabricator.wikimedia.org/T283056) [11:38:30] (03PS2) 10Effie Mouzeli: Add discovery for mwdebug [dns] - 10https://gerrit.wikimedia.org/r/705862 (https://phabricator.wikimedia.org/T283056) [11:40:53] (03CR) 10Effie Mouzeli: [C: 03+2] Add discovery for mwdebug [dns] - 10https://gerrit.wikimedia.org/r/705862 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [12:04:42] (03PS4) 10Hnowlan: maps: make maps2010 a buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/702615 (https://phabricator.wikimedia.org/T269582) [12:08:00] (03PS1) 10Effie Mouzeli: Revert "Add discovery for mwdebug" [dns] - 10https://gerrit.wikimedia.org/r/705760 [12:18:35] (03PS2) 10Dzahn: site/conftool: add mw1437,mw1438 as canary jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/705721 (https://phabricator.wikimedia.org/T279309) [12:23:02] (03PS1) 10Jelto: Revert "prometheus::ops add jobs and ferm rule to scrape gitlab metrics" [puppet] - 10https://gerrit.wikimedia.org/r/705761 [12:24:36] (03CR) 10jerkins-bot: [V: 04-1] Revert "prometheus::ops add jobs and ferm rule to scrape gitlab metrics" [puppet] - 10https://gerrit.wikimedia.org/r/705761 (owner: 10Jelto) [12:25:50] (03CR) 10Dzahn: [C: 03+2] site/conftool: add mw1437,mw1438 as canary jobrunners [puppet] - 
10https://gerrit.wikimedia.org/r/705721 (https://phabricator.wikimedia.org/T279309) (owner: 10Dzahn) [12:26:55] 10SRE, 10User-MoritzMuehlenhoff: Monitor sensitive sysctl settings - https://phabricator.wikimedia.org/T287081 (10MoritzMuehlenhoff) [12:28:40] (03PS2) 10Jelto: Revert "prometheus::ops add jobs and ferm rule to scrape gitlab metrics" [puppet] - 10https://gerrit.wikimedia.org/r/705761 (https://phabricator.wikimedia.org/T275170) [12:30:46] (03PS3) 10Jelto: Revert "prometheus::ops add jobs and ferm rule to scrape gitlab metrics" [puppet] - 10https://gerrit.wikimedia.org/r/705761 (https://phabricator.wikimedia.org/T275170) [12:30:55] (03CR) 10Jbond: icinga: Write to Icinga command file instead of calling icinga-downtime (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [12:31:18] (03CR) 10jerkins-bot: [V: 04-1] Revert "prometheus::ops add jobs and ferm rule to scrape gitlab metrics" [puppet] - 10https://gerrit.wikimedia.org/r/705761 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [12:31:58] (03PS4) 10Jelto: Revert "prometheus::ops add jobs and ferm rule to scrape gitlab metrics" [puppet] - 10https://gerrit.wikimedia.org/r/705761 (https://phabricator.wikimedia.org/T275170) [12:32:29] (03PS1) 10Btullis: Update sre.cassandra.roll-restart cookbook to use new spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) [12:33:00] (03PS6) 10Btullis: Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) [12:37:33] (03PS5) 10Jelto: Revert "prometheus::ops add jobs to scrape gitlab metrics" [puppet] - 10https://gerrit.wikimedia.org/r/705761 (https://phabricator.wikimedia.org/T275170) [12:39:47] (03CR) 10JMeybohm: [C: 03+1] Revert "prometheus::ops add jobs to scrape gitlab metrics" [puppet] - 
10https://gerrit.wikimedia.org/r/705761 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [12:42:15] (03CR) 10Jelto: [C: 03+2] Revert "prometheus::ops add jobs to scrape gitlab metrics" [puppet] - 10https://gerrit.wikimedia.org/r/705761 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [12:44:58] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw[1437-1438].eqiad.wmnet with reason: new host [12:44:59] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[1437-1438].eqiad.wmnet with reason: new host [12:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:00] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T286698 (10Papaul) a:03Papaul [12:48:36] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T286698 (10Papaul) @fgiunchedi don't really know why and it is HP server so i am not surprise also we are using a BBU pulled from a decom server as well. I will find another BBU and replace. Thanks [12:48:56] (03CR) 10JMeybohm: [C: 03+1] "This LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/698895 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [12:50:25] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup, 10database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (10Papaul) @jcrespo I will request for HP to send us a new DIMM [12:52:56] (03CR) 10JMeybohm: [C: 04-1] miscweb: add a define for the httpd prometheus exporter and use it (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [12:55:22] (03CR) 10Kormat: [C: 03+1] "This seems reasonable to me." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [12:56:19] (03PS4) 10Effie Mouzeli: add mwdebug service to LVS 2 [puppet] - 10https://gerrit.wikimedia.org/r/704964 (https://phabricator.wikimedia.org/T283056) [12:58:42] (03CR) 10Vgutierrez: [C: 03+1] add mwdebug service to LVS 2 [puppet] - 10https://gerrit.wikimedia.org/r/704964 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [12:58:57] (03CR) 10Effie Mouzeli: [C: 03+2] add mwdebug service to LVS 2 [puppet] - 10https://gerrit.wikimedia.org/r/704964 (https://phabricator.wikimedia.org/T283056) (owner: 10Effie Mouzeli) [12:59:11] (03PS5) 10Effie Mouzeli: add mwdebug service to LVS 2 [puppet] - 10https://gerrit.wikimedia.org/r/704964 (https://phabricator.wikimedia.org/T283056) [13:01:50] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup, 10database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (10jcrespo) Thank you! 
[13:02:30] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T286698 (10Papaul) p:05Triage→03Medium [13:02:50] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30288/console" [puppet] - 10https://gerrit.wikimedia.org/r/705705 (owner: 10Filippo Giunchedi) [13:03:35] (03PS4) 10Filippo Giunchedi: puppetdb: wait for stockpile initialization [puppet] - 10https://gerrit.wikimedia.org/r/705705 [13:04:35] so train time [13:04:43] a few minutes late cause I was on a phone call :D [13:04:58] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw143[78].eqiad.wmnet [13:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:07] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30289/console" [puppet] - 10https://gerrit.wikimedia.org/r/705705 (owner: 10Filippo Giunchedi) [13:05:48] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [13:06:07] 10SRE, 10DBA, 10Infrastructure-Foundations, 10Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (10fgiunchedi) [13:06:35] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) [13:07:00] 10SRE, 10Analytics, 10DBA, 10Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (10fgiunchedi) [13:07:26] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) @fgiunchedi the sensors for the Raritan PDU are in place [13:07:42] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] 
puppetdb: wait for stockpile initialization [puppet] - 10https://gerrit.wikimedia.org/r/705705 (owner: 10Filippo Giunchedi) [13:07:44] 10SRE, 10Wikimedia-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Pchelolo) 05Open→03Resolved a:03Pchelolo Didn't happen again in half a year. [13:08:25] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw129[34].eqiad.wmnet [13:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:13] !log mw1293, mw1294 - formerly jobrunner canaries, depooled, replaced by new jobrunner canaries mw1437, mw1438 [13:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:29] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.15 [13:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:42] !log apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/705705 to puppetdb hosts [13:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:02] that will bounce puppetdb FYI [13:10:43] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.15 (duration: 01m 13s) [13:10:46] jbond: JFYI ^ [13:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:33] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01452 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:15:07] yes that's me ^ [13:15:12] !log jiji@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=mwdebug [13:15:16] ahh puppetdb restart? 
[13:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:24] indeed [13:15:28] ack [13:15:56] it is back now, should recover soon [13:16:02] (03PS1) 10David Caro: ceph: Add CephOSDFlag object [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/705872 [13:16:04] (03PS1) 10David Caro: ceph: Add CephStatus tests [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/705873 [13:16:06] (03PS1) 10David Caro: ceph: fix typo Satus->Status [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/705874 [13:16:19] the puppet run didn't do the right thing when changing the mount options FWIW [13:16:24] the mount still had root:root [13:16:35] I fixed it now and we're good [13:16:59] (03PS1) 10Dzahn: site/conftool/DHCP: decom mw1293, mw1294, jobrunner canaries [puppet] - 10https://gerrit.wikimedia.org/r/705875 (https://phabricator.wikimedia.org/T280203) [13:17:10] godog: did it just need remounting or did you have to edit fstab [13:17:27] jbond: the former, I did umount / mount [13:17:34] mount -o remount didn't do it [13:17:53] which I guess makes sense, fstab not involved [13:17:58] ack thanks suspect puppet may do mount -o remount then [13:18:25] yeah I think so too [13:18:59] ack thaks, ill take a quick look but unlckly to hit is again so dont plan to spend to much time on it [13:19:26] SGTM [13:20:06] !log jiji@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mwdebug [13:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:20] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T286698 (10fgiunchedi) SGTM @Papaul, you can power down the host at any time [13:23:47] (03PS1) 10Effie Mouzeli: disc_desired_state: add mwdebug service [puppet] - 10https://gerrit.wikimedia.org/r/705877 [13:23:56] I'll do a cumin run to speed up recovery [13:24:11] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - 
https://phabricator.wikimedia.org/T251305 (10Joe) [13:24:39] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Joe) As a data point: the rolling restart shouldn't be an issue. I just tested the mechanism I created for the mwdebug deployment in a simplified helmfile using helm3, and the command `... [13:25:10] (03CR) 10Effie Mouzeli: [C: 03+2] Revert "Add discovery for mwdebug" [dns] - 10https://gerrit.wikimedia.org/r/705760 (owner: 10Effie Mouzeli) [13:26:08] 10SRE, 10Traffic: Sudden surge of requests to https://wikipedia.org/ from Telus customers - https://phabricator.wikimedia.org/T276213 (10Aklapper) Four months later: Is this something to still further investigate, or can this be closed? [13:28:47] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1008.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:29:47] hmm [13:34:21] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001742 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:36:34] so lvs1015/lvs1016 is unable to connect to port 8081 (sessionstore) on the k8s eqiad cluster [13:36:46] is this expected? 
[13:38:36] (03CR) 10Elukey: "Still trying to parse what the code review does, left a comment for a typo, don't consider me a blocker :)" (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/704343 (owner: 10Volans) [13:38:52] (03CR) 10Jbond: [C: 03+1] "LGTM some optional comments inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [13:39:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1293.eqiad.wmnet [13:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:21] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:43:41] (03CR) 10Volans: Update sre.cassandra.roll-restart cookbook to use new spicerack API (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [13:43:55] FYI: I acked the Prometheus jobs reduced availability Icinga alert due to some work on gitlab. 10 Minutes ago the alert fired for job=swagger_check_sessionstore_eqiad site=eqiad as well. I removed the ack because I'm finished with Gitlab but it was a little bit too late. 
So the alert for the sessions_store was not routed here [13:46:04] PROBLEM - LVS sessionstore eqiad port 8081/tcp - Session store- sessionstore.svc.eqiad.wmnet IPv4 #page on sessionstore.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.29 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:46:57] * volans ack'ed on VO [13:47:27] here 👋 [13:47:28] is that expected? [13:47:54] is it receiving traffic since it is eqiad? [13:48:15] it should be depooled in eqiad, checking [13:49:04] it's depooled [13:49:06] yes [13:49:09] {"eqiad": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=sessionstore"} [13:49:11] good :-) [13:49:22] * jbond here [13:49:28] RECOVERY - LVS sessionstore eqiad port 8081/tcp - Session store- sessionstore.svc.eqiad.wmnet IPv4 #page on sessionstore.svc.eqiad.wmnet is OK: OK - Certificate sessionstore.discovery.wmnet will expire on Tue 28 May 2024 05:38:58 PM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:49:29] (03CR) 10Jbond: [C: 03+1] Update sre.cassandra.roll-restart cookbook to use new spicerack API (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [13:49:53] vgutierrez, do you have log at lvs level of why it failed? [13:49:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1293.eqiad.wmnet [13:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:03] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1293.eqiad.wmnet` - m... 
[13:50:14] would still like to understand though, scrolling back for context [13:50:16] jynus: realservers stopped to accept connections on that port [13:50:44] weird, I don't see any alert of a host going down or anything [13:50:47] kubernetes1008.eqiad.wmnet [10.64.0.218] 8081 (tproxy) : Connection refused [13:50:58] kubernetes1016.eqiad.wmnet [10.64.48.21] 8081 (tproxy) : Connection refused [13:51:09] ah, now the kube alerts show [13:51:54] oh, hello hello [13:51:56] https://grafana-rw.wikimedia.org/d/000001590/sessionstore?orgId=1&var-dc=thanos&var-site=eqiad&var-service=sessionstore&var-prometheus=k8s&var-container_name=kask-production&from=1626832292096&to=1626875492096 [13:52:00] this might be me again o/ [13:52:02] *something* was still talking to it, and stopped at 13:28 [13:52:03] PROBLEM - mediawiki-installation DSH group on mw1294 is CRITICAL: Host mw1294 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:52:04] I ran "sudo kubectl get pods -n sessionstore" on kubemaster, all pods "Evicted" [13:52:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1294.eqiad.wmnet [13:52:17] jayme: aha :) [13:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:19] yes, it's me! [13:52:30] * jayme == ever given in the eqiad canal [13:52:31] loool [13:52:41] lol [13:53:42] (03PS1) 10Muehlenhoff: Add component/systemd241 to Udebcomponents [puppet] - 10https://gerrit.wikimedia.org/r/705891 (https://phabricator.wikimedia.org/T287036) [13:53:42] Sorry for the noise [13:54:26] Is there a way to downtime that just for sessionstore? [13:54:42] the pybal health checks I mean [13:55:02] (03CR) 10Btullis: "Abandoning this proposed change in order to explore the DNS discovery route instead." 
[dns] - 10https://gerrit.wikimedia.org/r/705376 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [13:55:10] nope [13:55:14] :( [13:55:28] (03Abandoned) 10Btullis: Add a CNAME for analytics-test-presto.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/705376 (https://phabricator.wikimedia.org/T273642) (owner: 10Btullis) [13:55:40] (03CR) 10Muehlenhoff: [C: 03+2] Add component/systemd241 to Udebcomponents [puppet] - 10https://gerrit.wikimedia.org/r/705891 (https://phabricator.wikimedia.org/T287036) (owner: 10Muehlenhoff) [13:55:52] Lucas_WMDE: addshore: I got a single error coming from Wikidata after pushing wmf.15 . I have marked it with #wikidata-campsite https://phabricator.wikimedia.org/T287085 [13:56:07] * Lucas_WMDE looks [13:56:36] uh oh… [13:56:54] hm, but only once so far, strange [13:57:02] that one was for wikidata.org [13:57:07] an api call of some sort [13:57:24] 10SRE, 10Wikimedia-Mailing-lists: Create new Mailing List PRCWikimen - https://phabricator.wikimedia.org/T287083 (10Aklapper) [13:57:29] client data access it seems [13:58:18] given it is a single error so far, I am not making it a blocker [13:58:40] (03PS1) 10Jgreen: remove deprecated frpig*-fundraising A/PTR records [dns] - 10https://gerrit.wikimedia.org/r/705892 (https://phabricator.wikimedia.org/T255435) [13:59:14] 10SRE, 10Wikimedia-Mailing-lists: Create new Mailing List PRCWikimen - https://phabricator.wikimedia.org/T287083 (10Ladsgroup) Are these two email addresses belong to one person? The reason we are asking for two email addresses is to have at least two admins. [13:59:20] hm, POST to api.php, so we don’t know what they were doing :/ [13:59:42] some kind of edit apparently [14:00:22] i guess an edit trying to get a label of a sense? [14:00:38] jayme: als that won't page [14:00:41] *also [14:00:58] hashar: do you want to deploy the its-phabricator plugin ? [14:01:01] oh yeah, I can reproduce it with =mw.wikibase.getLabel('L1-S1') in a Lua console [14:01:09] woo! 
[14:01:15] (or rather, I can reproduce some LogicException, idk if it’s the same message, that’s not shown ^^) [14:01:24] (03CR) 10Jgreen: [C: 03+2] remove deprecated frpig*-fundraising A/PTR records [dns] - 10https://gerrit.wikimedia.org/r/705892 (https://phabricator.wikimedia.org/T255435) (owner: 10Jgreen) [14:01:37] !log imported systemd 241-5~bpo9+wmf1 to component/systemd241 T287036 [14:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:44] T287036: Figure out a patched backport of systemd 241 for stretch - https://phabricator.wikimedia.org/T287036 [14:01:48] vgutierrez: ah, okay. So just the sessionstore one was paging...I can at least downtime that one [14:02:33] Lucas_WMDE: ya, [a51290de-8840-479d-8c1a-dbb9f96becc2] /w/api.php LogicException: Unable to find Service callback for Entity Type sense for Source wikidata [14:02:41] ok [14:02:47] jayme: the lvs ones.. at heads up to the lovely traffic team would suffice [14:02:56] s/at/a/ [14:03:13] Lucas_WMDE: permalink to your log is https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-2021.07.21?id=AZFeyXoBStjVNP_Pd8iI [14:03:48] vgutierrez: yeah. Will do explicitely next time, sorry. 
I only did a generic heads up in -sre [14:04:23] (03PS1) 10Jgreen: Remove deprecated fundraising-[eqiad|codfw].wikimedia.org A/PTR records [dns] - 10https://gerrit.wikimedia.org/r/705893 (https://phabricator.wikimedia.org/T255435) [14:05:36] (03CR) 10Jgreen: [C: 03+2] Remove deprecated fundraising-[eqiad|codfw].wikimedia.org A/PTR records [dns] - 10https://gerrit.wikimedia.org/r/705893 (https://phabricator.wikimedia.org/T255435) (owner: 10Jgreen) [14:07:21] !log authdns-update to remove deprecated records related to fundraising.wikimedia.org [14:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:27] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1294.eqiad.wmnet [14:07:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:36] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1294.eqiad.wmnet` - m... [14:07:51] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:08:11] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:08:50] jayme: yey.. 
I didn't make the connection between non-ml k8s nodes and sessionstore :( [14:09:19] (03PS1) 10Muehlenhoff: Add dummy certs for ganeti test cluster [labs/private] - 10https://gerrit.wikimedia.org/r/705894 (https://phabricator.wikimedia.org/T286206) [14:09:47] (03CR) 10Dzahn: [C: 03+2] site/conftool/DHCP: decom mw1293, mw1294, jobrunner canaries [puppet] - 10https://gerrit.wikimedia.org/r/705875 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [14:09:54] (03PS2) 10Dzahn: site/conftool/DHCP: decom mw1293, mw1294, jobrunner canaries [puppet] - 10https://gerrit.wikimedia.org/r/705875 (https://phabricator.wikimedia.org/T280203) [14:13:53] mutante: oh yeah its-phabricator for sure we can give it a try :D [14:14:43] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:14:57] hashar: ok, go ahead if you want, I am ready to then merge the change that needs to go with it [14:15:24] pybal sessionstore is still me! 
Expect that to happen for another hour or so [14:15:47] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1014.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:15:50] jouncebot: now [14:15:51] No deployments scheduled for the next 3 hour(s) and 44 minute(s) [14:15:52] jouncebot: next [14:15:52] In 3 hour(s) and 44 minute(s): Morning backport window. Your patch may or may not be deployed at the sole discretion of the deployer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210721T1800) [14:15:53] In 3 hour(s) and 44 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210721T1800) [14:16:14] (03CR) 10Reedy: [C: 03+2] Localisation updates from https://translatewiki.net. 
[extensions/VisualEditor] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/705751 (https://phabricator.wikimedia.org/T286679) (owner: 10Reedy) [14:16:22] (03PS2) 10Dzahn: gerrit: remove escapeUri [puppet] - 10https://gerrit.wikimedia.org/r/705503 (https://phabricator.wikimedia.org/T280197) (owner: 10Hashar) [14:16:48] (03PS3) 10Hashar: Update its-phabricator: Urlencode POST to conduit [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/705650 (https://phabricator.wikimedia.org/T280197) [14:16:53] mutante: had to fix my change :D [14:17:06] (03CR) 10Hashar: [V: 03+2 C: 03+2] Update its-phabricator: Urlencode POST to conduit [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/705650 (https://phabricator.wikimedia.org/T280197) (owner: 10Hashar) [14:17:45] alright :) [14:17:59] PROBLEM - Ensure local MW versions match expected deployment on mw2384 is CRITICAL: CRITICAL: 525 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:18:15] I am not sure about the sequence or whether I have to restart gerrit :D [14:18:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_sessionstore_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:18:45] hashar: on gerrit you said first upgrade plugin on server, then merge change, then restart gerrit [14:18:53] yeah [14:18:56] let me do the upgrade [14:19:12] wondering what's up with m2384 [14:19:19] !log hashar@deploy1002 Started deploy [gerrit/gerrit@a5c9d35]: Update its-phabricator: Urlencode POST to conduit # T280197 [14:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:26] T280197: Gerritbot turns "+" into space, thus breaking most Gerrit URLs - https://phabricator.wikimedia.org/T280197 [14:19:27] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add dummy certs for ganeti 
test cluster [labs/private] - 10https://gerrit.wikimedia.org/r/705894 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [14:19:28] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@a5c9d35]: Update its-phabricator: Urlencode POST to conduit # T280197 (duration: 00m 09s) [14:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:45] mutante: you can apply the puppet patch. Looks like Gerrit reload the plugin automatically [14:21:10] hashar: ACK doing [14:21:32] (03CR) 10Dzahn: [V: 03+2] "plugin has been updated on server" [puppet] - 10https://gerrit.wikimedia.org/r/705503 (https://phabricator.wikimedia.org/T280197) (owner: 10Hashar) [14:21:36] it might do the same for the templates [14:21:59] 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup, 10database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (10Papaul) Case Reference ID: 5357298848 Status: Case is generated and in Progress Subject: HPE ProLiant DL360 Gen10 - DIMM Failed P... [14:22:35] ACKNOWLEDGEMENT - Ensure local MW versions match expected deployment on mw2384 is CRITICAL: CRITICAL: 525 mismatched wikiversions daniel_zahn T286463 https://wikitech.wikimedia.org/wiki/Application_servers [14:22:44] (03CR) 10Dzahn: [V: 03+2 C: 03+2] gerrit: remove escapeUri [puppet] - 10https://gerrit.wikimedia.org/r/705503 (https://phabricator.wikimedia.org/T280197) (owner: 10Hashar) [14:23:31] hashar: do you need to do both servers? codfw? [14:23:42] just eqiad [14:23:52] ok, it's been applied .. 
now [14:24:20] thanks [14:24:23] I will play my test case [14:24:30] cool [14:26:30] that works :) [14:27:08] I will capture in the doc that plugin and template can be updated on the fly ;) [14:27:16] great:) [14:27:58] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [14:28:38] (03CR) 10MSantos: [C: 03+1] "LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/705859 (owner: 10Jgiannelos) [14:29:58] (03PS1) 10Ladsgroup: dumps: Drop absented cron in kiwix [puppet] - 10https://gerrit.wikimedia.org/r/705898 (https://phabricator.wikimedia.org/T273673) [14:30:55] (03PS1) 10Muehlenhoff: Add cert for ganeti-test RAPI [puppet] - 10https://gerrit.wikimedia.org/r/705899 (https://phabricator.wikimedia.org/T286206) [14:32:12] (03PS1) 10Ladsgroup: mariadb: Drop absented cron in check_private_data [puppet] - 10https://gerrit.wikimedia.org/r/705901 (https://phabricator.wikimedia.org/T273673) [14:33:44] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw143[7-8].eqiad.wmnet,service=canary [14:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:32] (03PS2) 10Volans: decorators: improve the retry decorator [software/pywmflib] - 10https://gerrit.wikimedia.org/r/704343 [14:35:51] (03CR) 10Volans: "reply inline" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/704343 (owner: 10Volans) [14:36:50] mutante: it is definitely a success. We might need a slight adjustment later eventually. Thank you for stepping in! [14:36:50] (03CR) 10Muehlenhoff: [C: 03+2] Add cert for ganeti-test RAPI [puppet] - 10https://gerrit.wikimedia.org/r/705899 (https://phabricator.wikimedia.org/T286206) (owner: 10Muehlenhoff) [14:36:59] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. 
[extensions/VisualEditor] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/705751 (https://phabricator.wikimedia.org/T286679) (owner: 10Reedy) [14:37:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:38:05] hashar: you're welcome, just wanted to get the review done and it seemed this time would work :) [14:38:19] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:38:35] yeah I wanted to double check the behavior of gerrit when pushing a new plugin or how to get the templates update taken into account [14:40:19] (03CR) 10Volans: [V: 03+2 C: 03+2] Fix group assignement in CAS-SSO support [software/netbox] - 10https://gerrit.wikimedia.org/r/705358 (owner: 10Volans) [14:40:27] !log powerdown ms-be2038 for BBU replacement [14:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:47] PROBLEM - Host ms-be2038 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_sessionstore_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:43:51] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes1001.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:45:13] PROBLEM - Ensure local MW versions match expected deployment on mw1438 is CRITICAL: CRITICAL: 
525 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:46:18] !log reedy@deploy1002 Started scap: Fix some VE translation issues for T286679 [14:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:25] T286679: Some interface messages are in Welsh when British English is selected as the interface language - https://phabricator.wikimedia.org/T286679 [14:46:57] PROBLEM - Ensure local MW versions match expected deployment on mw1437 is CRITICAL: CRITICAL: 525 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [14:46:58] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog: Access request to Superset for toberto - https://phabricator.wikimedia.org/T286746 (10nettrom_WMF) >>! In T286746#7227261, @toberto wrote: > @nettrom_WMF Hi Morten! I've signed the doc and completed info up above. One question: i... [14:48:40] (03CR) 10Dzahn: miscweb: add a define for the httpd prometheus exporter and use it (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [14:48:55] (03PS7) 10Dzahn: miscweb: add a define for the httpd prometheus exporter and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/700522 (https://phabricator.wikimedia.org/T281538) [14:49:17] RECOVERY - Host ms-be2038 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [14:51:03] !log reedy@deploy1002 Finished scap: Fix some VE translation issues for T286679 (duration: 04m 45s) [14:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:10] RECOVERY - Ensure local MW versions match expected deployment on mw1437 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [14:52:35] RECOVERY - HP RAID on ms-be2038 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - 
Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:52:38] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30290/console" [puppet] - 10https://gerrit.wikimedia.org/r/702615 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [14:53:36] 10SRE, 10ops-codfw: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T286698 (10Papaul) 05Open→03Resolved BBU replaced server is back online [14:55:06] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10JMeybohm) >>! In T251305#7227228, @Joe wrote: > As a data point: the rolling restart shouldn't be an issue. I just tested the mechanism I created for the mwdebug deployment in a simplifi... [14:55:53] (03PS1) 10Dzahn: site/conftool: add mw1439, mw1440 as jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/705927 (https://phabricator.wikimedia.org/T279309) [14:56:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:57:45] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:58:13] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog: Access request to Superset for toberto - https://phabricator.wikimedia.org/T286746 (10toberto) All good and thanks, @nettrom_WMF ! I'll review the user responsibilities and be on standby if you need anything else from me. Thanks! 
[15:02:18] RECOVERY - Ensure local MW versions match expected deployment on mw1438 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [15:04:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/705429 (owner: 10Volans) [15:04:24] (03PS1) 10Hashar: Merge 'wmf/stable-3.2' into wmf/stable-3.3 [software/gerrit/plugins/gitiles] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/705929 (https://phabricator.wikimedia.org/T262241) [15:04:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/705430 (owner: 10Volans) [15:05:48] (03PS2) 10Btullis: Update sre.cassandra.roll-restart cookbook to use new spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) [15:08:12] (03PS1) 10Jelto: make prometheus exporters reachable [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705930 (https://phabricator.wikimedia.org/T275170) [15:08:45] (03CR) 10Btullis: Update sre.cassandra.roll-restart cookbook to use new spicerack API (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [15:11:05] !log installing apt bugfix updates from Buster 10.10 point release [15:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:10] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:13:26] hashar: FYI in case wasn't already noticed, it seems that gitiles is showing all files in bold now, regardless of the language. See https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/README [15:13:50] not sure if expected/wanted/a default that changed/etc... 
so just mentioning it :) [15:17:01] !log installing intel-microcode security updates on stretch [15:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:18] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/705428 (owner: 10Volans) [15:18:20] (03PS2) 10Jelto: make prometheus exporters reachable [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705930 (https://phabricator.wikimedia.org/T275170) [15:19:33] 10SRE: Integrate Buster 10.10 point update - https://phabricator.wikimedia.org/T285206 (10MoritzMuehlenhoff) [15:20:26] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Volans) FYI if that helps this is the current row-distribution of the API appservers in eqiad: ` {'B': 19, 'D': 18, 'C': 17, 'A': 9} ` Full details at P16841 [15:21:07] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [15:21:31] (03PS4) 10Volans: sre.hosts.downtime: downtime any Icinga host [cookbooks] - 10https://gerrit.wikimedia.org/r/705428 [15:21:43] (03CR) 10Jelto: [C: 03+1] "lgtm, I hope we find a easier solution when migrating to puppet :)" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705715 (https://phabricator.wikimedia.org/T274462) (owner: 10Brennen Bearnes) [15:22:15] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: complete restbase transition to ECS [puppet] - 10https://gerrit.wikimedia.org/r/705729 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:25:52] (03CR) 10Elukey: [C: 03+1] "Good job thanks!" 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [15:26:04] (03CR) 10Volans: [C: 03+2] sre.hosts.downtime: downtime any Icinga host [cookbooks] - 10https://gerrit.wikimedia.org/r/705428 (owner: 10Volans) [15:26:18] (03PS4) 10Volans: sre.hosts.downtime: convert format() to f-string [cookbooks] - 10https://gerrit.wikimedia.org/r/705429 [15:26:35] (03PS1) 10Hashar: Merge branch 'wmf/stable-3.2' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/705934 (https://phabricator.wikimedia.org/T262241) [15:26:52] (03CR) 10Herron: [C: 03+1] logstash: complete restbase transition to ECS [puppet] - 10https://gerrit.wikimedia.org/r/705729 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:27:38] (03CR) 10Hashar: "This "might" be sufficient. Then I have used wmf-plugins-update.sh which update our submodules from the remote. We might bring unrelated" [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/705934 (https://phabricator.wikimedia.org/T262241) (owner: 10Hashar) [15:29:11] (03Merged) 10jenkins-bot: sre.hosts.downtime: downtime any Icinga host [cookbooks] - 10https://gerrit.wikimedia.org/r/705428 (owner: 10Volans) [15:29:55] (03CR) 10Brennen Bearnes: make prometheus exporters reachable (031 comment) [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705930 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [15:30:00] (03CR) 10Volans: [C: 03+2] sre.hosts.downtime: convert format() to f-string [cookbooks] - 10https://gerrit.wikimedia.org/r/705429 (owner: 10Volans) [15:30:07] (03PS3) 10Volans: sre.hosts.remove-downtime: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/705430 [15:32:03] (03PS3) 10Btullis: Update sre.cassandra.roll-restart cookbook to use new spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) [15:34:18] (03Merged) 10jenkins-bot: 
sre.hosts.downtime: convert format() to f-string [cookbooks] - 10https://gerrit.wikimedia.org/r/705429 (owner: 10Volans) [15:34:35] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw143[7-8].eqiad.wmnet,service=canary [15:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:40] btullis: merge race? :-) [15:35:13] !log dzahn@cumin1001 conftool action : set/weight=10; selector: name=mw143[7-8].eqiad.wmnet [15:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:42] (03CR) 10MSantos: [C: 04-1] maps: make maps2010 a buster replica of maps2009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/702615 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [15:36:28] volans: I didn't try to merge anything. Not sure I understand, sorry. Do you want me to rebase cookbooks? [15:37:10] btullis: nah, I thought you were rebasing because I merged and so you had to rebase because of me, and as I was merging 3 of them in a row didn't want you to get annoyed hitting rebase all the time [15:37:32] I'm done btw, so all yours :) [15:37:59] (03PS3) 10Jelto: make prometheus exporters reachable [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705930 (https://phabricator.wikimedia.org/T275170) [15:38:04] Oh I see. No problem, thanks. I can rebase all day. [15:38:08] lol [15:38:10] rzl: hi, just a quick check about https://gerrit.wikimedia.org/r/c/operations/puppet/+/704506 -- is there anything I should do to make a merge possible? 
🙂 [15:38:16] (03PS4) 10Btullis: Update sre.cassandra.roll-restart cookbook to use new spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) [15:38:35] (03PS7) 10Btullis: Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) [15:40:49] (03PS2) 10Herron: logstash: add logstash200[123] to v7 cluster [puppet] - 10https://gerrit.wikimedia.org/r/701611 (https://phabricator.wikimedia.org/T281266) [15:41:07] urbanecm: nope you're good, I'm just behind in my morning loop :) be right there, sorry [15:41:34] no problem -- I just remember you saying you'll look in morning, that's why i ask :) [15:42:48] (03CR) 10Herron: [C: 03+2] logstash: add logstash200[123] to v7 cluster [puppet] - 10https://gerrit.wikimedia.org/r/701611 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [15:44:43] (03PS1) 10Volans: sre.hosts.downtime: fix typo in message [cookbooks] - 10https://gerrit.wikimedia.org/r/705938 [15:45:46] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 0:05:00 on 10.3.0.1,cumin1001.mgmt,debmonitor.wikimedia.org with reason: testing new feature [15:45:49] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on 10.3.0.1,cumin1001.mgmt,debmonitor.wikimedia.org with reason: testing new feature [15:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:18] (03PS1) 10Herron: install_server: use default os installer version on logstash200[123] [puppet] - 10https://gerrit.wikimedia.org/r/705939 [15:46:36] 10SRE, 10Performance-Team, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10dpifke) Sorry, this kinda dropped off my radar. 
The object-expirer works (it's been running in deployment-prep with no issues fo... [15:48:04] (03CR) 10Herron: [C: 03+2] install_server: use default os installer version on logstash200[123] [puppet] - 10https://gerrit.wikimedia.org/r/705939 (owner: 10Herron) [15:50:24] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts: ` logstash2001.codfw.wmnet ` The log can be found in `/... [15:53:09] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog: Access request to Superset for toberto - https://phabricator.wikimedia.org/T286746 (10RLazarus) Hi Toni, welcome to the Foundation! I can get you set up here. Thanks @nettrom_WMF for getting us started. (For future requests, note... [15:54:03] jouncebot: now [15:54:03] No deployments scheduled for the next 2 hour(s) and 5 minute(s) [15:54:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/705938 (owner: 10Volans) [15:54:22] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog: Access request to Superset for toberto - https://phabricator.wikimedia.org/T286746 (10RLazarus) [15:54:29] is it okay if I do a backport for https://phabricator.wikimedia.org/T287085? [15:55:19] (03CR) 10RLazarus: [C: 03+2] mediawiki/maintenance/growthexperiments.pp: Run updateMenteeData every day [puppet] - 10https://gerrit.wikimedia.org/r/704506 (https://phabricator.wikimedia.org/T285811) (owner: 10Urbanecm) [15:55:45] (03CR) 10Volans: "Actually I think we might improve it." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [15:55:54] Thanks rzl! 
[15:55:55] (03PS1) 10Lucas Werkmeister (WMDE): Define PREFETCHING_TERM_LOOKUP for all types in client and repo [extensions/WikibaseLexeme] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/705912 (https://phabricator.wikimedia.org/T287085) [15:55:59] (03CR) 10Volans: [C: 03+2] sre.hosts.downtime: fix typo in message [cookbooks] - 10https://gerrit.wikimedia.org/r/705938 (owner: 10Volans) [15:57:18] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Define PREFETCHING_TERM_LOOKUP for all types in client and repo [extensions/WikibaseLexeme] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/705912 (https://phabricator.wikimedia.org/T287085) (owner: 10Lucas Werkmeister (WMDE)) [15:57:30] ^ I’ve hit the +2 button, you still have 20 minutes or so to tell me not to backport :) [15:58:42] (03Merged) 10jenkins-bot: sre.hosts.downtime: fix typo in message [cookbooks] - 10https://gerrit.wikimedia.org/r/705938 (owner: 10Volans) [16:01:24] (03CR) 10Brennen Bearnes: "Looks good, I think. I'll test this and the ECS logging change (705715) together against the Ansible test box." [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705930 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [16:01:55] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] logging: format nginx access logs as JSON [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705715 (https://phabricator.wikimedia.org/T274462) (owner: 10Brennen Bearnes) [16:02:08] (03PS4) 10Brennen Bearnes: make prometheus exporters reachable [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705930 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [16:02:13] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog: Access request to Superset for toberto - https://phabricator.wikimedia.org/T286746 (10toberto) Thanks for your patience with us, @RLazarus , while we go off protocol! I have messaged Shari and asked she email or Slack you with ap... 
[16:04:18] (03PS8) 10Btullis: Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) [16:04:21] (03PS1) 10Dzahn: conftool: convert mw1421, mw1422 from app to API servers for balance [puppet] - 10https://gerrit.wikimedia.org/r/705943 (https://phabricator.wikimedia.org/T279309) [16:04:30] (03PS5) 10Btullis: Update sre.cassandra.roll-restart cookbook to use new spicerack API [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) [16:04:56] !log upload cas_6.3.2-1+wmf10u1 to apt [16:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:12] (03CR) 10Elukey: "Thanks looks more clear now :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/705869 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [16:05:37] (03PS2) 10Dzahn: conftool: convert mw1421, mw1422 from app to API servers for balance [puppet] - 10https://gerrit.wikimedia.org/r/705943 (https://phabricator.wikimedia.org/T279309) [16:08:30] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw1421.eqiad.wmnet ` The log can be found in `/var/log/wmf-... 
[16:09:32] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1421.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1421.eqiad.wmnet'] ` [16:11:04] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:18] (03PS1) 10Bartosz Dziewoński: Do not teardown newtopictool interface if it was not setup [extensions/DiscussionTools] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/705913 (https://phabricator.wikimedia.org/T287035) [16:16:28] (03PS1) 10Bartosz Dziewoński: Do not teardown newtopictool interface if it was not setup [extensions/DiscussionTools] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/705914 (https://phabricator.wikimedia.org/T287035) [16:16:51] (03PS3) 10Dzahn: conftool: convert mw1421, mw1422 from app to API servers for balance [puppet] - 10https://gerrit.wikimedia.org/r/705943 (https://phabricator.wikimedia.org/T279309) [16:17:23] (03Merged) 10jenkins-bot: Define PREFETCHING_TERM_LOOKUP for all types in client and repo [extensions/WikibaseLexeme] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/705912 (https://phabricator.wikimedia.org/T287085) (owner: 10Lucas Werkmeister (WMDE)) [16:17:35] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2001.codfw.wmnet with reason: REIMAGE [16:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:57] hashar: can I deploy that backport? 
[16:18:11] (asking since mediawiki-staging on deploy1002 is currently at u+1, which I’m not used to ^^) [16:19:38] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2001.codfw.wmnet with reason: REIMAGE [16:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:46] (03PS4) 10Dzahn: conftool: convert mw1421, mw1422 from app to API servers for balance [puppet] - 10https://gerrit.wikimedia.org/r/705943 (https://phabricator.wikimedia.org/T279309) [16:21:04] testing the WikibaseLexeme backport on mwdebug2001… [16:21:46] (03PS5) 10Hnowlan: maps: make maps2010 a buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/702615 (https://phabricator.wikimedia.org/T269582) [16:21:48] seems to work, syncing [16:23:50] !log lucaswerkmeister-wmde@deploy1002 scap failed: average error rate on 2/6 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/83629bcb5560d11e61d3085c89dd9ed6 for details) [16:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:11] 10SRE, 10Performance-Team, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10dpifke) Looks like we're already tracking DELETEs, e.g. the second panel in https://grafana-rw.wikimedia.org/d/OPgmB1Eiz/swift?or... 
[16:27:04] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.37.0-wmf.15/extensions/WikibaseLexeme/WikibaseLexeme.entitytypes.php: restore previous state after previous scap failed on canaries with seemingly legitimate error (duration: 01m 04s) [16:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:46] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 1116 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:28:29] I think that’s due to the first sync on the canaries, which I reverted now? [16:28:31] checking… [16:28:50] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 1032 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:29:09] at least that's lower than the previous one [16:29:39] nah, it’s broken right now. 
fuck [16:29:42] let me see… [16:31:05] trying another sync now [16:31:09] might have to force it [16:31:11] wikidata is down right now [16:31:17] so it might not make it past the canaries [16:31:28] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw1433.eqiad.wmnet, mw1365.eqiad.wmnet, mw1419.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw1387.eqiad.wmnet, mw1415.eqiad.wmnet, mw1271.eqiad.wmnet, mw1399.eqiad.wmnet, mw1420.eqiad.wmnet, mw1418.eqiad.wmnet, mw1370.eqiad.wmnet, mw1389.eqiad.wmnet, mw1395.eqiad.wmnet, mw1325.eqiad.wmnet, mw1369.eqiad.wmnet, mw1367.eqiad [16:31:28] mw1368.eqiad.wmnet, mw1373.eqiad.wmnet, mw1332.eqiad.wmnet, mw1422.eqiad.wmnet, mw1414.eqiad.wmnet, mw1417.eqiad.wmnet, mw1371.eqiad.wmnet, mw1322.eqiad.wmnet, mw1323.eqiad.wmnet, mw1327.eqiad.wmnet, mw1413.eqiad.wmnet, mw1351.eqiad.wmnet, mw1391.eqiad.wmnet, mw1352.eqiad.wmnet, mw1326.eqiad.wmnet, mw1431.eqiad.wmnet, mw1355.eqiad.wmnet, mw1407.eqiad.wmnet, mw1324.eqiad.wmnet, mw1331.eqiad.wmnet, mw1321.eqiad.wmnet, mw1403.eqiad.wmnet, mw [16:31:28] ad.wmnet, mw1411.eqiad.wmnet, mw1328.eqiad.wmnet, mw1353.eqiad.wmnet, mw1416.eqiad.wmnet, mw1330.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:31:34] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw2255.codfw.wmnet, mw2313.codfw.wmnet, mw2316.codfw.wmnet, mw2336.codfw.wmnet, mw2274.codfw.wmnet, mw2359.codfw.wmnet, mw2312.codfw.wmnet, mw2353.codfw.wmnet, mw2371.codfw.wmnet, mw2310.codfw.wmnet, mw2338.codfw.wmnet, mw2303.codfw.wmnet, mw2325.codfw.wmnet, mw2314.codfw.wmnet, mw2258.codfw.wmnet, mw2275.codfw.wmnet, mw2408.codfw [16:31:35] mw2257.codfw.wmnet, mw2387.codfw.wmnet, mw2269.codfw.wmnet, mw2373.codfw.wmnet, mw2406.codfw.wmnet, mw2361.codfw.wmnet, mw2327.codfw.wmnet, mw2386.codfw.wmnet, mw2355.codfw.wmnet, mw2385.codfw.wmnet, mw2331.codfw.wmnet, mw2305.codfw.wmnet, 
mw2388.codfw.wmnet, mw2337.codfw.wmnet, mw2389.codfw.wmnet, mw2268.codfw.wmnet, mw2301.codfw.wmnet, mw2273.codfw.wmnet, mw2276.codfw.wmnet, mw2363.codfw.wmnet, mw2329.codfw.wmnet, mw2391.codfw.wmnet, mw [16:31:35] fw.wmnet, mw2311.codfw.wmnet, mw2367.codfw.wmnet, mw2390.codfw.wmnet, mw2357.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:31:39] yes thank you i am intensely aware *sweats* [16:31:45] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - apaches_80: Servers mw2313.codfw.wmnet, mw2254.codfw.wmnet, mw2357.codfw.wmnet, mw2331.codfw.wmnet, mw2392.codfw.wmnet, mw2359.codfw.wmnet, mw2312.codfw.wmnet, mw2375.codfw.wmnet, mw2310.codfw.wmnet, mw2303.codfw.wmnet, mw2325.codfw.wmnet, mw2389.codfw.wmnet, mw2314.codfw.wmnet, mw2386.codfw.wmnet, mw2275.codfw.wmnet, mw2408.codfw.wmnet, mw2257.codfw [16:31:45] mw2369.codfw.wmnet, mw2269.codfw.wmnet, mw2365.codfw.wmnet, mw2361.codfw.wmnet, mw2353.codfw.wmnet, mw2327.codfw.wmnet, mw2373.codfw.wmnet, mw2335.codfw.wmnet, mw2339.codfw.wmnet, mw2351.codfw.wmnet, mw2385.codfw.wmnet, mw2274.codfw.wmnet, mw2305.codfw.wmnet, mw2388.codfw.wmnet, mw2337.codfw.wmnet, mw2407.codfw.wmnet, mw2268.codfw.wmnet, mw2336.codfw.wmnet, mw2333.codfw.wmnet, mw2363.codfw.wmnet, mw2329.codfw.wmnet, mw2258.codfw.wmnet, mw [16:31:45] fw.wmnet, mw2311.codfw.wmnet, mw2367.codfw.wmnet, mw2390.codfw.wmnet, mw2371.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:31:53] it did make it past the canaries [16:31:57] let’s hope [16:32:04] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.37.0-wmf.15/extensions/WikibaseLexeme/WikibaseLexeme.entitytypes.repo.php: Backport: [[gerrit:705912|Define PREFETCHING_TERM_LOOKUP for all types in client and repo (T287085)]] (2/2) – I think this might be the fastest way to fix the errors (duration: 01m 05s) [16:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:10] 
T287085: LogicException: Unable to find Service callback for Entity Type sense for Source wikidata - https://phabricator.wikimedia.org/T287085 [16:32:33] fixed in my browser, let’s see what logstash et al say [16:32:40] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [16:32:57] ^ could be due to the same issue, not sure [16:33:07] Lucas_WMDE: thanks for the revert, what is the status? Rollback completed? [16:33:14] (just to understand) [16:33:19] no, rollout completed [16:33:31] it was, apparently, not suited for a file-by-file sync, which I did not expect [16:33:35] so I synced the second file [16:33:40] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:33:41] and that seems to have fixed it [16:33:42] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:33:42] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:33:48] i sped up the Icinga checks ^ [16:33:49] I’ll write more on Phabricator after finishing cleanup [16:33:53] thanks! 
[16:34:02] or I could write an incident report on Wikitech if you think it deserves one [16:34:18] wikidata up for me [16:34:19] from logstash it looks like only Wikidata was broken, not other wikis (which is a relief) [16:34:34] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [16:34:35] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 27 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:35:22] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:35:35] error rate did go up on appservers for 5 minutes but is normal again [16:35:46] glad to see recoveries, seems ok [16:35:53] Lucas_WMDE: not sure what you want to do with https://phabricator.wikimedia.org/T287100 [16:36:10] wow, someone was really fast with that [16:36:17] then I’ll write a bit more there, thanks [16:36:21] let me clean up deploy1002 first [16:36:45] alright, I think I’m done on the deployment server [16:36:54] Yeah [16:36:54] *scrolls up* [16:36:58] Very eager person [16:37:40] wow, they made a ticket within that 5 min time frame.. 
community fastest monitoring still [16:38:05] Always mutante [16:39:14] Guarantee that they'll report all minor 5 second outages you're aware of but when it comes to a major issue that you don't know about they'll wait half an hour before telling you [16:40:00] I’m writing a phab comment now [16:40:03] 10SRE, 10GrowthExperiments-MentorDashboard, 10Growth-Team (Current Sprint), 10MW-1.37-notes (1.37.0-wmf.14; 2021-07-12), and 2 others: Mentee overview module: Run updateMenteeData.php regularly - https://phabricator.wikimedia.org/T285811 (10Urbanecm_WMF) [16:40:10] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash2001.codfw.wmnet'] ` and were **ALL** successful. [16:40:29] ACKNOWLEDGEMENT - SSH on gerrit2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:40:49] ACKNOWLEDGEMENT - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:41:13] thanks Lucas_WMDE , Icinga is all fine [16:41:23] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [16:43:31] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts: ` logstash2002.codfw.wmnet ` The log can be found in `/... 
[16:47:58] mutante, addshore: writeup at https://phabricator.wikimedia.org/T287100#7228015 [16:48:05] oh and elukey ^ [16:49:00] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2003 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [16:49:59] !log [urbanecm@mwmaint2002 ~]$ time /usr/local/bin/mw-cli-wrapper /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/updateMenteeData.php # T285811 [16:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:06] T285811: Mentee overview module: Run updateMenteeData.php regularly - https://phabricator.wikimedia.org/T285811 [16:51:02] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (10jcrespo) The next step on productionization of workers is to setup the account for access to mw content on swift. This is documented at: https://wikitech.wikim... 
[16:51:03] PROBLEM - SSH access on gerrit1001 is CRITICAL: connect to address 208.80.154.137 and port 29418: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit [16:51:31] gerrit down is known [16:51:38] 10SRE, 10serviceops, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Volans) FYI I've updated the pastes for eqiad and codfw with some more detailed data, all yours now :) [16:51:39] there is a config glitch that we supposedly have fixed yesterday [16:51:41] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:51:56] hashar: I hope it'll be fine soon :) [16:52:03] RECOVERY - SSH access on gerrit1001 is OK: SSH OK - GerritCodeReview_3.2.11 (APACHE-SSHD-2.4.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit [16:52:14] welcome back, gerrit [16:53:42] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:29] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2002.codfw.wmnet with reason: REIMAGE [17:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:52] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2002.codfw.wmnet with reason: REIMAGE [17:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:12] (03PS1) 10Majavah: Drop kubeadm 1.17 remains [puppet] - 10https://gerrit.wikimedia.org/r/705969 [17:24:08] (03PS10) 10RLazarus: icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) [17:24:34] (03CR) 10RLazarus: icinga: Write to Icinga command 
file instead of calling icinga-downtime (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [17:29:58] (03CR) 10jerkins-bot: [V: 04-1] icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [17:32:26] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash2002.codfw.wmnet'] ` and were **ALL** successful. [17:34:29] (03PS11) 10RLazarus: icinga: Write to Icinga command file instead of calling icinga-downtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) [17:46:06] (03CR) 10Bstorm: "I think I'd like to propose we split this. On one hand, changing the defaults should happen ASAP. On the other, we normally keep the last " [puppet] - 10https://gerrit.wikimedia.org/r/705969 (owner: 10Majavah) [17:47:42] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts: ` logstash2003.codfw.wmnet ` The log can be found in `/... 
[17:49:15] (03PS2) 10Majavah: Drop kubeadm 1.17 remains [puppet] - 10https://gerrit.wikimedia.org/r/705969 [17:51:52] (03PS1) 10Majavah: aptrepo: Init thirdparty/kubeadm-k8s-1-19 [puppet] - 10https://gerrit.wikimedia.org/r/705972 (https://phabricator.wikimedia.org/T280340) [17:52:45] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Automate diff and commit of frack ACL - https://phabricator.wikimedia.org/T260655 (10Jgreen) a:05Jgreen→03None [17:55:32] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code={200,204,205} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [17:56:02] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:56:22] PROBLEM - OSPF status on mr1-codfw is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:56:34] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [17:56:47] ^ big spike in appserver->memcache request rate, that's weird [17:57:02] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:57:13] request rate and also error rate, hence the increase in appserver latency [17:57:37] fully cleared on its own though [18:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210721T1800). [18:00:05] MatmaRex: A patch you scheduled for Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:05] hashar and dancy: Time to snap out of that daydream and deploy Train log triage with CPT. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210721T1800). [18:00:29] I can deploy today [18:00:30] ^ it is done. Did it at roughly 13:00 utc [18:00:41] oh no sorry I have misread [18:00:58] the scheduled patches aren't deployed AFAICS [18:01:03] MatmaRex: hi, are you around? 
[18:01:15] hi [18:01:20] (03CR) 10Urbanecm: [C: 03+2] Do not teardown newtopictool interface if it was not setup [extensions/DiscussionTools] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/705913 (https://phabricator.wikimedia.org/T287035) (owner: 10Bartosz Dziewoński) [18:01:22] (03CR) 10Urbanecm: [C: 03+2] Do not teardown newtopictool interface if it was not setup [extensions/DiscussionTools] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/705914 (https://phabricator.wikimedia.org/T287035) (owner: 10Bartosz Dziewoński) [18:01:24] great :) [18:01:30] I'll ping you when ready for tests [18:04:17] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:20] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw,logstash7-codfw,logstash7-eqiad} instance=kafkamon1002 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic=udp_localhost-err https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=no [18:06:20] 1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [18:07:28] (03Merged) 10jenkins-bot: Do not teardown newtopictool interface if it was not setup [extensions/DiscussionTools] (wmf/1.37.0-wmf.14) - 10https://gerrit.wikimedia.org/r/705913 (https://phabricator.wikimedia.org/T287035) (owner: 10Bartosz Dziewoński) [18:07:30] (03Merged) 10jenkins-bot: Do not teardown newtopictool interface if it was not setup [extensions/DiscussionTools] (wmf/1.37.0-wmf.15) - 10https://gerrit.wikimedia.org/r/705914 (https://phabricator.wikimedia.org/T287035) (owner: 10Bartosz Dziewoński) [18:08:24] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:09:10] hashar: deployment dir at deploy1002 is dirty, can you fix that? [18:09:27] sure checking [18:09:37] there's `group1 wikis to 1.37.0-wmf.15`, but that commit is not in the master branch on gerrit [18:09:57] hmm something went wrong so [18:10:00] rzl: the spike also caused "db too many connections" for es* dbs [18:10:01] or I forgot to push it maybe [18:10:34] (03PS1) 10Hashar: group1 wikis to 1.37.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705974 [18:10:50] urbanecm: ^ I forgot to push it for review :/ [18:11:01] hashar: happens :) [18:11:06] (03CR) 10Hashar: [C: 03+2] "I forgot to push it for review earlier. Sorry" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705974 (owner: 10Hashar) [18:11:15] MatmaRex: backport pulled to mwdebug2001, can you have a look? [18:11:29] we are looking at automatizing more of the train [18:11:35] looking [18:11:38] hashar: Sounds like you're not using the deploy-promote script [18:11:40] and get rid of those kind of annoyances. Thanks for the ping! [18:11:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [18:11:48] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/705974 (owner: 10Hashar) [18:12:00] git fetch fixed it [18:12:02] thanks hashar [18:12:57] urbanecm: looks good at https://test2.wikipedia.org/wiki/Talk:Main_Page [18:13:01] dancy: ~/release/bin/deploy-promote group1 , but maybe I messed up cause my ssh key to reach gerrit was not loaded [18:13:12] great MatmaRex, syncing [18:13:25] ah, did you end up control-c'ing? I did that too recently and ran into the same problem. I have a todo item to fix that. [18:14:11] ^hashar [18:14:14] dancy: possibly yes [18:14:23] ok. I'll bump priority on fixing that. 
[18:14:55] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2003.codfw.wmnet with reason: REIMAGE [18:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:04] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.14/extensions/DiscussionTools/modules/ReplyLinksController.js: aca510b773a67d24452731d5d6a33952c57592b8: Do not teardown newtopictool interface if it was not setup (T287035) (duration: 01m 05s) [18:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:11] T287035: `jQuery.Deferred exception: this.$addSectionLink is undefined` after posting a reply with DiscussionTools - https://phabricator.wikimedia.org/T287035 [18:15:47] rzl: we lost a linecard on cr2-codfw, so probably some re-convergence (/cc topranks ) [18:16:04] PROBLEM - WDQS high update lag on wdqs1004 is CRITICAL: 1.519e+05 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:16:09] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.15/extensions/DiscussionTools/modules/ReplyLinksController.js: 1453831db13e17e550a86dd99d09dc26eeb242b1: Do not teardown newtopictool interface if it was not setup (T287035) (duration: 01m 04s) [18:16:11] MatmaRex: should be live! anything else? 
[18:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:23] urbanecm: thanks [18:16:28] any time [18:16:43] !log [Elastic] Depooled `wdqs1004` and restarted `wdqs-blazegraph` [18:16:47] we lost all the cr2-codfw to all the codfw asw links [18:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:02] so total loss of redundancy in codfw [18:17:17] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2003.codfw.wmnet with reason: REIMAGE [18:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:51] RECOVERY - Query Service HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [18:17:59] !log [WDQS] Depooled `wdqs1004` and restarted `wdqs-blazegraph` [18:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:02] !log cr2-codfw> request chassis fpc slot 0 restart [18:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:29] re0.cr2-codfw> request chassis fpc slot 0 restart [18:19:29] FPC 0 is in transition, try again [18:19:55] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:20:01] XioNoX: ah thanks [18:20:07] showing absent now in "show chassis fpc" [18:20:38] sry... temp absent, State is "present" [18:21:09] I put stuff on https://etherpad.wikimedia.org/p/cr2-codfw_fpc0 [18:21:31] !log [WDQS] Restarted `wdqs-updater` on `wdqs1004` [18:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:37] that looks like hardware failure to me [18:22:57] would seem to be. [18:23:14] opening a jtac case [18:23:27] topranks: can you check if there are no signs of issues on cr1 ? 
[18:23:40] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 85, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:23:40] yep no probs. [18:24:13] they probably have the same birthday [18:26:13] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts: ` elastic2043.codfw.wmnet ` The log... [18:27:11] !log T281327 [Elastic] `sudo -i wmf-auto-reimage-host -p T281327 elastic2043.codfw.wmnet` on `ryankemper@cumin2001` tmux session `reimage_elastic2043` [18:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:19] T281327: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 [18:29:56] https://phabricator.wikimedia.org/T287110 [18:30:10] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog: Access request to Superset for toberto - https://phabricator.wikimedia.org/T286746 (10RLazarus) Received from Shari Wakiyama on her WMF email account: > Hi Reuven, > > Please give Toni Oberto, cc'd here, access to Superset to f... [18:30:23] 10SRE, 10Product-Analytics, 10SRE-Access-Requests, 10Structured-Data-Backlog: Access request to Superset for toberto - https://phabricator.wikimedia.org/T286746 (10RLazarus) [18:33:27] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1004 is CRITICAL: 1.509e+05 ge 4.32e+04 Ryan Kemper host depooled, recovering on 1.5 days of lag https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:33:45] XioNox: nothing jumping out at me on cr1-codfw. 
[18:34:34] Increase in usage on various ports (transport, transits and out to asw's), but nothing maxing out. [18:34:52] CPU, mem etc. seem ok [18:34:54] great, thanks! [18:34:58] PROBLEM - Juniper alarms on cr2-codfw is CRITICAL: JNX_ALARMS CRITICAL - 3 red alarms, 1 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [18:35:08] topranks: fyi: https://gist.github.com/XioNoX/42a61288c3f638ecd1bbd24b13d6c483 [18:35:45] haha wow. Nice work :) [18:36:51] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash2003.codfw.wmnet'] ` and were **ALL** successful. [18:44:38] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [18:47:37] 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: SSH failure for wdqs2002.mgmt.codfw.wmnet - https://phabricator.wikimedia.org/T287112 (10RKemper) [18:49:19] 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: SSH failure for wdqs2002.mgmt.codfw.wmnet - https://phabricator.wikimedia.org/T287112 (10RKemper) [18:54:15] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] "Tested against gitlab-ansible-test. Deploying." [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705930 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [18:55:13] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] "Tested" (031 comment) [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/705930 (https://phabricator.wikimedia.org/T275170) (owner: 10Jelto) [19:00:05] hashar and dancy: #bothumor Q:How do functions break up? 
A:They stop calling each other. Rise for MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210721T1900). [19:01:06] 1.37.0-wmf.15 is already at group1 so no train operation during this window. [19:01:38] (03CR) 10Effie Mouzeli: "pcc ok https://puppet-compiler.wmflabs.org/compiler1001/30293/mw1380.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/705852 (owner: 10Effie Mouzeli) [19:02:33] (03PS2) 10Effie Mouzeli: mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/705852 [19:07:43] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (10jijiki) {F34558993} Mcrouter instances in codfw are connecting directly to memeched hosts in eqiad [19:08:40] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (10jijiki) [19:08:52] 10SRE, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [19:09:51] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (10jijiki) 05Open→03Resolved a:03jijiki [19:25:33] 10SRE, 10DBA, 10Datacenter-Switchover, 10Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (10Legoktm) a:03Legoktm [19:26:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2043.codfw.wmnet'] ` Of which those **FAILED**: ` 
['elastic2043.codfw.wmnet'] ` [19:26:51] 10ops-codfw: decommission procyon - https://phabricator.wikimedia.org/T287114 (10Papaul) [19:27:48] 10ops-codfw: decommission procyon - https://phabricator.wikimedia.org/T287114 (10Papaul) p:05Triage→03Medium [19:27:54] 10SRE, 10Infrastructure-Foundations, 10netops: cr2-codfw:fpc0 crash - https://phabricator.wikimedia.org/T287110 (10cmooney) First related log I can find referencing FPC (ae interface down logs were before). Jul 21, 2021 @ 17:53:34.000 CMTFPC: Fabric request time out pfe 0 plane 1 pg 0, trying recovery.... [19:29:58] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:28] (03PS1) 10Legoktm: Enable Score via Shellbox on enwikisource and plwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706020 (https://phabricator.wikimedia.org/T257066) [20:00:05] hashar and dancy: Your horoscope predicts another unfortunate MediaWiki train - American Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210721T1900). [20:00:05] chrisalbon and accraze: Dear deployers, time to do the Services – Graphoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210721T2000). 
[20:01:51] (03CR) 10MSantos: [C: 03+1] maps: make maps2010 a buster replica of maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/702615 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [20:02:16] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:38] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:22:07] (03PS1) 10Andrew Bogott: toolforge grid master: run disable_tool.py every 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/706033 (https://phabricator.wikimedia.org/T284940) [20:26:58] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:50] !log testing upcoming Scap release on beta [20:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:54] (03PS1) 10Andrew Bogott: toolforge grid-engine cron: use disable_tool.py to archive crontab for disabled tools [puppet] - 10https://gerrit.wikimedia.org/r/706034 (https://phabricator.wikimedia.org/T284946) [20:33:56] (03PS1) 10Andrew Bogott: nfs: use disable-tool.py to archive disabled+expired tools [puppet] - 10https://gerrit.wikimedia.org/r/706035 (https://phabricator.wikimedia.org/T170355) [20:34:43] (03CR) 10Majavah: toolforge grid master: run disable_tool.py every 10 minutes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706033 (https://phabricator.wikimedia.org/T284940) (owner: 10Andrew Bogott) [20:35:26] (03CR) 10jerkins-bot: [V: 04-1] toolforge grid-engine cron: use disable_tool.py to archive crontab for disabled tools [puppet] - 10https://gerrit.wikimedia.org/r/706034 (https://phabricator.wikimedia.org/T284946) (owner: 10Andrew Bogott) [20:38:00] (03PS1) 
10Cwhite: logging: fix typo [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/706036 (https://phabricator.wikimedia.org/T274462) [20:39:52] (03PS2) 10Andrew Bogott: toolforge cron: use disable_tool.py to archive crontab for disabled tools [puppet] - 10https://gerrit.wikimedia.org/r/706034 (https://phabricator.wikimedia.org/T284946) [20:39:54] (03PS2) 10Andrew Bogott: nfs: use disable-tool.py to archive disabled+expired tools [puppet] - 10https://gerrit.wikimedia.org/r/706035 (https://phabricator.wikimedia.org/T170355) [20:40:25] (03CR) 10Volans: "Looks good, couple of minor things inline." (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [20:47:28] (03PS1) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/706038 [20:48:41] 10SRE, 10Services, 10Toolhub, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10Legoktm) [20:48:53] 10SRE, 10Services, 10Toolhub, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10Legoktm) [20:49:09] (03PS1) 10Urbanecm: urbanecm's dotfiles: Add .tmux.conf [puppet] - 10https://gerrit.wikimedia.org/r/706039 [20:49:54] could a nice SRE please merge ^^ for me? 🙂 thanks [20:50:48] yep, one min [20:50:58] (03CR) 10Kormat: [C: 03+2] urbanecm's dotfiles: Add .tmux.conf [puppet] - 10https://gerrit.wikimedia.org/r/706039 (owner: 10Urbanecm) [20:51:28] urbanecm: any machines you want that deployed to immediately? [20:51:51] kormat: nope, I have this locally at the most important machines, I can wait the 30 mins :). Thanks for the offer, and the merge. [20:52:02] np. 
happy tmuxing :) [20:52:06] thanks [20:52:21] 10SRE, 10Services, 10Toolhub, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10Legoktm) I added a checklist, assuming that containers already exist for everything, I think the next step would be to start creating a helm chart and test it locally with... [21:01:40] (03CR) 10Volans: "Nice! Final (hopefully) nits, but nothing is a blocker." (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/705500 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [21:02:38] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:03:41] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] logging: fix typo [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/706036 (https://phabricator.wikimedia.org/T274462) (owner: 10Cwhite) [21:04:26] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:05:24] 10SRE, 10Services, 10Toolhub, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808) >>! In T280881#7228714, @Legoktm wrote: > * What metrics will be exposed by Toolhub? Will it need additional prometheus exporters as sidecars? And do we already hav... 
[21:05:45] (03PS1) 10Hashar: gerrit: config values do not need double quotes [puppet] - 10https://gerrit.wikimedia.org/r/706042 (https://phabricator.wikimedia.org/T287122) [21:07:48] PROBLEM - mediawiki-installation DSH group on mw1421 is CRITICAL: Host mw1421 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:10:59] (03PS1) 10Hashar: gerrit: remove SMTP encryption option [puppet] - 10https://gerrit.wikimedia.org/r/706043 (https://phabricator.wikimedia.org/T287122) [21:18:14] (03CR) 10Andrew Bogott: toolforge grid master: run disable_tool.py every 10 minutes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706033 (https://phabricator.wikimedia.org/T284940) (owner: 10Andrew Bogott) [21:18:27] (03CR) 10Legoktm: mysql_legacy: Re-add x2 and properly support active/active sections (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [21:18:31] (03PS4) 10Legoktm: mysql_legacy: Re-add x2 and properly support active/active sections [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) [21:19:08] (03PS5) 10Legoktm: mysql_legacy: Re-add x2 and properly support active/active sections [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) [21:20:51] (03CR) 10Bstorm: toolforge grid master: run disable_tool.py every 10 minutes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706033 (https://phabricator.wikimedia.org/T284940) (owner: 10Andrew Bogott) [21:24:50] (03PS2) 10Andrew Bogott: toolforge grid master: run disable_tool.py every 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/706033 (https://phabricator.wikimedia.org/T284940) [21:24:52] (03PS3) 10Andrew Bogott: toolforge cron: use disable_tool.py to archive crontab for disabled tools [puppet] - 10https://gerrit.wikimedia.org/r/706034 
(https://phabricator.wikimedia.org/T284946) [21:24:54] (03PS3) 10Andrew Bogott: nfs: use disable-tool.py to archive disabled+expired tools [puppet] - 10https://gerrit.wikimedia.org/r/706035 (https://phabricator.wikimedia.org/T170355) [21:24:57] (03CR) 10jerkins-bot: [V: 04-1] mysql_legacy: Re-add x2 and properly support active/active sections [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [21:25:17] (03CR) 10Andrew Bogott: toolforge grid master: run disable_tool.py every 10 minutes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/706033 (https://phabricator.wikimedia.org/T284940) (owner: 10Andrew Bogott) [21:25:24] (03CR) 10jerkins-bot: [V: 04-1] toolforge grid master: run disable_tool.py every 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/706033 (https://phabricator.wikimedia.org/T284940) (owner: 10Andrew Bogott) [21:25:28] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:47] (03CR) 10jerkins-bot: [V: 04-1] toolforge cron: use disable_tool.py to archive crontab for disabled tools [puppet] - 10https://gerrit.wikimedia.org/r/706034 (https://phabricator.wikimedia.org/T284946) (owner: 10Andrew Bogott) [21:26:06] (03CR) 10jerkins-bot: [V: 04-1] nfs: use disable-tool.py to archive disabled+expired tools [puppet] - 10https://gerrit.wikimedia.org/r/706035 (https://phabricator.wikimedia.org/T170355) (owner: 10Andrew Bogott) [21:28:38] (03CR) 10Legoktm: "The new pylint error seems unrelated..." [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [21:28:56] (03CR) 10Cwhite: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/705019 (https://phabricator.wikimedia.org/T274462) (owner: 10Cwhite) [21:29:14] (03PS3) 10Andrew Bogott: toolforge grid master: run disable_tool.py every 10 minutes [puppet] - 10https://gerrit.wikimedia.org/r/706033 (https://phabricator.wikimedia.org/T284940) [21:29:16] (03PS4) 10Andrew Bogott: toolforge cron: use disable_tool.py to archive crontab for disabled tools [puppet] - 10https://gerrit.wikimedia.org/r/706034 (https://phabricator.wikimedia.org/T284946) [21:29:18] (03PS4) 10Andrew Bogott: nfs: use disable-tool.py to archive disabled+expired tools [puppet] - 10https://gerrit.wikimedia.org/r/706035 (https://phabricator.wikimedia.org/T170355) [21:31:51] (03CR) 10Legoktm: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [21:35:43] (03CR) 10Volans: "> Patch Set 5:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [21:45:46] (03CR) 10RLazarus: [C: 03+1] "Thanks for doing this! 
LGTM for the approach with a nonblocking question, I haven't nitpicked the implementation but I trust Volans's revi" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [21:53:25] (03CR) 10Legoktm: mysql_legacy: Re-add x2 and properly support active/active sections (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [22:01:48] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:06:14] (03CR) 10Cwhite: [C: 03+1] hieradata: add o11y services to service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/705343 (owner: 10Filippo Giunchedi) [22:10:19] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [22:11:47] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [22:16:30] (03CR) 10Legoktm: [C: 03+2] mysql_legacy: Re-add x2 and properly support active/active sections [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [22:18:20] (03PS1) 10Bstorm: tools prometheus: allow scraping by ip address [puppet] - 10https://gerrit.wikimedia.org/r/706047 [22:22:11] (03Merged) 10jenkins-bot: mysql_legacy: Re-add x2 and properly support active/active sections [software/spicerack] - 10https://gerrit.wikimedia.org/r/701474 (https://phabricator.wikimedia.org/T285519) (owner: 10Legoktm) [22:24:28] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:01] (03CR) 10Bstorm: "The tools-prometheus instance uses a python script to get hosts to scrape from openstack. 
This is ultimately an unnecessary step for node " [puppet] - 10https://gerrit.wikimedia.org/r/706047 (owner: 10Bstorm) [22:26:38] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts: ` elastic2043.codfw.wmnet ` The log... [22:32:04] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2043.codfw.wmnet'] ` Of which those **FAILED**: ` ['elastic2043.codfw.wmnet'] ` [22:32:17] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts: ` elastic2043.codfw.wmnet ` The log... 
[22:32:45] 10SRE, 10docker-pkg, 10serviceops, 10Release Pipeline (Blubber): Container image lifecycle management - https://phabricator.wikimedia.org/T287130 (10RLazarus) p:05Triage→03Medium [22:33:20] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [22:35:30] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 30.05 ms [22:36:37] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:44] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [22:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:40] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:58] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1006.eqiad.wmnet --dest wdqs1010.eqiad.wmnet --reason "transferring fresh categories journal to resolve categories update lag unknown alert status" --blazegraph_instance categories --without-lvs` on `ryankemper@cumin1001` tmux session `wdqs` [22:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:04] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [22:38:51] (03PS1) 10Hashar: gerrit: listen on all address with iptables rule [puppet] - 10https://gerrit.wikimedia.org/r/706049 (https://phabricator.wikimedia.org/T287122) [22:40:08] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:49] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:41:52] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1006.eqiad.wmnet --dest wdqs1009.eqiad.wmnet --reason "transferring fresh categories journal to resolve categories 
update lag unknown alert status" --blazegraph_instance categories --without-lvs` on `ryankemper@cumin1001` tmux session `wdqs` [22:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:02] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [22:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:14] 10SRE, 10DBA, 10Datacenter-Switchover, 10Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (10Legoktm) 05Open→03Resolved Still needs a new spicerack release, but hopefully finally fixed now :) [22:49:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10Jclark-ctr) @RobH Sorry for delay cable was not seated completely in switch. fixed has link now [22:49:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10Jclark-ctr) [22:52:38] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10Jclark-ctr) @cmooney @ayounsi Dac cables arrived today finished install with dac and updated netbox with console ports to scs please let me know if anything else is needed [22:56:59] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10Jclark-ctr) [23:00:05] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210721T2300). Please do the needful. 
[23:00:05] legoktm: A patch you scheduled for Evening backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:08] 10SRE, 10ops-eqiad: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (10wiki_willy) a:03Jclark-ctr [23:01:19] 10SRE, 10ops-codfw: decommission procyon - https://phabricator.wikimedia.org/T287114 (10Papaul) [23:02:54] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:06:08] 10SRE, 10Infrastructure-Foundations, 10netops: cr2-codfw:fpc0 crash - https://phabricator.wikimedia.org/T287110 (10Papaul) Email from JTAC ` Please perform a physical re-seat of the card. Remove it and insert it back into the chassis. If this doesn’t work, we’ll proceed with a replacement. ` [23:11:49] (03PS3) 10Legoktm: Uninstall Score on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704149 (https://phabricator.wikimedia.org/T257066) [23:12:07] (03CR) 10Legoktm: [C: 03+2] Uninstall Score on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704149 (https://phabricator.wikimedia.org/T257066) (owner: 10Legoktm) [23:12:28] (03PS2) 10Legoktm: Enable Score via Shellbox on enwikisource and plwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706020 (https://phabricator.wikimedia.org/T257066) [23:12:33] (03CR) 10Legoktm: [C: 03+2] Enable Score via Shellbox on enwikisource and plwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706020 (https://phabricator.wikimedia.org/T257066) (owner: 10Legoktm) [23:12:54] (03Merged) 10jenkins-bot: Uninstall Score on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704149 (https://phabricator.wikimedia.org/T257066) (owner: 10Legoktm) [23:13:18] (03Merged) 
10jenkins-bot: Enable Score via Shellbox on enwikisource and plwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/706020 (https://phabricator.wikimedia.org/T257066) (owner: 10Legoktm) [23:15:43] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Enable Score on enwikisource, plwikisource. Disable on all private/lockeddown wikis (T257066) (duration: 01m 03s) [23:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:50] T257066: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 [23:19:37] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 4 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Legoktm) @ankry @Inductiveload it's enabled now. Please don't immediately go and mass-purge every page using... [23:25:36] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:32:41] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2043.codfw.wmnet'] ` Of which those **FAILED**: ` ['elastic2043.codfw.wmnet'] `