[00:50:41] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:01:11] SRE, Technical-blog-posts, Wikimedia-Mailing-lists: Story idea for Blog: Discovering and fixing CVE-2021-33038 in Mailman3 - https://phabricator.wikimedia.org/T284486 (srodlund) Open→Resolved Done! :-)
[02:13:05] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:13:45] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:27:05] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:29:23] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:53:09] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:27:45] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:43:09] PROBLEM - snapshot of s6 in codfw on alert1001 is CRITICAL: snapshot for s6 at codfw taken more than 3 days ago: Most recent backup 2021-06-09 04:29:46 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[04:53:51] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:38:01] PROBLEM - MegaRAID on db2148 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:38:02] ACKNOWLEDGEMENT - MegaRAID on db2148 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T284852 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:38:06] SRE, ops-codfw: Degraded RAID on db2148 - https://phabricator.wikimedia.org/T284852 (ops-monitoring-bot)
[05:44:47] RECOVERY - snapshot of s6 in codfw on alert1001 is OK: Last snapshot for s6 at codfw (db2141.codfw.wmnet:3316) taken on 2021-06-12 04:37:53 (576 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[06:42:55] RECOVERY - MegaRAID on db2148 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:03:25] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for WQuarshie - https://phabricator.wikimedia.org/T284832 (Aklapper) That would be https://www.mediawiki.org/wiki/Phabricator/Help#Creating_your_account / https://phabricator.wikimedia.org/settings/panel/external/
[09:26:29] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:27:02] SRE, Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (Esc3300) p:Medium→Unbreak! Can the patch be reviewed?
[09:29:06] SRE, Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (RhinosF1) p:Unbreak!→Medium This isn't UBN!
[09:29:33] SRE, Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (Esc3300) Given the time lapsed, it's no longer medium priority. So is it...
[09:31:10] SRE, Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (RhinosF1) >>! In T204830#7152944, @Esc3300 wrote: > Given the time lapse...
[10:17:25] SRE, LDAP-Access-Requests: Add Dat Nguyen to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284285 (Volans)
[10:18:44] SRE, LDAP-Access-Requests: Add Dat Nguyen to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284285 (Volans)
[10:19:45] SRE, LDAP-Access-Requests: Add Dat Nguyen to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284285 (Volans) Open→Resolved a:Volans @dang I've added you to the `wmde` and `nda` LDAP groups. I'm resolving this request but feel free to re-open it in case you encounter any...
[10:23:24] SRE, LDAP-Access-Requests: Add Kara Payne to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284308 (Volans)
[10:23:55] SRE, LDAP-Access-Requests: Add Kara Payne to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284308 (Volans) Open→Resolved a:Volans @karapayneWMDE I've added you to the `wmde` and `nda` LDAP groups. I'm resolving this request but feel free to re-open it in case you encou...
[10:39:25] SRE, Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (Aklapper) @Esc3300: Please see https://www.mediawiki.org/wiki/Bug_manage...
[10:41:22] SRE, Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (Esc3300) Can we assign the task to someone?
[10:49:12] (PS1) Majavah: Use Python 3 [software/tools-webservice] - https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586)
[10:50:16] (CR) jerkins-bot: [V: -1] Use Python 3 [software/tools-webservice] - https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586) (owner: Majavah)
[10:55:42] (PS2) Majavah: Use Python 3 [software/tools-webservice] - https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586)
[10:56:44] (CR) jerkins-bot: [V: -1] Use Python 3 [software/tools-webservice] - https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586) (owner: Majavah)
[10:58:41] (Abandoned) Majavah: Drop python2 flake8 runs [software/tools-webservice] - https://gerrit.wikimedia.org/r/697095 (owner: Majavah)
[10:59:52] (PS3) Majavah: Use Python 3 [software/tools-webservice] - https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586)
[11:00:54] (CR) jerkins-bot: [V: -1] Use Python 3 [software/tools-webservice] - https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586) (owner: Majavah)
[11:04:47] (PS6) MarcoAurelio: enwiki: Remove 'collectionsaveascommunitypage' from the 'autoconfirmed' and 'confirmed' user groups [mediawiki-config] - https://gerrit.wikimedia.org/r/698041
(https://phabricator.wikimedia.org/T283523)
[11:08:17] RECOVERY - kartotherian endpoints health on maps2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian
[11:45:34] SRE, Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (RhinosF1) No because we don't know who on SRE will be around next week.
[11:51:59] (PS1) Majavah: Fix prometheus monitoring for Toolforge Ingress [puppet] - https://gerrit.wikimedia.org/r/699484 (https://phabricator.wikimedia.org/T284353)
[12:07:10] (PS1) Majavah: toolforge:toolviews: Allow disabling toolviews in hiera [puppet] - https://gerrit.wikimedia.org/r/699485 (https://phabricator.wikimedia.org/T284558)
[12:08:36] (CR) jerkins-bot: [V: -1] toolforge:toolviews: Allow disabling toolviews in hiera [puppet] - https://gerrit.wikimedia.org/r/699485 (https://phabricator.wikimedia.org/T284558) (owner: Majavah)
[12:09:27] (PS2) Majavah: toolforge:toolviews: Allow disabling toolviews in hiera [puppet] - https://gerrit.wikimedia.org/r/699485 (https://phabricator.wikimedia.org/T284558)
[12:10:59] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert.
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[13:34:57] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:37:14] PROBLEM - MariaDB Replica Lag: x2 #page on db1153 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 85889.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:38:02] PROBLEM - MariaDB Replica Lag: x2 #page on db1152 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 85936.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:38:24] here
[13:38:42] here too
[13:38:47] here, see -sre
[13:38:48] see -sre
[13:38:52] just downtime them
[13:39:04] on it
[13:39:11] kormat: just start pt heartbeat
[13:39:13] on the master
[13:39:15] thank you
[13:39:18] and that should be it
[13:39:19] marostegui: i guessed :)
[13:39:25] I will be home in 10 though
[13:39:27] wait, downtime or heartbeat or both?
[13:39:29] I can do it
[13:39:35] rzl both :)
[13:39:39] 👍
[13:39:40] or I can do it in 10
[13:39:43] rzl: i'm doing the heartbeat, you do the downtime, please
[13:39:52] kormat: sgtm
[13:39:52] RECOVERY - MariaDB Replica Lag: x2 #page on db1152 is OK: OK slave_sql_lag Replication lag: 0.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:40:29] marostegui: it's a good thing i have that puppet CR to fix this issue on monday ;)
[13:40:45] going back afk but otherwise available
[13:40:58] RECOVERY - MariaDB Replica Lag: x2 #page on db1153 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:41:27] kormat: good timing indeed
[13:41:55] marostegui, kormat: downtimed all the MariaDB Replica * services on db1152 for 48h
[13:42:24] rzl: ty <3
[13:42:41] marostegui: i double-checked the codfw master, it has heartbeat running
[13:42:47] so we should be good, now
[13:43:00] i'd like to know why it took ~a day for an alert to fire
[13:43:06] (CR) Paladox: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/699506 (owner: Paladox)
[13:43:17] x2 glitch with no user impact?
[13:43:23] jynus: correct.
[13:43:26] cool
[13:43:37] I have to go if nothing serious, sorry
[13:43:41] (CR) Paladox: "For reference I got an email from wmcs for the puppet failure on wikistats-wild-tiger.wikistats.eqiad.wmflabs." [puppet] - https://gerrit.wikimedia.org/r/699506 (owner: Paladox)
[13:43:45] jynus: np o7
[13:43:51] thanks for taking care
[13:45:31] (CR) Paladox: "Note that it was broken anyways hence why that change fixes it. So we were using an unused param which is now used correctly." [puppet] - https://gerrit.wikimedia.org/r/699506 (owner: Paladox)
[13:45:35] filed https://phabricator.wikimedia.org/T284858 to look into the alerting question
[13:45:39] thanks folks, we should be good now.
[13:45:46] PROBLEM - MariaDB read only x2 #page on db1151 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.4.19-MariaDB-log, Uptime 86143s, event_scheduler: True, 16.59 QPS, connection latency: 0.003376s, query latency: 0.000442s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[13:45:54] * volans here
[13:45:57] what's up?
[13:46:04] orr. not?
[13:46:10] didn't downtime that one! adding it
[13:46:24] kormat: ^ assuming that's also fine
[13:46:31] rzl: yeah go for it
[13:46:32] should I just do the whole host?
[13:46:36] yeah, please
[13:46:38] you can just set global read_only=off
[13:46:42] marostegui: i'm going to do that too
[13:46:43] on the master
[13:46:48] Acked
[13:46:48] I'm home already
[13:46:49] but i no longer trust the host not to alert for yet another reason
[13:46:57] getting on the lift
[13:47:12] oh man I didn't even realize the first two alerts were different hosts, I gotta get more sleep somehow
[13:47:13] marostegui: need more narrative detail on your journey
[13:47:14] To the Batcave?
[13:47:17] okay, doing it properly now :)
[13:47:38] RECOVERY - MariaDB read only x2 #page on db1151 is OK: Version 10.4.19-MariaDB-log, Uptime 86254s, read_only: False, event_scheduler: True, 29.28 QPS, connection latency: 0.004043s, query latency: 0.000478s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[13:47:59] kormat: xdddddd
[13:49:41] !log rzl@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 6 hosts with reason: alert noise, no impact, x2 is unused
[13:49:43] !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 6 hosts with reason: alert noise, no impact, x2 is unused
[13:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:54] I am on my laptop now
[13:49:57] rzl: perfect, thank you
[13:50:11] marostegui: you are "surplus to requirements"
[13:50:48] what's left? should I take a general look at all x2 hosts to make sure I didn't forget anything else? :)
[13:50:58] marostegui: you should. _on monday_.
[13:51:05] * kormat shoos marostegui away
[13:51:30] Thanks for handling my mistakes :(+
[13:51:46] that's literally my entire job description. ;)
[14:03:05] * sobanski takes a note to update kormat’s job description in official documents
[14:05:06] :D
[14:05:55] XD
[14:20:39] (PS1) ZAR: eswiki: Add Abuse Filter blocks [mediawiki-config] - https://gerrit.wikimedia.org/r/699489
[14:31:23] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting.
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[15:01:14] (CR) BryanDavis: "The `Can't exec "pyversions"` failure is likely because you removed the `--buildsystem=pybuild` argument in d/rules based on the web searc" (2 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586) (owner: Majavah)
[15:05:39] (CR) Majavah: "> Patch Set 3:" (2 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586) (owner: Majavah)
[15:06:27] (PS4) Majavah: Use Python 3 [software/tools-webservice] - https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586)
[15:07:30] (CR) jerkins-bot: [V: -1] Use Python 3 [software/tools-webservice] - https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586) (owner: Majavah)
[15:13:05] (PS5) Majavah: Use Python 3 [software/tools-webservice] - https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586)
[15:14:15] (CR) jerkins-bot: [V: -1] Use Python 3 [software/tools-webservice] - https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586) (owner: Majavah)
[15:30:37] (CR) BryanDavis: "```" [software/tools-webservice] - https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586) (owner: Majavah)
[15:36:27] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:28:55] (PS6) Majavah: Use Python 3 [software/tools-webservice] - https://gerrit.wikimedia.org/r/699482 (https://phabricator.wikimedia.org/T284586)
[16:36:57] (PS1) Aklapper: Phabricator: Disable setting lowest priority on tasks [puppet] - https://gerrit.wikimedia.org/r/699493 (https://phabricator.wikimedia.org/T228759)
[16:38:18]
(CR) Aklapper: [C: -1] "Controversial hence -1'ing for the time being" [puppet] - https://gerrit.wikimedia.org/r/699493 (https://phabricator.wikimedia.org/T228759) (owner: Aklapper)
[16:57:12] (CR) Majavah: Route Grid engine web requests via Kubernetes (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) (owner: Majavah)
[18:07:38] (PS1) Zabe: eswiki AbuseFilter config changes for [mediawiki-config] - https://gerrit.wikimedia.org/r/699494 (https://phabricator.wikimedia.org/T284797)
[18:09:11] (CR) Zabe: [C: -1] "dupe of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/699494" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/699489 (owner: ZAR)
[18:10:38] (PS2) Zabe: eswiki AbuseFilter config changes [mediawiki-config] - https://gerrit.wikimedia.org/r/699494 (https://phabricator.wikimedia.org/T284797)
[18:11:41] SRE, Wikimedia-Mailing-lists: Bot unable to send messages to wikipedia-fr-wikimag - https://phabricator.wikimedia.org/T265844 (Orlodrim) Open→Resolved a:Orlodrim The last 6 emails passed the spam filter, so I'm closing this bug. It might have been solved by the recent update of mailman.
[19:08:32] (PS1) Zabe: Enable wikilove on hewikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/699495 (https://phabricator.wikimedia.org/T284864)
[19:35:29] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 761 ge 480 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[19:37:19] RECOVERY - Logstash Elasticsearch indexing errors #o11y on alert1001 is OK: (C)480 ge (W)60 ge 16 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[22:07:23] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:08:01] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook