[00:21:39] PROBLEM - MariaDB memory on clouddb1019 is CRITICAL: CRIT Memory 98% used. Largest process: mysqld (9461) = 76.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [01:25:05] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:35:24] SRE, Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875 (AntiCompositeNumber) Sure, if you get me a list of originals failing with `IDAT: invalid distance too far bac... [02:21:45] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.515e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [02:32:57] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01078 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [03:21:53] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 20.82 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [03:23:45] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 3.905 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [03:24:53] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.515e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [03:39:47] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.007781 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts 
https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [04:27:33] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:44:45] SRE, Wikimedia-Mailing-lists: Set up spare lists host in codfw, document failover procedure - https://phabricator.wikimedia.org/T286071 (Ladsgroup) One complicating factor is search indexes, they are still on the VM but they are massive and rsync would take a while (unless we have a continuous rsync all... [05:22:05] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui) @Bstorm this also includes dbproxy1018 and dbproxy1019 which are the clouddb* proxies [05:24:31] SRE, Infrastructure-Foundations, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui) [05:46:33] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.515e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [06:09:13] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01496 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [07:45:03] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:16:15] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.515e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [08:27:33] RECOVERY - Thanos compact has not run on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [08:36:57] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.515e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [08:45:47] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:48:17] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01585 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [08:53:06] !log patching postorius and mailmanclient on lists1001 for T283659 [08:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:16] T283659: mailman3: unsubscription request is held for moderation but there is no way to approve it via postorius - https://phabricator.wikimedia.org/T283659 [08:55:09] "bash: patch: command not found" [08:55:12] o.O [08:58:31] git is installed, how patch is not installed :((( [09:11:59] !log restarting mailman3-web on lists1001 to pick up patches for T283659 [09:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:08] T283659: mailman3: unsubscription request is held for moderation but there is no way to approve it via postorius - https://phabricator.wikimedia.org/T283659 [09:18:30] Another restart [10:32:49] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:34:31] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:21:47] PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% 
[14:23:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:23:27] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:23:33] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:23:49] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:24:05] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:25:03] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:25:19] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:41:45] PROBLEM - IPMI Sensor Status on cp5015 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:42:59] PROBLEM - IPMI Sensor Status on lvs5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:44:59] PROBLEM - IPMI Sensor Status on cp5009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:47:21] PROBLEM - IPMI Sensor Status on cp5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:48:35] PROBLEM - IPMI Sensor Status on cp5007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:52:18] PROBLEM - IPMI Sensor Status on cp5003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:52:26] PROBLEM - IPMI Sensor Status on ganeti5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:52:50] PROBLEM - IPMI Sensor Status on cp5004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:53:26] PROBLEM - IPMI Sensor Status on cp5010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:53:50] PROBLEM - IPMI Sensor Status on lvs5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:54:56] PROBLEM - IPMI Sensor Status on cp5013 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:56:04] PROBLEM - IPMI Sensor Status on cp5006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:56:24] PROBLEM - IPMI Sensor Status on ganeti5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:57:20] SRE, ops-eqsin: IPMI Sensor Status Power_Supply Status: Critical on various eqsin servers - https://phabricator.wikimedia.org/T286113 (RhinosF1) [14:57:37] SRE, ops-eqsin: IPMI Sensor Status Power_Supply Status: Critical on various eqsin servers - https://phabricator.wikimedia.org/T286113 (RhinosF1) p:Triage→High [14:57:46] PROBLEM - IPMI Sensor Status on ganeti5003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:59:00] PROBLEM - IPMI Sensor Status on dns5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:00:01] PROBLEM - IPMI Sensor Status on cp5011 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:02:16] PROBLEM - IPMI Sensor Status on cp5016 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:02:55] SRE, ops-eqsin, User-MediaJS: IPMI Sensor Status Power_Supply Status: Critical on various eqsin servers - https://phabricator.wikimedia.org/T286113 (MediaJS) [15:03:04] PROBLEM - IPMI Sensor Status on cp5005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:04:34] PROBLEM - IPMI Sensor Status on cp5012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:06:10] PROBLEM - IPMI Sensor Status on cp5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:08:06] PROBLEM - IPMI Sensor Status on cp5014 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:08:22] PROBLEM - IPMI Sensor Status on lvs5003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:10:08] PROBLEM - IPMI Sensor Status on cp5008 is CRITICAL: Sensor Type(s) Temperature, 
Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:13:18] PROBLEM - IPMI Sensor Status on dns5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:37:04] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:53:48] PROBLEM - DNS on cp5013.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.132.129.116 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:11:23] SRE, ops-eqsin, User-MediaJS: IPMI Sensor Status Power_Supply Status: Critical on various eqsin servers - https://phabricator.wikimedia.org/T286113 (elukey) Thanks for the ping, this seems to be a single rack problem - https://netbox.wikimedia.org/dcim/racks/77/ (rack 603) [17:11:32] seems to be rack 603 in eqsin, PS failure [17:11:36] I can reach the nodes [17:11:55] elukey: thanks for looking [17:13:49] :) in theory we should have any impact right now, all the alarms should be related to having only one power supply available rather than two [17:14:04] *shouldn't [17:15:54] not understanding the failure about asw1-eqsin, that is reported down [17:16:24] PROBLEM - DNS on cp5016.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.132.129.119 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:17:33] XioNoX: around? [17:20:25] elukey: what's up? 
[17:21:00] looks like mgmt router is down so mgmt network is unreachable [17:21:26] PROBLEM - DNS on ganeti5001.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.132.129.113 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:21:47] it seems a PS redundancy failure for one rack, but I didn't get the asw1-eqsin down alert [17:22:05] PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [17:22:07] UTC: SATURDAY, 03 JUL 14:00 - SATURDAY, 03 JUL 22:00 for the Equinix maintenance [17:22:40] ah right before the PS failure, lovely timing [17:22:40] okok [17:22:50] elukey: probably because we lost asw1 mgmt before it could alert [17:22:52] checking the CR [17:23:18] @cr3-eqsin> show system alarms [17:23:18] 2021-07-03 14:19:26 UTC Major PEM 1 Not Powered [17:23:36] I asked Traffic to check later on, Valentin should be home later and will verify if we have to do anything or not for the PS failure (in theory no but not sure) [17:23:55] can you translate? 
:D [17:24:03] elukey: unless Equinix unplugs the wrong power feed we should be good [17:24:20] elukey: see "SERVICE IMPACTING MAINTENANCE Scheduled Customer Outage in 1 hour: Shutdown Maintenance of PDU, ACB and LV Switchboard at L6 A2 at SG3 [5-206136314890]" [17:24:33] basically equinix is doing power maintenance on the power feeds [17:25:01] ah okok now I get it, I didn't think about checking these things in the mailing list [17:25:09] going to update https://phabricator.wikimedia.org/T286113 [17:25:16] after they're done we will need to check that everything came back up too [17:25:29] elukey: I think Jaime and Riccardo sent an email too [17:26:04] once, a PSU died, and the day they did the other feed, the device went full down [17:26:20] that would ssssuuuuuuccckkkk [17:26:34] given this is the start of the long holiday for most folks :-/ [17:26:55] SRE, ops-eqsin, Traffic, User-MediaJS: IPMI Sensor Status Power_Supply Status: Critical on various eqsin servers - https://phabricator.wikimedia.org/T286113 (elukey) ` 19:17 XioNoX: around? 19:20 elukey: what's up? 19:21 looks like mgmt router is down so mgmt network... [17:27:24] apergos: https://phabricator.wikimedia.org/T206861#4664474 [17:27:47] half a PDU stayed offline [17:28:31] I do not remember this at all. huh [17:28:56] sometimes a crap memory is a blessing... [17:29:51] alright I have to go to a BBQ but will bring laptop with me [17:30:04] XioNoX: thanks a lot for the context [17:30:08] have a good one :) [17:31:21] SRE, ops-eqsin, Traffic, User-MediaJS: IPMI Sensor Status Power_Supply Status: Critical on various eqsin servers - https://phabricator.wikimedia.org/T286113 (elukey) Info about a similar use case (credits to Arzhel): https://phabricator.wikimedia.org/T206861#4664474 Things to decide: 1) Do we n... [17:31:23] thanks for being on top of it! 
[17:36:21] (PS1) Elukey: Depool eqsin [dns] - https://gerrit.wikimedia.org/r/703031 (https://phabricator.wikimedia.org/T286113) [17:36:31] not sure if this is needed, but created in case --^ [17:36:42] we also have eqiad depooled [17:38:32] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:40:03] SRE, ops-eqsin, Traffic, Patch-For-Review, User-MediaJS: IPMI Sensor Status Power_Supply Status: Critical on various eqsin servers - https://phabricator.wikimedia.org/T286113 (elukey) >>! In T286113#7195875, @elukey wrote: > Thanks for the ping, this seems to be a single rack problem - https:... [17:40:15] ok it seems that both eqsin racks are affected by the PS problem [17:41:03] anybody around for a brainbounce about https://gerrit.wikimedia.org/r/c/operations/dns/+/703031 ? [17:41:09] otherwise I need to page people in [17:43:37] elukey: +1 let's depool it for now [17:44:17] XioNoX: eqsin is already depooled, can you check the cr above? 
[17:44:23] err eqiad [17:44:46] we only have a loss of redundancy, no user impact [17:44:59] but depool would put us in a better spot [17:45:16] okok, if you could review/+1 I'll merge and deploy [17:45:26] (CR) Ayounsi: [C: +1] Depool eqsin [dns] - https://gerrit.wikimedia.org/r/703031 (https://phabricator.wikimedia.org/T286113) (owner: Elukey) [17:45:36] done [17:45:38] thanks [17:45:46] thank you :) [17:46:19] !log depool eqsin due to loss of power redundancy (equinix maintenance) - T286113 [17:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:27] T286113: IPMI Sensor Status Power_Supply Status: Critical on various eqsin servers - https://phabricator.wikimedia.org/T286113 [17:46:36] (CR) Elukey: [C: +2] Depool eqsin [dns] - https://gerrit.wikimedia.org/r/703031 (https://phabricator.wikimedia.org/T286113) (owner: Elukey) [17:48:19] ok change deployed to all dns nodes [17:48:31] we should see traffic draining gently [17:49:26] PROBLEM - DNS on cp5015.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.132.129.118 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:50:18] PROBLEM - DNS on ganeti5003.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.132.129.115 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:50:27] maintenance window is UTC: SATURDAY, 03 JUL 14:00 - SATURDAY, 03 JUL 22:00 [17:50:38] so it may last for a bit [17:51:41] but they are saying one hour shutdown [17:52:12] PROBLEM - DNS on cp5014.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.132.129.117 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:54:19] SRE, ops-eqsin, Traffic, Patch-For-Review, User-MediaJS: IPMI Sensor Status Power_Supply Status: Critical on various eqsin servers - https://phabricator.wikimedia.org/T286113 (elukey) Me and Arzhel decided to depool eqsin, the PS redundancy 
failure's maintenance window seems to be: ` UTC: S... [17:56:36] ok I'll check later on, need to go away for dinner, Valentin should also double check when he gets home [17:56:48] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 45.42 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:57:20] this is expected --^ [18:00:28] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:00:28] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:11:55] traffic to ulsfo increased a lot https://grafana.wikimedia.org/d/000000500/varnish-caching?orgId=1&refresh=15m&from=now-12h&to=now&var-cluster=All&var-site=ulsfo&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5 [18:13:55] that's expected :) [18:14:28] yep yep [18:14:58] it grew a lot, I was only checking if it is sustainable [18:15:20] we now have eqiad and eqsin not pooled [18:16:11] vgutierrez: I think that we could keep the situation monitored and then see tomorrow if we can repool (or later on during the US time in case) [18:16:15] what do you think? 
[18:17:37] +1 [18:17:53] maintenance window closes in less than 4 hours [18:18:56] vgutierrez: perfect, thanks for checking, ttl :) [19:01:06] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:03:50] SRE, Wikimedia-Mailing-lists: Make auditing members of mailing lists bound to a user right easier - https://phabricator.wikimedia.org/T286122 (Quiddity) Another complexity: * There are //probably// going to be (many?) instances where someone uses different email addresses for wiki-user and mailing-lists [19:09:27] SRE, Wikimedia-Mailing-lists: Make auditing members of mailing lists bound to a user right easier - https://phabricator.wikimedia.org/T286122 (Ladsgroup) Indeed but I hope at least with having an audit we can have a list to manually check [20:02:20] PROBLEM - Thanos compact has high percentage of failures on alert1001 is CRITICAL: job=thanos-compact https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [20:05:54] RECOVERY - Thanos compact has high percentage of failures on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [20:54:30] a critical alert due to Thanos does not seem too reassuring [20:54:36] should we call the Avengers ? :) [20:54:47] Lol [20:55:31] https://xkcd.com/705/ [20:56:03] oh, indeed [20:56:11] we can call a sysadmin instead :D [20:56:26] I really like that one [20:57:51] That might become my favourite [20:58:29] there's also a bigger image [20:58:35] a t-shirt i think? [20:59:18] Fun thing about sysadmins is they'll do anything to keep their stuff working but no one ever knows [20:59:59] if everything works, why do you need the sysadmins? [21:00:25] and if things fail, what are they good for ? 
[21:00:55] really hard to get recognition :/ [21:15:00] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:15:36] RECOVERY - IPMI Sensor Status on cp5002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:16:00] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:17:30] RECOVERY - IPMI Sensor Status on cp5014 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:17:48] RECOVERY - IPMI Sensor Status on lvs5003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:19:36] RECOVERY - IPMI Sensor Status on cp5008 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:19:54] power seems to be back :) [21:20:58] RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 223.14 ms [21:21:12] RECOVERY - IPMI Sensor Status on cp5015 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:21:16] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 227.40 ms [21:21:58] RECOVERY - IPMI Sensor Status on dns5001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:22:12] RECOVERY - IPMI Sensor Status on lvs5001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:22:58] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:23:14] See also: https://craphound.com/overclocked/Cory_Doctorow_-_Overclocked_-_When_Sysadmins_Ruled_the_Earth.html for a fun short story in the same vein as xkcd #705 [21:23:26] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 240.45 ms [21:24:14] RECOVERY - IPMI Sensor Status on cp5009 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:26:44] RECOVERY - IPMI Sensor Status on cp5001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:27:54] RECOVERY - IPMI Sensor Status on cp5007 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:31:30] RECOVERY - IPMI Sensor Status on cp5003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:31:42] RECOVERY - IPMI Sensor Status on ganeti5001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:32:24] RECOVERY - IPMI Sensor Status on cp5004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:33:10] RECOVERY - IPMI Sensor Status on cp5010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:33:38] RECOVERY - IPMI Sensor Status on lvs5002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:34:40] RECOVERY - IPMI Sensor Status on cp5013 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:35:48] RECOVERY - IPMI Sensor Status on cp5006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:36:10] RECOVERY - IPMI Sensor Status on ganeti5002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:36:12] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:37:30] RECOVERY - IPMI Sensor Status on ganeti5003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:38:54] RECOVERY - IPMI Sensor Status on dns5002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:39:54] RECOVERY - IPMI Sensor Status on cp5011 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:40:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:42:10] RECOVERY - IPMI Sensor Status on cp5016 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:42:56] RECOVERY - IPMI Sensor Status on cp5005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:44:40] RECOVERY - IPMI Sensor Status on cp5012 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:52:06] RECOVERY - DNS on cp5015.mgmt is OK: DNS OK: 0.024 seconds response time. cp5015.mgmt.eqsin.wmnet returns 10.132.129.118 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:52:38] RECOVERY - DNS on ganeti5003.mgmt is OK: DNS OK: 0.020 seconds response time. ganeti5003.mgmt.eqsin.wmnet returns 10.132.129.115 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:54:34] RECOVERY - DNS on cp5014.mgmt is OK: DNS OK: 0.011 seconds response time. 
cp5014.mgmt.eqsin.wmnet returns 10.132.129.117 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:55:20] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:56:24] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:57:02] RECOVERY - DNS on cp5013.mgmt is OK: DNS OK: 0.016 seconds response time. cp5013.mgmt.eqsin.wmnet returns 10.132.129.116 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:20:24] RECOVERY - DNS on cp5016.mgmt is OK: DNS OK: 0.019 seconds response time. cp5016.mgmt.eqsin.wmnet returns 10.132.129.119 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:24:48] RECOVERY - DNS on ganeti5001.mgmt is OK: DNS OK: 0.011 seconds response time. ganeti5001.mgmt.eqsin.wmnet returns 10.132.129.113 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook