[00:15:22] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[00:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:41] SRE, ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (Cmjohnson) @ayounsi replaced the fiber and cleared interface statistics
[00:19:09] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:25] !log repooling mw1450 (forgot to after benchmarking finished)
[00:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:44] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS buster
[00:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:49] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1003.eqiad.wmnet with OS buster
[00:28:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1004.eqiad.wmnet with OS buster
[00:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:38] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster
[00:28:42] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudbackup1004.eqiad.wmnet with OS buster
[00:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:47] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster...
[00:28:47] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudbackup1003.eqiad.wmnet with OS buster
[00:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:52] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1003.eqiad.wmnet with OS buster...
[00:35:17] (CR) Ladsgroup: Add MySQL upgrade cookbook (6 comments) [cookbooks] - https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: Ladsgroup)
[00:35:52] (PS6) Ladsgroup: Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814)
[00:49:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:49:53] SRE, SRE-Access-Requests: Google Search Console access request for Andrew Green - https://phabricator.wikimedia.org/T298262 (AndyRussG) Thanks so so much @Dzahn!!! Hugely appreciated!
[00:53:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:23:27] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:24:33] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:31:15] PROBLEM - Host cp2029 is DOWN: PING CRITICAL - Packet loss = 100%
[04:10:16] eh
[04:11:14] hm, not one of the special hosts
[04:16:09] !log powercycling cp2029 via mgmt
[04:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:19:41] RECOVERY - Host cp2029 is UP: PING OK - Packet loss = 0%, RTA = 31.56 ms
[04:20:12] !log depooled cp2029 now that it's up
[04:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:28:16] ops-codfw, Traffic: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (Legoktm)
[07:07:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic2035-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[07:23:37] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:24:05] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:25:45] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:26:15] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:57:01] (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic2035-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[09:13:57] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:16:07] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:57:05] (PS1) Volans: redfish: improve support for DRY-RUN mode [software/spicerack] - https://gerrit.wikimedia.org/r/749852
[10:18:33] (PS1) Majavah: hieradata: add drmrs to striker's trusted proxies [puppet] - https://gerrit.wikimedia.org/r/749854
[10:32:47] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:42:55] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:13:46] SRE, vm-requests: : of VMs requested for - https://phabricator.wikimedia.org/T298297 (Pascal2357)
[11:14:10] SRE, vm-requests: : of VMs requested for - https://phabricator.wikimedia.org/T298298 (Pascal2357)
[11:14:47] SRE, vm-requests: : of VMs requested for - https://phabricator.wikimedia.org/T298299 (Pascal2357)
[11:16:02] SRE, vm-requests: : of VMs requested for - https://phabricator.wikimedia.org/T298297 (RhinosF1) Open→Invalid
[11:18:00] ^ handled
[11:33:55] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:43:55] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:04:13] PROBLEM - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:06:47] PROBLEM - MegaRAID on db2147 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:06:48] ACKNOWLEDGEMENT - MegaRAID on db2147 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T298301 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:06:52] SRE, ops-codfw: Degraded RAID on db2147 - https://phabricator.wikimedia.org/T298301 (ops-monitoring-bot)
[14:16:38] SRE, ops-codfw, Data-Persistence, Dumps-Generation: Degraded RAID on db2147 - https://phabricator.wikimedia.org/T298301 (RhinosF1) This is the vslow/dumps host for S4. Was added in February via {T275633}
[15:05:15] RECOVERY - SSH on db2083.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:48:43] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:49:45] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:05:45] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:26:42] SRE, ops-codfw, DBA: Degraded RAID on db2147 - https://phabricator.wikimedia.org/T298301 (Marostegui) p:Triage→Medium
[18:31:29] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:33:41] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:07:53] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:42:47] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:08:18] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@e282d2d]: (no justification provided)
[20:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:25] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@e282d2d]: (no justification provided) (duration: 00m 06s)
[20:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:07] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:54:49] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:13:17] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:11:49] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:57:23] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook