[00:06:11] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:24:39] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:04:07] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[03:06:03] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[03:22:31] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[03:24:23] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:55:19] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 126 probes of 621 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:12:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 43 probes of 621 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:35:36] (CR) Zfilipin: "🥳" [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/697069 (https://phabricator.wikimedia.org/T274579) (owner: Sahilgrewalhere)
[08:07:47] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:34:03] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[08:35:59] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[08:41:08] (CR) Volans: [C: +2] "Thanks for the fix!" [cookbooks] - https://gerrit.wikimedia.org/r/705103 (https://phabricator.wikimedia.org/T201491) (owner: DannyS712)
[08:42:53] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:43:54] (Merged) jenkins-bot: Typo fix: "the the" -> "the" [cookbooks] - https://gerrit.wikimedia.org/r/705103 (https://phabricator.wikimedia.org/T201491) (owner: DannyS712)
[09:08:39] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:14:37] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:43:43] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:14:41] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:15:27] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:16:38] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:38:16] (PS1) Ladsgroup: mariadb: Migrate cron of check_private_data to systemd timer [puppet] - https://gerrit.wikimedia.org/r/705045 (https://phabricator.wikimedia.org/T273673)
[12:38:44] (CR) jerkins-bot: [V: -1] mariadb: Migrate cron of check_private_data to systemd timer [puppet] - https://gerrit.wikimedia.org/r/705045 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[12:42:21] (PS2) Ladsgroup: mariadb: Migrate cron of check_private_data to systemd timer [puppet] - https://gerrit.wikimedia.org/r/705045 (https://phabricator.wikimedia.org/T273673)
[12:42:54] (CR) jerkins-bot: [V: -1] mariadb: Migrate cron of check_private_data to systemd timer [puppet] - https://gerrit.wikimedia.org/r/705045 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[12:48:46] (PS3) Ladsgroup: mariadb: Migrate cron of check_private_data to systemd timer [puppet] - https://gerrit.wikimedia.org/r/705045 (https://phabricator.wikimedia.org/T273673)
[12:49:14] (CR) jerkins-bot: [V: -1] mariadb: Migrate cron of check_private_data to systemd timer [puppet] - https://gerrit.wikimedia.org/r/705045 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[12:57:25] (PS4) Ladsgroup: mariadb: Migrate cron of check_private_data to systemd timer [puppet] - https://gerrit.wikimedia.org/r/705045 (https://phabricator.wikimedia.org/T273673)
[13:03:08] (CR) Ladsgroup: "PCC https://puppet-compiler.wmflabs.org/compiler1001/30242/" [puppet] - https://gerrit.wikimedia.org/r/705045 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[13:09:51] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[13:11:47] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:13:21] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:18:21] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:45:15] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:37:31] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:15:51] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:20:07] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:22:03] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:22:51] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:40:15] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:11:13] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:14:51] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:22:13] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:24:03] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:08:09] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:10:03] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:24:28] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:43:11] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:51:03] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:52:59] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[23:18:03] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[23:58:59] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems