[00:00:53] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:23:09] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:29:07] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:24:49] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[02:26:37] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[02:29:53] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:24:51] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:05:54] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[04:07:49] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 9 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[06:51:08] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:56:43] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[06:58:39] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 15 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210718T0700)
[07:01:38] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:35:23] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:45:48] (CR) Volans: [C: +1] "LGTM, thanks" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/704875 (owner: RLazarus)
[09:30:21] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:36:13] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:51:59] PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 1323 MB (4% inode=95%): /tmp 1323 MB (4% inode=95%): /var/tmp 1323 MB (4% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops
[10:00:25] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:02:19] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:12:55] RECOVERY - Disk space on elastic1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops
[11:02:18] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[11:04:13] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[11:20:18] PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_5@production-logstash-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:42:09] PROBLEM - MariaDB Replica Lag: s2 on db1170 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1193.97 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:55:53] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:21:45] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:23:28] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[12:25:23] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 13 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[12:25:35] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[14:20:42] (PS1) Majavah: puppetmaster: Collect prometheus metrics about git-sync-upstream [puppet] - https://gerrit.wikimedia.org/r/705184
[14:21:30] (CR) jerkins-bot: [V: -1] puppetmaster: Collect prometheus metrics about git-sync-upstream [puppet] - https://gerrit.wikimedia.org/r/705184 (owner: Majavah)
[14:24:09] (PS2) Majavah: puppetmaster: Collect prometheus metrics about git-sync-upstream [puppet] - https://gerrit.wikimedia.org/r/705184
[14:34:53] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:47:38] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:11:19] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:15:11] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:37:47] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:17:23] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:17:38] PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:21:03] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:23:21] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:24:13] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:26:01] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:38:38] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:46:09] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:48:03] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:14:20] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:16:01] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:43:13] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[22:45:07] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator