[00:00:53] <icinga-wm>	 RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:23:09] <icinga-wm>	 PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:29:07] <icinga-wm>	 PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:24:49] <icinga-wm>	 PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[02:26:37] <icinga-wm>	 RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[02:29:53] <icinga-wm>	 RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:24:51] <icinga-wm>	 RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:05:54] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[04:07:49] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 9 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[06:51:08] <icinga-wm>	 RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:56:43] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[06:58:39] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 15 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210718T0700)
[07:01:38] <icinga-wm>	 PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:35:23] <icinga-wm>	 PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:45:48] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, thanks" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/704875 (owner: RLazarus)
[09:30:21] <icinga-wm>	 PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:36:13] <icinga-wm>	 RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:51:59] <icinga-wm>	 PROBLEM - Disk space on elastic1039 is CRITICAL: DISK CRITICAL - free space: / 1323 MB (4% inode=95%): /tmp 1323 MB (4% inode=95%): /var/tmp 1323 MB (4% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops
[10:00:25] <icinga-wm>	 PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:02:19] <icinga-wm>	 RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:12:55] <icinga-wm>	 RECOVERY - Disk space on elastic1039 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops
[11:02:18] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[11:04:13] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[11:20:18] <icinga-wm>	 PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_5@production-logstash-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:42:09] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on db1170 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1193.97 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:55:53] <icinga-wm>	 PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:21:45] <icinga-wm>	 PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:23:28] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[12:25:23] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 13 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[12:25:35] <icinga-wm>	 RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[14:20:42] <wikibugs>	 (PS1) Majavah: puppetmaster: Collect prometheus metrics about git-sync-upstream [puppet] - https://gerrit.wikimedia.org/r/705184
[14:21:30] <wikibugs>	 (CR) jerkins-bot: [V: -1] puppetmaster: Collect prometheus metrics about git-sync-upstream [puppet] - https://gerrit.wikimedia.org/r/705184 (owner: Majavah)
[14:24:09] <wikibugs>	 (PS2) Majavah: puppetmaster: Collect prometheus metrics about git-sync-upstream [puppet] - https://gerrit.wikimedia.org/r/705184
[14:34:53] <icinga-wm>	 RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:47:38] <icinga-wm>	 PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:11:19] <icinga-wm>	 PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:15:11] <icinga-wm>	 RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:37:47] <icinga-wm>	 PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:17:23] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:17:38] <icinga-wm>	 PROBLEM - etcd request latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:21:03] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:23:21] <icinga-wm>	 PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:24:13] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:26:01] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[18:38:38] <icinga-wm>	 RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:46:09] <icinga-wm>	 PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 6533: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:48:03] <icinga-wm>	 RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:14:20] <icinga-wm>	 PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:16:01] <icinga-wm>	 RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:43:13] <icinga-wm>	 PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[22:45:07] <icinga-wm>	 RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator