[00:01:44] PROBLEM - Check systemd state on ml-etcd1003 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:38] (KubernetesAPILatency) firing: (34) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:07:38] (KubernetesAPILatency) firing: (33) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:09:16] RECOVERY - Check systemd state on ml-etcd1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:10:58] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [00:12:40] (KubernetesAPILatency) firing: (26) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:13:56] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [00:17:38] (KubernetesAPILatency) firing: (30) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:19:00] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:58] PROBLEM - ganeti-noded running on ganeti1022 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [00:21:18] RECOVERY - ganeti-noded running on ganeti1022 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [00:22:38] (KubernetesAPILatency) firing: (33) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:27:38] (KubernetesAPILatency) firing: (34) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:28:48] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [00:30:50] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:38] (KubernetesAPILatency) firing: (33) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:33:16] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [00:37:38] (KubernetesAPILatency) firing: (31) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:42:18] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [00:42:38] (KubernetesAPILatency) firing: (29) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:44:48] PROBLEM - ganeti-mond running on ganeti1022 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [00:45:10] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:08] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:16] RECOVERY - ganeti-mond running on ganeti1022 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [00:47:38] (KubernetesAPILatency) firing: (32) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:50:09] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/963971 [00:50:15] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/963971 (owner: 10TrainBranchBot) [00:52:38] (KubernetesAPILatency) firing: (29) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:54:12] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [00:57:38] (KubernetesAPILatency) firing: (31) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:00:02] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [01:02:38] (KubernetesAPILatency) firing: (31) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:05:46] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/963971 (owner: 10TrainBranchBot) [01:06:00] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [01:07:38] (KubernetesAPILatency) firing: (33) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:12:38] (KubernetesAPILatency) firing: (33) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:12:53] (KubernetesAPILatency) firing: (33) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:16:34] PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:17:38] (KubernetesAPILatency) firing: (34) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:19:22] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [01:22:38] (KubernetesAPILatency) firing: (36) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:27:38] (KubernetesAPILatency) firing: (36) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:31:06] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [01:32:38] (KubernetesAPILatency) firing: (33) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:37:38] (KubernetesAPILatency) firing: (35) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:39:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:42:38] (KubernetesAPILatency) firing: (36) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:43:32] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:28] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [01:47:38] (KubernetesAPILatency) firing: (35) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:51:50] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [01:52:38] (KubernetesAPILatency) firing: (35) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:57:38] (KubernetesAPILatency) firing: (34) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:57:48] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [01:58:36] PROBLEM - Check systemd state on ml-etcd1003 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:02:38] (KubernetesAPILatency) firing: (30) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:03:46] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [02:07:38] (KubernetesAPILatency) firing: (31) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:08:12] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [02:12:38] (KubernetesAPILatency) firing: (31) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:12:42] RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:14:54] RECOVERY - Check systemd state on ml-etcd1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:17:06] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [02:17:38] (KubernetesAPILatency) firing: (31) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:17:53] (KubernetesAPILatency) firing: (31) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:20:04] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [02:22:38] (KubernetesAPILatency) firing: (32) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:24:36] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [02:27:26] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [02:27:38] (KubernetesAPILatency) firing: (28) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:30:26] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [02:32:39] (KubernetesAPILatency) firing: (29) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:33:32] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:38] (KubernetesAPILatency) firing: (32) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:38:32] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:18] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [02:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:42:38] (KubernetesAPILatency) firing: (35) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:43:40] PROBLEM - Etcd cluster health on ml-etcd1003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [02:47:38] (KubernetesAPILatency) firing: (34) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:52:39] (KubernetesAPILatency) firing: (32) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:55:28] RECOVERY - Etcd cluster health on ml-etcd1003 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [02:56:02] PROBLEM - Check systemd state on ml-serve-ctrl1001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-apiserver-safe-restart.service,kube-publish-sa-cert.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:57:38] (KubernetesAPILatency) firing: (32) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:57:54] PROBLEM - Check systemd state on irc1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:48] PROBLEM - MD RAID on ganeti1022 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 2, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [03:02:39] (KubernetesAPILatency) firing: (30) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:03:32] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:07:39] (KubernetesAPILatency) resolved: (29) High Kubernetes API latency (GET ) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:14:00] RECOVERY - Check systemd state on irc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:22:26] RECOVERY - Check systemd state on ml-serve-ctrl1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:24:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:29:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:53:54] (03PS1) 10Legoktm: Revert "admin: Temporarily disable legoktm's access" [puppet] - 10https://gerrit.wikimedia.org/r/964174 [04:57:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:59:14] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:09:11] (03PS1) 10Legoktm: WIP: planet: Update for rawdog v3 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231008T0700) [07:04:21] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:29:20] PROBLEM - Query Service HTTP Port on wdqs1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [08:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [09:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [11:08:32] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:44:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:49:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:02:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:07:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:18:31] (KubernetesAPINotScrapable) firing: k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [13:23:30] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:38:32] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:48:32] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:18] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST challenges) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:13:18] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST challenges) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:14:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:19:18] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:01:03] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:06:03] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:23:30] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [18:48:32] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:16:14] PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:13:46] RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:23:45] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:23:17] (PoolcounterFullQueues) firing: (2) Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [22:24:00] (NodeTextfileStale) firing: Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:28:17] (PoolcounterFullQueues) resolved: (2) Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [22:39:00] (NodeTextfileStale) firing: (2) Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:48:32] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:52:49] (03CR) 10Tim Starling: [C: 03+1] thumbor: add imagemagick policy file [deployment-charts] - 10https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan) [22:57:02] !log [WDQS] Restarted `wdqs1014`; blazegraph has been deadlocked since `2023-10-07 12:30:00` [22:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:06] RECOVERY - WDQS SPARQL on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:58:23] !log [WDQS] Depooled `wdqs1014` while it catches up on a day of lag [22:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:34] RECOVERY - Query Service HTTP Port on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [23:02:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:04:00] (NodeTextfileStale) firing: (3) Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:13:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:14:00] (NodeTextfileStale) firing: (4) Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:14:36] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:15:22] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:15:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.265 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:16:42] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.250 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:18:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:19:00] (NodeTextfileStale) firing: (6) Stale textfile for cloudvirt2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:23:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:28:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:51:05] 10SRE, 10DBA: Error connecting to db2109 as user wikiadmin2023 - https://phabricator.wikimedia.org/T348419 (10MarcoAurelio) [23:55:33] 10SRE, 10DBA: Error connecting to db2109 as user wikiadmin2023: :real_connect(): (HY000/2002): Connection refused - https://phabricator.wikimedia.org/T348419 (10MarcoAurelio)