[00:02:09] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [00:02:17] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:07:03] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:23:25] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:27:53] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [00:32:55] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:40:27] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:47:13] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [00:50:01] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:10:24] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:13:03] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [01:34:37] PROBLEM - WDQS high update lag on wdqs1006 is CRITICAL: 6.939e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [01:41:43] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:36:45] RECOVERY - WDQS high update lag on wdqs1006 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.132e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [03:09:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [03:14:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [03:19:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [03:24:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [03:33:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [03:43:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [04:02:01] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:09:13] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:43:43] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [04:48:23] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [04:50:55] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [04:55:37] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [04:55:43] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [05:00:38] (03PS2) 10Krinkle: Choose wikiversions.php file relative to MWMultiVersion.php (revived) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759521 (owner: 10Ahmon Dancy) [05:01:10] (03CR) 10Krinkle: [C: 03+1] Choose wikiversions.php file relative to MWMultiVersion.php (revived) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759521 (owner: 10Ahmon Dancy) [05:10:03] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [05:16:07] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:22:05] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [05:26:57] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [05:38:57] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [05:46:07] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [05:53:39] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:55:45] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [06:17:11] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:33:05] (03PS3) 10Krinkle: Migrate $wmfRealm calls to $wmgRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759300 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [06:38:22] (03CR) 10Krinkle: [C: 03+1] "LGTM. I found no other places that still read wmfRealm (only places that co-write it, which needs to stay for now)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759300 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [07:05:51] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) is CRITICAL: Test Get site-specific CSS returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value [07:05:51] /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [07:51:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [07:56:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220206T0800) [08:30:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [08:35:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [08:52:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [08:57:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [09:32:14] (03PS1) 10Majavah: Ensure GlobalBlocking is not loaded without CentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760202 (https://phabricator.wikimedia.org/T299371) [09:59:09] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 105 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:29:24] (03CR) 10Jcrespo: [C: 03+1] "+1 patch looks good based on request, checked same id as cloud, etc.." [puppet] - 10https://gerrit.wikimedia.org/r/759846 (https://phabricator.wikimedia.org/T300878) (owner: 10Ladsgroup) [10:51:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [10:56:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [11:09:17] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:11:39] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:03:57] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:38:23] (03CR) 10Zabe: [C: 03+1] Ensure GlobalBlocking is not loaded without CentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760202 (https://phabricator.wikimedia.org/T299371) (owner: 10Majavah) [12:48:07] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:00:13] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:09:31] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:11:55] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:18:30] (03Abandoned) 10MarcoAurelio: incubatorwiki: Increase AbuseFilter thresholds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756572 (https://phabricator.wikimedia.org/T299868) (owner: 10MarcoAurelio) [13:26:23] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:31:25] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:38:21] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:50:39] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:00:19] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:12:21] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:26:45] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:31:35] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:36:21] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:43:37] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:46:01] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:07:57] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:11:11] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:13:19] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [15:15:37] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [15:20:25] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:29:35] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:38:49] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [15:41:31] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:46:19] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:50:49] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:03:07] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:09:49] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:16:39] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:42:26] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [16:44:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [16:51:50] PROBLEM - kubelet operational latencies on kubernetes1014 is CRITICAL: instance=kubernetes1014.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:15:04] PROBLEM - Check systemd state on an-worker1140 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:32] PROBLEM - Hadoop NodeManager on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:26:44] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:28:04] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:28:20] RECOVERY - Hadoop NodeManager on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:29:10] RECOVERY - Check systemd state on an-worker1140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:14] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:45:32] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to} (Fetch dictionary meaning without specifying a provider) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [17:48:26] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:50:40] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:57:26] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [18:04:26] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [18:06:14] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [18:11:30] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:23:58] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:30:14] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title} (Fetch enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}/{revision} (Translate enwiki protected page) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate [18:30:14] d the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [18:39:42] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [19:09:54] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:18:16] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to}/{provider} (Fetch dictionary meaning without specifying a provider) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}/{revision} (Translate enwiki protected page) timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICA [19:18:16] Suggest a source title to use for translation returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [19:20:16] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [19:27:22] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [19:31:24] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:32:04] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [19:33:44] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:39:16] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title} (Fetch enwiki protected page) timed out before a response was received: /v1/dictionary/{word}/{from}/{to}/{provider} (Fetch dictionary meaning with a given provider) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate r [19:39:16] the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [19:40:48] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:43:58] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [19:52:40] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:02:58] 10SRE, 10Product-Infrastructure-Team-Backlog, 10Maps (Kartotherian), 10Patch-For-Review, 10Sustainability (Incident Followup): Kartotherian/Maps outage followups, 2020-10-29 - https://phabricator.wikimedia.org/T266807 (10Aklapper) a:05sdkim→03None Removing inactive task assignee. [20:05:22] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title}/{revision} (Fetch enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}/{revision} (Translate enwiki protected page) timed out before a response was received: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) i [20:05:22] AL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [20:11:32] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:12:22] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [20:13:36] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:19:40] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to}/{provider} (Fetch dictionary meaning without specifying a provider) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [20:28:06] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:31:22] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [20:35:14] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:36] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to} (Fetch dictionary meaning without specifying a provider) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [20:39:56] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:45:04] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [20:51:46] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:01:14] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:08:22] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:09:18] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [21:11:46] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [21:13:06] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:19:02] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to}/{provider} (Fetch dictionary meaning with a given provider) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target langua [21:19:02] returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [21:27:02] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:40:06] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [21:49:48] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title} (Fetch enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [21:53:00] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:54:12] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [22:01:08] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to} (Fetch dictionary meaning without specifying a provider) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [22:04:22] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:04:41] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [22:07:46] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [22:07:56] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [22:11:16] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:15:04] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [22:17:20] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [22:18:20] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:24:22] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title}/{revision} (Fetch enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [22:26:30] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [22:35:42] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to} (Fetch dictionary meaning without specifying a provider) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [22:40:08] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [22:47:08] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to} (Fetch dictionary meaning with a given provider) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}/{revision} (Translate enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [22:54:54] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:02:02] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:06:48] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:08:20] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [23:08:30] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:08:48] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:09:14] PROBLEM - MariaDB Replica IO: s5 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:09:30] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:13:56] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:15:00] PROBLEM - MariaDB Replica IO: s2 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:17:24] RECOVERY - MariaDB Replica IO: s2 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:18:00] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to} (Fetch dictionary meaning without specifying a provider) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [23:18:48] RECOVERY - MariaDB Replica IO: s5 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:19:04] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:19:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [23:23:28] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:30:38] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:34:38] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [23:41:58] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to} (Fetch dictionary meaning with a given provider) timed out before a response was received: /v1/dictionary/{word}/{from}/{to}/{provider} (Fetch dictionary meaning with a given provider) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX