[00:01:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:03:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:06:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:08:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:19:24] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:26] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:02] PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7ff386d1a280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration [00:34:12] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:35:38] RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 667, active_shards: 1504, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [00:35:46] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [01:23:04] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:23:52] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:25:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:30:38] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:20] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:32:30] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:37:46] (JobUnavailable) firing: (2) Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:46] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:54] PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:53:40] PROBLEM - puppet last run on gitlab2002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:57:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:24] RECOVERY - puppet last run on gitlab2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:17:46] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:44] RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:35:08] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:35:54] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:48:26] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:49:14] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [03:03:36] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [03:05:04] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [03:26:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:31:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:13:44] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [04:15:16] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [04:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [05:10:10] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:52:02] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:58:22] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:03:10] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [07:04:46] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [07:37:20] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:37:46] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:38:24] PROBLEM - Host cp5024 is DOWN: PING CRITICAL - Packet loss = 100% [07:38:26] PROBLEM - Host ganeti5007 is DOWN: PING CRITICAL - Packet loss = 100% [07:38:34] RECOVERY - Host cp5024 is UP: PING OK - Packet loss = 0%, RTA = 260.96 ms [07:38:48] RECOVERY - Host ganeti5007 is UP: PING WARNING - Packet loss = 66%, RTA = 248.55 ms [07:39:00] PROBLEM - Host cr2-eqsin.mgmt is 
DOWN: PING CRITICAL - Packet loss = 100% [07:39:00] PROBLEM - Host ps1-603-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [07:39:12] RECOVERY - Host ps1-603-eqsin is UP: PING OK - Packet loss = 0%, RTA = 256.94 ms [07:39:18] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:40:03] (ProbeDown) firing: (9) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:40:08] RECOVERY - Host cr2-eqsin.mgmt is UP: PING WARNING - Packet loss = 50%, RTA = 283.49 ms [07:40:11] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [07:40:18] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:40:36] PROBLEM - NTP peers on dns5003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [07:40:52] PROBLEM - NTP peers on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [07:40:58] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:41:09] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [07:41:18] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:41:36] PROBLEM - Auth DNS on dns5003 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [07:41:44] PROBLEM - SSH on cp5032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:41:54] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:42:00] PROBLEM - Host 2001:df2:e500:1:103:102:166:8 is DOWN: PING CRITICAL - Packet loss = 100% [07:42:18] RECOVERY - NTP peers on dns5004 is OK: NTP OK: Offset -0.000672 secs https://wikitech.wikimedia.org/wiki/NTP [07:42:26] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5018.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmne [07:42:26] 2.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5023.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:42:28] PROBLEM - SSH on cp5031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:42:32] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmne [07:42:32] 7.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:42:36] SRE, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (AlexisJazz) I just received one trying to look up https://en.wiktionary.org/wiki/stuff. Not reproducible it seems. 
[07:42:42] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [07:42:58] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /private-info/info.json (private tile service info for osm-intl) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:43:00] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:43:02] RECOVERY - Host 2001:df2:e500:1:103:102:166:8 is UP: PING OK - Packet loss = 0%, RTA = 231.44 ms [07:43:10] RECOVERY - SSH on cp5032 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:43:38] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:43:38] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:43:38] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:43:58] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:44:04] PROBLEM - Host cp5022 is DOWN: PING CRITICAL - Packet loss = 100% [07:44:04] PROBLEM - SSH on cp5020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:44:08] PROBLEM - SSH on cp5024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:44:10] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:44:10] PROBLEM - Host doh5002 is DOWN: PING CRITICAL - Packet loss = 100% [07:44:16] RECOVERY - Host cp5022 is UP: PING WARNING - Packet loss = 66%, RTA = 266.37 ms [07:44:25] (ProbeDown) firing: (5) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:44:52] RECOVERY - Host doh5002 is UP: PING WARNING - Packet loss = 90%, RTA = 251.45 ms [07:44:54] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp5025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [07:45:12] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:45:14] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:45:24] 
(ProbeDown) firing: (10) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:45:28] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [07:45:28] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5027 is OK: HTTP OK: HTTP/1.1 200 Ok - 48325 bytes in 2.778 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:45:34] RECOVERY - SSH on cp5020 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:45:36] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:45:38] RECOVERY - SSH on cp5031 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:45:40] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:45:42] RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:45:46] PROBLEM - Wikidough DoT Check -IPv4- on doh5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [07:46:16] PROBLEM - Auth DNS on dns5004 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [07:46:16] PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [07:46:16] PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [07:46:38] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [07:46:46] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5019 is OK: HTTP OK: HTTP/1.0 200 OK - 37182 bytes in 4.821 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:46:46] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5024 is OK: HTTP OK: HTTP/1.0 200 OK - 37131 bytes in 4.857 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:46:48] PROBLEM - check_trafficserver_backend_config_status on cp5028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.25: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:46:50] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5031 is OK: HTTP OK: HTTP/1.0 200 OK - 36937 bytes in 8.060 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:46:52] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:47:04] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [07:47:04] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5026 is OK: HTTP OK: HTTP/1.1 200 Ok - 48330 bytes in 1.838 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:47:22] RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:47:22] RECOVERY - SSH on cp5024 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:47:22] RECOVERY - Wikidough DoT Check -IPv4- on doh5001 is OK: TCP OK - 8.537 second response time on 103.102.166.14 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [07:47:24] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:47:46] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:47:50] RECOVERY - Auth DNS on dns5004 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [07:47:50] RECOVERY - Recursive DNS on 103.102.166.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [07:47:50] RECOVERY - Recursive DNS on 103.102.166.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [07:48:16] RECOVERY - check_trafficserver_backend_config_status on cp5028 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:48:18] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: 
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.eqsin.wikimedia.org, port=443): Read timed out. (read timeout=15)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [07:48:20] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5030 is OK: HTTP OK: HTTP/1.0 200 OK - 36938 bytes in 1.401 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:48:20] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5026 is OK: HTTP OK: HTTP/1.0 200 OK - 36926 bytes in 1.648 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:48:20] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp5025 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.497 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:48:30] RECOVERY - NTP peers on dns5003 is OK: NTP OK: Offset 0.001964 secs https://wikitech.wikimedia.org/wiki/NTP [07:48:54] PROBLEM - NTP peers on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [07:49:00] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:49:14] (JobUnavailable) firing: (2) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:49:24] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [07:49:24] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:49:36] RECOVERY - Auth DNS on dns5003 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [07:49:44] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5017 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.508 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:49:54] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:50:20] RECOVERY - NTP peers on dns5004 is OK: NTP OK: Offset -0.000672 secs https://wikitech.wikimedia.org/wiki/NTP [07:50:20] (ProbeDown) resolved: (5) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:51:14] (ProbeDown) resolved: (10) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:54:15] (JobUnavailable) resolved: (2) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:55:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [07:55:36] SRE, Performance-Team, Performance Issue: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (MBinder_WMF) [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230108T0800) [08:06:31] ugh 3 tasks about network connectivity issues and wikimediastatus does show a spike in connectivity issues [08:09:49] SRE, Performance-Team, Performance Issue: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (Peachey88) Have these issues been happening for a while or only recently in the... [08:13:06] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [08:14:36] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [08:51:03] p858snake: looks like eqsin lost connectivity for a bit [08:51:26] It should have pages [08:51:29] Paged [08:51:52] But I’m not seeing anything in klaxon [08:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:53:22] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:53:37] _joe_: you seem to have been active a few minutes, is anyone aware of the issues [08:54:48] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:55:07] SRE, Performance-Team, Performance Issue: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (RhinosF1) [08:55:28] SRE, Performance-Team, Performance Issue: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (RhinosF1) [08:56:04] SRE, Performance-Team, Performance Issue: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (RhinosF1) p:Triage→Unbreak! There’s a marked spike in errors and a flurr... 
[08:56:11] SRE, Performance-Team, Performance Issue, Wikimedia-Incident: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (RhinosF1) [09:00:45] SRE, Performance-Team, Performance Issue, Wikimedia-Incident: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (RhinosF1) > 07:37:20 PROBLEM - BGP status on... [09:01:13] SRE, Performance-Team, Traffic, Performance Issue, Wikimedia-Incident: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (RhinosF1) [09:01:22] <_joe_> RhinosF1: that task is most likely unrelated, please stop adding information to it [09:01:41] <_joe_> it's about ongoing issues the specific person is having that I suspect have nothing to do with our infrastructure [09:01:43] Stopping [09:02:04] <_joe_> the time correlation did fool me too at first, btw [09:02:41] _joe_: slow to respond might be, I think that can be split off and yes it being at the same time and initially sounding like a connection error happened is confusing [09:05:12] SRE, Performance-Team, Traffic, Performance Issue: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (Joe) p:Unbreak!→Low This task reports continuing issues for... [09:15:05] SRE, Traffic, Chinese-Sites, Wikimedia-Incident: zhwiki met a problem with access - https://phabricator.wikimedia.org/T326495 (RhinosF1) duplicate→Resolved p:Triage→Unbreak! This was resolved, SRE are aware of an issue around this time. The previously linked task was unrelated. [09:15:28] SRE, Traffic, Chinese-Sites, Wikimedia-Incident: zhwiki met a problem with access - https://phabricator.wikimedia.org/T326495 (RhinosF1) >>! In T326495#8507147, @SD_hehua wrote: > Emmm not a duplicate? Yep, 2 issues with slowness around the same time. Wasn’t connected though. [09:16:00] SRE, Traffic, Chinese-Sites, Wikimedia-Incident: 2023-01-08 Wikimedia Connectivity Issues - https://phabricator.wikimedia.org/T326495 (RhinosF1) [09:17:37] SRE, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (RhinosF1) >>! In T301505#8507048, @AlexisJazz wrote: > I just received one trying to look up https://en.wiktionary.org/wiki/stuff. Not reproducible it se... 
[09:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:27:28] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:22] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:55:46] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:03:48] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:05:18] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:49:10] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [13:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:06:42] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:09:50] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [16:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [17:23:52] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:39:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:44:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:34:40] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:35:36] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:35:50] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [20:35:54] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:36:22] PROBLEM - Host 2001:df2:e500:1:103:102:166:10 is DOWN: PING CRITICAL - Packet loss = 100% [20:36:32] RECOVERY - Host 2001:df2:e500:1:103:102:166:10 is UP: PING OK - Packet loss = 0%, RTA = 241.04 ms [20:36:35] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:36:38] (ProbeDown) firing: (2) Service centrallog1001:6514 has 
failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:36:43] (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:36:47] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:37:08] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:37:08] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:37:08] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:37:08] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:37:14] PROBLEM - Auth DNS on dns5004 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:37:22] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:37:22] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:37:24] PROBLEM - SSH on ganeti5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:37:24] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:37:28] PROBLEM - SSH on prometheus5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:37:28] PROBLEM - SSH on ncredir5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:37:30] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmne [20:37:30] 1.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:37:30] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp5020.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmne [20:37:30] 7.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:37:30] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:37:41] PROBLEM - Host cr3-eqsin #page is DOWN: PING CRITICAL - Packet loss = 100% [20:37:42] PROBLEM - Host text-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [20:37:58] PROBLEM - SSH on cp5018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:28] PROBLEM - SSH on lvs5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:28] PROBLEM - NTP peers on dns5003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [20:38:28] PROBLEM - SSH on cp5025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:28] PROBLEM - SSH on cp5026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:32] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5025 is OK: HTTP OK: HTTP/1.1 200 Ok - 48303 bytes in 2.047 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:38:32] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5031 is OK: HTTP OK: HTTP/1.1 200 Ok - 48329 bytes in 2.873 second response time 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:38:34] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5024 is OK: HTTP OK: HTTP/1.1 200 Ok - 48662 bytes in 4.111 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:38:44] RECOVERY - Auth DNS on dns5004 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:38:48] PROBLEM - Host cr2-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [20:38:48] RECOVERY - SSH on ganeti5007 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:52] RECOVERY - SSH on prometheus5001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:53] o/ [20:38:58] RECOVERY - SSH on ncredir5001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:39:10] PROBLEM - Auth DNS on dns5003 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:39:14] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:39:30] PROBLEM - SSH on cp5027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:39:33] RECOVERY - Host cr3-eqsin #page is UP: PING WARNING - Packet loss = 66%, RTA = 239.24 ms [20:39:52] RECOVERY - SSH on cp5025 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:39:54] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:39:54] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp5030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:39:58] RECOVERY - SSH on cp5026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:40:00] RECOVERY - SSH on lvs5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:40:18] PROBLEM - Wikidough DoT Check -IPv6- on doh5002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [20:40:20] PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:40:24] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [20:40:24] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:40:34] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmne [20:40:35] 2.eqsin.wmnet, cp5021.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:40:38] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmne [20:40:38] 1.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:41:40] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5028 is OK: HTTP OK: HTTP/1.1 200 Ok - 48347 bytes in 4.855 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:41:42] RECOVERY - Wikidough DoT Check -IPv6- on doh5002 is OK: TCP OK - 1.160 second response time on 2001:df2:e500:1:103:102:166:5 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [20:41:46] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:41:49] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [20:41:52] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5017 is OK: HTTP OK: HTTP/1.1 200 Ok - 48747 bytes in 2.553 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:41:52] PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:41:52] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:41:58] PROBLEM - SSH on durum5002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:42:03] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:42:06] (ProbeDown) firing: (7) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:42:10] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:42:12] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:42:14] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:42:26] RECOVERY - SSH on cp5027 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:42:32] PROBLEM - SSH on cp5030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:42:36] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:43:08] PROBLEM - Wikidough DoH Check -IPv4- on doh5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [20:43:14] PROBLEM - NTP peers on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [20:43:22] PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:43:28] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:43:42] RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:43:50] (JobUnavailable) firing: (2) Reduced availability for job pdu_sentry4 in ops@eqsin - 
[20:43:50] PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[20:44:00] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5024 is OK: HTTP OK: HTTP/1.0 200 OK - 37140 bytes in 2.693 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[20:44:06] PROBLEM - check_trafficserver_log_fifo_notpurge_backend on cp5030 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.27: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[20:44:06] PROBLEM - SSH on cp5021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:44:12] PROBLEM - SSH on cp5019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:44:12] PROBLEM - SSH on ganeti5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:44:15] XioNoX topranks: seems the routers in eqsin are unreachable. maintenance related?
[20:44:26] RECOVERY - Host cr2-eqsin IPv6 is UP: PING WARNING - Packet loss = 66%, RTA = 266.29 ms
[20:44:30] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5017 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 7.691 second response time https://wikitech.wikimedia.org/wiki/Varnish
[20:44:32] RECOVERY - Wikidough DoH Check -IPv4- on doh5001 is OK: HTTP OK: HTTP/1.1 200 OK - 550 bytes in 2.116 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[20:44:36] RECOVERY - NTP peers on dns5004 is OK: NTP OK: Offset -0.001119 secs https://wikitech.wikimedia.org/wiki/NTP
[20:44:46] hi
[20:44:46] cwhite: check this morning, 13 hours ago
[20:44:50] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:44:54] Similar incident possibly
[20:44:54] RECOVERY - Recursive DNS on 103.102.166.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:44:54] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:44:54] RECOVERY - Recursive DNS on 103.102.166.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:44:54] RECOVERY - SSH on durum5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:45:10] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:45:16] RECOVERY - Auth DNS on dns5003 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:45:18] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:45:20] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5018 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 4.998 second response time https://wikitech.wikimedia.org/wiki/Varnish
[20:45:22] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:45:22] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps/RunBook
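The "NTP peers" recoveries above report the measured clock offset (e.g. "Offset -0.001119 secs"). For illustration only, a toy offset check could look like the sketch below; it uses the third-party ntplib package and made-up thresholds, not the monitoring-plugins check_ntp_peer used in production:

```python
# Illustrative sketch: query a host's NTP offset and apply a critical
# threshold, loosely mirroring what an "NTP peers" offset check reports.
# Hostname and threshold are placeholders.
import ntplib  # third-party package

def ntp_offset_ok(host: str, crit: float = 0.1) -> bool:
    client = ntplib.NTPClient()
    response = client.request(host, version=3, timeout=10)
    print(f"Offset {response.offset:+.6f} secs")
    return abs(response.offset) < crit

if __name__ == "__main__":
    print(ntp_offset_ok("dns5004.wikimedia.org"))
```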
[20:45:24] RECOVERY - check_trafficserver_log_fifo_notpurge_backend on cp5030 is OK: OK: read 8 bytes as expected https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[20:45:26] RECOVERY - SSH on cp5030 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:45:30] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:45:30] RECOVERY - SSH on cp5021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:45:30] RECOVERY - SSH on cp5018 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:45:38] RECOVERY - SSH on cp5019 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:45:38] RECOVERY - SSH on ganeti5006 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:46:00] RECOVERY - Host text-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 260.84 ms
[20:46:02] RECOVERY - NTP peers on dns5003 is OK: NTP OK: Offset 0.001702 secs https://wikitech.wikimedia.org/wiki/NTP
[20:46:04] I ACKed the last remaining page
[20:46:10] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5023 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.469 second response time https://wikitech.wikimedia.org/wiki/Varnish
[20:46:30] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:46:32] * akosiaris around
[20:46:50] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:46:52] RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook
[20:46:52] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5024 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.476 second response time https://wikitech.wikimedia.org/wiki/Varnish
[20:46:52] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp5030 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.501 second response time https://wikitech.wikimedia.org/wiki/Varnish
[20:46:54] * jhathaway around as well
[20:47:08] 15:34:40 <+icinga-wm> PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:47:13] this is the first one
[20:47:29] akosiaris, jhathaway: see -sre too
[20:47:54] RhinosF1: thanks
[20:47:55] so second incident with eqsin today?
[20:47:59] hey
[20:48:10] hi XioNoX
[20:48:22] Looks that way, no idea beyond time and similar alerts on the first though
[20:48:25] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
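The "Varnish HTTP text-frontend" checks above are plain HTTP probes: GET the frontend port with a 10-second timeout and expect a 200, reporting size and response time on success. A toy sketch under those assumptions (host and port are placeholders; the real check_http plugin has many more options):

```python
# Toy sketch of an HTTP health probe like the "Varnish HTTP text-frontend"
# checks: GET / on a given port, 10-second timeout, expect HTTP 200.
import http.client
import socket
import time

def http_frontend_ok(host: str, port: int, timeout: float = 10.0) -> bool:
    start = time.monotonic()
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", "/", headers={"User-Agent": "toy-check"})
        resp = conn.getresponse()
        body = resp.read()
        elapsed = time.monotonic() - start
        print(f"HTTP OK: HTTP/{resp.version / 10:.1f} {resp.status} - "
              f"{len(body)} bytes in {elapsed:.3f} second response time")
        return resp.status == 200
    except (socket.timeout, OSError):
        print(f"CRITICAL - Socket timeout after {timeout:g} seconds")
        return False

if __name__ == "__main__":
    print(http_frontend_ok("cp5023.eqsin.wmnet", 3124))
```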
[20:48:26] looking
[20:48:29] (ProbeDown) resolved: (4) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:48:34] (ProbeDown) resolved: (6) Service text-https:443 has failed probes (http_text-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:49:31] yeah, something's not right with the transport links
[20:49:44] one has been down for a while, and maybe the other one has been misbehaving
[20:49:45] (JobUnavailable) resolved: (2) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:50:08] ok, that explains some of the other Traffic alerts we are getting and seems to match those then
[20:50:32] XioNoX: cr3 was alerting a few times overnight. I think around the time of this morning’s incident cr2 alerted too
[20:50:38] I saw this on the calendar, but not related to cr3 I assume, "cr2-eqsin <-> cr4-ulsfo SingTel circuit maintenance"
[20:51:13] it could be, it's one of the two eqsin-USA links
[20:52:37] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[20:52:52] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[21:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:47:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:52:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
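The recurring tcp_rsyslog_receiver_ip6 ProbeDown alerts above amount to a TCP/TLS reachability probe of the syslog receiver on port 6514. A loose sketch under those assumptions (the hostname's FQDN, timeout, and TLS settings are placeholders, not the blackbox-exporter configuration, and certificate verification is deliberately skipped here):

```python
# Loose sketch of a tcp_rsyslog_receiver-style probe: open a TCP connection to
# the TLS syslog receiver port (6514) and complete a TLS handshake within a
# deadline. The real probe targets the IPv6 address specifically; this sketch
# simply uses whatever address family DNS returns.
import socket
import ssl

def rsyslog_receiver_ok(host: str, port: int = 6514, timeout: float = 5.0) -> bool:
    context = ssl.create_default_context()
    context.check_hostname = False            # reachability sketch only:
    context.verify_mode = ssl.CERT_NONE       # skip certificate verification
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version() is not None  # handshake completed
    except (OSError, ssl.SSLError):
        return False

if __name__ == "__main__":
    print(rsyslog_receiver_ok("centrallog2002.codfw.wmnet"))
```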