[00:01:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:03:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:06:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:08:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:19:24] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:26] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:02] PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7ff386d1a280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration [00:34:12] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:35:38] RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 667, active_shards: 1504, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [00:35:46] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [01:23:04] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:23:52] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:25:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:30:38] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:20] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:32:30] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:37:46] (JobUnavailable) firing: (2) Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:46] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:54] PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:53:40] PROBLEM - puppet last run on gitlab2002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [01:57:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:24] RECOVERY - puppet last run on gitlab2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:17:46] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:44] RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:35:08] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:35:54] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:48:26] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:49:14] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [03:03:36] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [03:05:04] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [03:26:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:31:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:13:44] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [04:15:16] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [04:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [05:10:10] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:52:02] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:58:22] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:03:10] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [07:04:46] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [07:37:20] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:37:46] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:38:24] PROBLEM - Host cp5024 is DOWN: PING CRITICAL - Packet loss = 100% [07:38:26] PROBLEM - Host ganeti5007 is DOWN: PING CRITICAL - Packet loss = 100% [07:38:34] RECOVERY - Host cp5024 is UP: PING OK - Packet loss = 0%, RTA = 260.96 ms [07:38:48] RECOVERY - Host ganeti5007 is UP: PING WARNING - Packet loss = 66%, RTA = 248.55 ms [07:39:00] PROBLEM - Host cr2-eqsin.mgmt is 
DOWN: PING CRITICAL - Packet loss = 100% [07:39:00] PROBLEM - Host ps1-603-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [07:39:12] RECOVERY - Host ps1-603-eqsin is UP: PING OK - Packet loss = 0%, RTA = 256.94 ms [07:39:18] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:40:03] (ProbeDown) firing: (9) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:40:08] RECOVERY - Host cr2-eqsin.mgmt is UP: PING WARNING - Packet loss = 50%, RTA = 283.49 ms [07:40:11] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [07:40:18] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:40:36] PROBLEM - NTP peers on dns5003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [07:40:52] PROBLEM - NTP peers on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [07:40:58] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:41:09] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [07:41:18] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:41:36] PROBLEM - Auth DNS on dns5003 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [07:41:44] PROBLEM - SSH on cp5032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:41:54] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:42:00] PROBLEM - Host 2001:df2:e500:1:103:102:166:8 is DOWN: PING CRITICAL - Packet loss = 100% [07:42:18] RECOVERY - NTP peers on dns5004 is OK: NTP OK: Offset -0.000672 secs https://wikitech.wikimedia.org/wiki/NTP [07:42:26] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5018.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmne [07:42:26] 2.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5024.eqsin.wmnet, cp5018.eqsin.wmnet, cp5023.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:42:28] PROBLEM - SSH on cp5031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:42:32] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmne [07:42:32] 7.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5024.eqsin.wmnet, cp5023.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:42:36] SRE, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (AlexisJazz) I just received one trying to look up https://en.wiktionary.org/wiki/stuff. Not reproducible it seems. 
[07:42:42] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [07:42:58] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /private-info/info.json (private tile service info for osm-intl) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:43:00] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:43:02] RECOVERY - Host 2001:df2:e500:1:103:102:166:8 is UP: PING OK - Packet loss = 0%, RTA = 231.44 ms [07:43:10] RECOVERY - SSH on cp5032 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:43:38] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:43:38] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:43:38] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:43:58] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:44:04] PROBLEM - Host cp5022 is DOWN: PING CRITICAL - Packet loss = 100% [07:44:04] PROBLEM - SSH on cp5020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:44:08] PROBLEM - SSH on cp5024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:44:10] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:44:10] PROBLEM - Host doh5002 is DOWN: PING CRITICAL - Packet loss = 100% [07:44:16] RECOVERY - Host cp5022 is UP: PING WARNING - Packet loss = 66%, RTA = 266.37 ms [07:44:25] (ProbeDown) firing: (5) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:44:52] RECOVERY - Host doh5002 is UP: PING WARNING - Packet loss = 90%, RTA = 251.45 ms [07:44:54] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp5025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [07:45:12] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:45:14] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:45:24] 
(ProbeDown) firing: (10) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:45:28] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [07:45:28] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5027 is OK: HTTP OK: HTTP/1.1 200 Ok - 48325 bytes in 2.778 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:45:34] RECOVERY - SSH on cp5020 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:45:36] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:45:38] RECOVERY - SSH on cp5031 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:45:40] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:45:42] RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:45:46] PROBLEM - Wikidough DoT Check -IPv4- on doh5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [07:46:16] PROBLEM - Auth DNS on dns5004 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [07:46:16] PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [07:46:16] PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [07:46:38] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [07:46:46] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5019 is OK: HTTP OK: HTTP/1.0 200 OK - 37182 bytes in 4.821 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:46:46] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5024 is OK: HTTP OK: HTTP/1.0 200 OK - 37131 bytes in 4.857 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:46:48] PROBLEM - check_trafficserver_backend_config_status on cp5028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.25: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:46:50] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5031 is OK: HTTP OK: HTTP/1.0 200 OK - 36937 bytes in 8.060 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:46:52] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:47:04] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [07:47:04] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5026 is OK: HTTP OK: HTTP/1.1 200 Ok - 48330 bytes in 1.838 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:47:22] RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:47:22] RECOVERY - SSH on cp5024 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:47:22] RECOVERY - Wikidough DoT Check -IPv4- on doh5001 is OK: TCP OK - 8.537 second response time on 103.102.166.14 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [07:47:24] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:47:46] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:47:50] RECOVERY - Auth DNS on dns5004 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [07:47:50] RECOVERY - Recursive DNS on 103.102.166.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [07:47:50] RECOVERY - Recursive DNS on 103.102.166.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [07:48:16] RECOVERY - check_trafficserver_backend_config_status on cp5028 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:48:18] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: 
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.eqsin.wikimedia.org, port=443): Read timed out. (read timeout=15)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [07:48:20] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5030 is OK: HTTP OK: HTTP/1.0 200 OK - 36938 bytes in 1.401 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:48:20] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5026 is OK: HTTP OK: HTTP/1.0 200 OK - 36926 bytes in 1.648 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [07:48:20] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp5025 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.497 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:48:30] RECOVERY - NTP peers on dns5003 is OK: NTP OK: Offset 0.001964 secs https://wikitech.wikimedia.org/wiki/NTP [07:48:54] PROBLEM - NTP peers on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [07:49:00] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:49:14] (JobUnavailable) firing: (2) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:49:24] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [07:49:24] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:49:36] RECOVERY - Auth DNS on dns5003 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [07:49:44] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5017 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.508 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:49:54] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:50:20] RECOVERY - NTP peers on dns5004 is OK: NTP OK: Offset -0.000672 secs https://wikitech.wikimedia.org/wiki/NTP [07:50:20] (ProbeDown) resolved: (5) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:51:14] (ProbeDown) resolved: (10) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:54:15] (JobUnavailable) resolved: (2) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:55:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [07:55:36] SRE, Performance-Team, Performance Issue: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (MBinder_WMF) [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230108T0800) [08:06:31] ugh 3 tasks about network connectivity issues and wikimediastatus does show a spike in connectivity issues [08:09:49] SRE, Performance-Team, Performance Issue: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (Peachey88) Have these issues been happening for a while or only recently in the... [08:13:06] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [08:14:36] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [08:51:03] p858snake: looks like eqsin lost connectivity for a bit [08:51:26] It should have pages [08:51:29] Paged [08:51:52] But I’m not seeing anything in klaxon [08:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:53:22] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:53:37] _joe_: you seem to have been active a few minutes, is anyone aware of the issues [08:54:48] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:55:07] SRE, Performance-Team, Performance Issue: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (RhinosF1) [08:55:28] SRE, Performance-Team, Performance Issue: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (RhinosF1) [08:56:04] SRE, Performance-Team, Performance Issue: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (RhinosF1) p:Triage→Unbreak! There’s a marked spike in errors and a flurr... 
[08:56:11] SRE, Performance-Team, Performance Issue, Wikimedia-Incident: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (RhinosF1) [09:00:45] SRE, Performance-Team, Performance Issue, Wikimedia-Incident: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (RhinosF1) > 07:37:20 PROBLEM - BGP status on... [09:01:13] SRE, Performance-Team, Traffic, Performance Issue, Wikimedia-Incident: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (RhinosF1) [09:01:22] <_joe_> RhinosF1: that task is most likely unrelated, please stop adding information to it [09:01:41] <_joe_> it's about ongoing issues the specific person is having that I suspect have nothing to do with our infrastructure [09:01:43] Stopping [09:02:04] <_joe_> the time correlation did fool me too at first, btw [09:02:41] _joe_: slow to respond might be, I think that can be split off and yes it being at the same time and initially sounding like a connection error happened is confusing [09:05:12] SRE, Performance-Team, Traffic, Performance Issue: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (Joe) p:Unbreak!→Low This task reports continuing issues for... [09:15:05] SRE, Traffic, Chinese-Sites, Wikimedia-Incident: zhwiki met a problem with access - https://phabricator.wikimedia.org/T326495 (RhinosF1) duplicate→Resolved p:Triage→Unbreak! This was resolved, SRE are aware of an issue around this time. The previously linked task was unrelated. [09:15:28] SRE, Traffic, Chinese-Sites, Wikimedia-Incident: zhwiki met a problem with access - https://phabricator.wikimedia.org/T326495 (RhinosF1) >>! In T326495#8507147, @SD_hehua wrote: > Emmm not a duplicate? Yep, 2 issues with slowness around the same time. Wasn’t connected though. [09:16:00] SRE, Traffic, Chinese-Sites, Wikimedia-Incident: 2023-01-08 Wikimedia Connectivity Issues - https://phabricator.wikimedia.org/T326495 (RhinosF1) [09:17:37] SRE, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (RhinosF1) >>! In T301505#8507048, @AlexisJazz wrote: > I just received one trying to look up https://en.wiktionary.org/wiki/stuff. Not reproducible it se... 
[09:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:27:28] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:22] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:55:46] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:03:48] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:05:18] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:49:10] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [13:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:06:42] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:09:50] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [16:52:43] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [17:23:52] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:39:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:44:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:34:40] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:35:36] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:35:50] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [20:35:54] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:36:22] PROBLEM - Host 2001:df2:e500:1:103:102:166:10 is DOWN: PING CRITICAL - Packet loss = 100% [20:36:32] RECOVERY - Host 2001:df2:e500:1:103:102:166:10 is UP: PING OK - Packet loss = 0%, RTA = 241.04 ms [20:36:35] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:36:38] (ProbeDown) firing: (2) Service centrallog1001:6514 has 
failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:36:43] (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:36:47] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:37:08] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:37:08] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:37:08] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:37:08] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:37:14] PROBLEM - Auth DNS on dns5004 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:37:22] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:37:22] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:37:24] PROBLEM - SSH on ganeti5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:37:24] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:37:28] PROBLEM - SSH on prometheus5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:37:28] PROBLEM - SSH on ncredir5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:37:30] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmne [20:37:30] 1.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5019.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:37:30] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp5020.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmne [20:37:30] 7.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:37:30] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp5017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:37:41] PROBLEM - Host cr3-eqsin #page is DOWN: PING CRITICAL - Packet loss = 100% [20:37:42] PROBLEM - Host text-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [20:37:58] PROBLEM - SSH on cp5018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:28] PROBLEM - SSH on lvs5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:28] PROBLEM - NTP peers on dns5003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [20:38:28] PROBLEM - SSH on cp5025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:28] PROBLEM - SSH on cp5026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:32] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5025 is OK: HTTP OK: HTTP/1.1 200 Ok - 48303 bytes in 2.047 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:38:32] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5031 is OK: HTTP OK: HTTP/1.1 200 Ok - 48329 bytes in 2.873 second response time 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:38:34] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5024 is OK: HTTP OK: HTTP/1.1 200 Ok - 48662 bytes in 4.111 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:38:44] RECOVERY - Auth DNS on dns5004 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:38:48] PROBLEM - Host cr2-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [20:38:48] RECOVERY - SSH on ganeti5007 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:52] RECOVERY - SSH on prometheus5001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:53] o/ [20:38:58] RECOVERY - SSH on ncredir5001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:39:10] PROBLEM - Auth DNS on dns5003 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:39:14] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:39:30] PROBLEM - SSH on cp5027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:39:33] RECOVERY - Host cr3-eqsin #page is UP: PING WARNING - Packet loss = 66%, RTA = 239.24 ms [20:39:52] RECOVERY - SSH on cp5025 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:39:54] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:39:54] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp5030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:39:58] RECOVERY - SSH on cp5026 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:40:00] RECOVERY - SSH on lvs5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:40:18] PROBLEM - Wikidough DoT Check -IPv6- on doh5002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [20:40:20] PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:40:24] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [20:40:24] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:40:34] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmne [20:40:35] 2.eqsin.wmnet, cp5021.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:40:38] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5021.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmne [20:40:38] 1.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5024.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:41:40] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5028 is OK: HTTP OK: HTTP/1.1 200 Ok - 48347 bytes in 4.855 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:41:42] RECOVERY - Wikidough DoT Check -IPv6- on doh5002 is OK: TCP OK - 1.160 second response time on 2001:df2:e500:1:103:102:166:5 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [20:41:46] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:41:49] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [20:41:52] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5017 is OK: HTTP OK: HTTP/1.1 200 Ok - 48747 bytes in 2.553 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:41:52] PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:41:52] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [20:41:58] PROBLEM - SSH on durum5002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:42:03] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:42:06] (ProbeDown) firing: (7) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:42:10] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [20:42:12] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:42:14] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:42:26] RECOVERY - SSH on cp5027 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:42:32] PROBLEM - SSH on cp5030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:42:36] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:43:08] PROBLEM - Wikidough DoH Check -IPv4- on doh5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [20:43:14] PROBLEM - NTP peers on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [20:43:22] PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:43:28] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:43:42] RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:43:50] (JobUnavailable) firing: (2) Reduced availability for job pdu_sentry4 in ops@eqsin - 
[20:43:50] PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[20:44:00] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp5024 is OK: HTTP OK: HTTP/1.0 200 OK - 37140 bytes in 2.693 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[20:44:06] PROBLEM - check_trafficserver_log_fifo_notpurge_backend on cp5030 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.27: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[20:44:06] PROBLEM - SSH on cp5021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:44:12] PROBLEM - SSH on cp5019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:44:12] PROBLEM - SSH on ganeti5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:44:15] XioNoX topranks: seems the routers in eqsin are unreachable. maintenance related?
[20:44:26] RECOVERY - Host cr2-eqsin IPv6 is UP: PING WARNING - Packet loss = 66%, RTA = 266.29 ms
[20:44:30] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp5017 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 7.691 second response time https://wikitech.wikimedia.org/wiki/Varnish
[20:44:32] RECOVERY - Wikidough DoH Check -IPv4- on doh5001 is OK: HTTP OK: HTTP/1.1 200 OK - 550 bytes in 2.116 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[20:44:36] RECOVERY - NTP peers on dns5004 is OK: NTP OK: Offset -0.001119 secs https://wikitech.wikimedia.org/wiki/NTP
[20:44:46] hi
[20:44:46] cwhite: check this morning, 13 hours ago
[20:44:50] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:44:54] Similar incident possibly
[20:44:54] RECOVERY - Recursive DNS on 103.102.166.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:44:54] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:44:54] RECOVERY - Recursive DNS on 103.102.166.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:44:54] RECOVERY - SSH on durum5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:45:10] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:45:16] RECOVERY - Auth DNS on dns5003 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:45:18] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:45:20] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5018 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 4.998 second response time https://wikitech.wikimedia.org/wiki/Varnish
[20:45:22] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:45:22] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps/RunBook
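The "NTP peers" recoveries above report the measured clock offset (e.g. "Offset -0.001119 secs"). For illustration only, a toy offset check could look like the sketch below; it uses the third-party ntplib package and made-up thresholds, not the monitoring-plugins check_ntp_peer used in production:

```python
# Illustrative sketch: query a host's NTP offset and apply a critical
# threshold, loosely mirroring what an "NTP peers" offset check reports.
# Hostname and threshold are placeholders.
import ntplib  # third-party package

def ntp_offset_ok(host: str, crit: float = 0.1) -> bool:
    client = ntplib.NTPClient()
    response = client.request(host, version=3, timeout=10)
    print(f"Offset {response.offset:+.6f} secs")
    return abs(response.offset) < crit

if __name__ == "__main__":
    print(ntp_offset_ok("dns5004.wikimedia.org"))
```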
[20:45:24] RECOVERY - check_trafficserver_log_fifo_notpurge_backend on cp5030 is OK: OK: read 8 bytes as expected https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[20:45:26] RECOVERY - SSH on cp5030 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:45:30] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:45:30] RECOVERY - SSH on cp5021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:45:30] RECOVERY - SSH on cp5018 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:45:38] RECOVERY - SSH on cp5019 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:45:38] RECOVERY - SSH on ganeti5006 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:46:00] RECOVERY - Host text-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 260.84 ms
[20:46:02] RECOVERY - NTP peers on dns5003 is OK: NTP OK: Offset 0.001702 secs https://wikitech.wikimedia.org/wiki/NTP
[20:46:04] I ACKed the last remaining page
[20:46:10] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5023 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.469 second response time https://wikitech.wikimedia.org/wiki/Varnish
[20:46:30] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:46:32] * akosiaris around
[20:46:50] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:46:52] RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook
[20:46:52] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5024 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.476 second response time https://wikitech.wikimedia.org/wiki/Varnish
[20:46:52] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp5030 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.501 second response time https://wikitech.wikimedia.org/wiki/Varnish
[20:46:54] * jhathaway around as well
[20:47:08] 15:34:40 <+icinga-wm> PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:47:13] this is the first one
[20:47:29] akosiaris, jhathaway: see -sre too
[20:47:54] RhinosF1: thanks
[20:47:55] so second incident with eqsin today?
[20:47:59] hey
[20:48:10] hi XioNoX
[20:48:22] Looks that way, no idea beyond time and similar alerts on the first though
[20:48:25] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
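The "Varnish HTTP text-frontend" checks above are plain HTTP probes: GET the frontend port with a 10-second timeout and expect a 200, reporting size and response time on success. A toy sketch under those assumptions (host and port are placeholders; the real check_http plugin has many more options):

```python
# Toy sketch of an HTTP health probe like the "Varnish HTTP text-frontend"
# checks: GET / on a given port, 10-second timeout, expect HTTP 200.
import http.client
import socket
import time

def http_frontend_ok(host: str, port: int, timeout: float = 10.0) -> bool:
    start = time.monotonic()
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", "/", headers={"User-Agent": "toy-check"})
        resp = conn.getresponse()
        body = resp.read()
        elapsed = time.monotonic() - start
        print(f"HTTP OK: HTTP/{resp.version / 10:.1f} {resp.status} - "
              f"{len(body)} bytes in {elapsed:.3f} second response time")
        return resp.status == 200
    except (socket.timeout, OSError):
        print(f"CRITICAL - Socket timeout after {timeout:g} seconds")
        return False

if __name__ == "__main__":
    print(http_frontend_ok("cp5023.eqsin.wmnet", 3124))
```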
[20:48:26] looking
[20:48:29] (ProbeDown) resolved: (4) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:48:34] (ProbeDown) resolved: (6) Service text-https:443 has failed probes (http_text-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:49:31] yeah, something's not right with the transport links
[20:49:44] one has been down for a while, and maybe the other one has been misbehaving
[20:49:45] (JobUnavailable) resolved: (2) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:50:08] ok, that explains some of the other Traffic alerts we are getting and seems to match those then
[20:50:32] XioNoX: cr3 was alerting a few times overnight. I think around the time of this morning’s incident cr2 alerted too
[20:50:38] I saw this on the calendar, but not related to cr3 I assume, "cr2-eqsin <-> cr4-ulsfo SingTel circuit maintenance"
[20:51:13] it could be, it's one of the two eqsin-USA links
[20:52:37] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[20:52:52] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[21:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:47:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:52:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
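The recurring tcp_rsyslog_receiver_ip6 ProbeDown alerts above amount to a TCP/TLS reachability probe of the syslog receiver on port 6514. A loose sketch under those assumptions (the hostname's FQDN, timeout, and TLS settings are placeholders, not the blackbox-exporter configuration, and certificate verification is deliberately skipped here):

```python
# Loose sketch of a tcp_rsyslog_receiver-style probe: open a TCP connection to
# the TLS syslog receiver port (6514) and complete a TLS handshake within a
# deadline. The real probe targets the IPv6 address specifically; this sketch
# simply uses whatever address family DNS returns.
import socket
import ssl

def rsyslog_receiver_ok(host: str, port: int = 6514, timeout: float = 5.0) -> bool:
    context = ssl.create_default_context()
    context.check_hostname = False            # reachability sketch only:
    context.verify_mode = ssl.CERT_NONE       # skip certificate verification
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version() is not None  # handshake completed
    except (OSError, ssl.SSLError):
        return False

if __name__ == "__main__":
    print(rsyslog_receiver_ok("centrallog2002.codfw.wmnet"))
```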