[00:06:52] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:38] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:17:48] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:23:00] upstream connect error or disconnect/reset before headers. reset reason: overflow [01:23:22] on refresh the page loaded, but veeeery sloowly [01:24:06] seeing reports from others in #wikipedia-en and on Discord [01:24:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [01:26:37] AntiComposite: I can confirm the message as well but I can't seem to reproduce it now [01:26:47] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=13 [01:26:49] oh oh [01:26:58] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:27:03] PROBLEM - Varnish has reduced HTTP availability #page on alert1001 is CRITICAL: job=varnish-text https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=3 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [01:27:14] AntiComposite: nvm :) [01:27:34] yeah it's spotty [01:29:33] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=13 [01:29:46] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:29:49] RECOVERY - Varnish has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=3 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [01:29:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [01:32:29] from here it looks like things have recovered [01:32:44] yes thanks for alerting us [01:36:43] My Wikipedia! Down! [01:39:37] Might still be ongoing [01:40:15] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [05:00:48] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [06:02:36] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:10:18] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:59:50] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2022-04-11 07:59:19 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/ [09:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [10:17:34] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:38] (03PS1) 10Jcrespo: Use the shlex.quote method to escape hosts and paths [software/transferpy] - 10https://gerrit.wikimedia.org/r/770089 (https://phabricator.wikimedia.org/T256749) [12:16:59] (03CR) 10jerkins-bot: [V: 04-1] Use the shlex.quote method to escape hosts and paths [software/transferpy] - 10https://gerrit.wikimedia.org/r/770089 (https://phabricator.wikimedia.org/T256749) (owner: 10Jcrespo) [12:20:04] (03PS2) 10Jcrespo: Use the shlex.quote method to escape hosts and paths [software/transferpy] - 10https://gerrit.wikimedia.org/r/770089 (https://phabricator.wikimedia.org/T256749) [12:20:46] (03CR) 10jerkins-bot: [V: 04-1] Use the shlex.quote method to escape hosts and paths [software/transferpy] - 10https://gerrit.wikimedia.org/r/770089 (https://phabricator.wikimedia.org/T256749) (owner: 10Jcrespo) [12:22:13] (03PS3) 10Jcrespo: Use the shlex.quote method to escape hosts and paths [software/transferpy] - 10https://gerrit.wikimedia.org/r/770089 (https://phabricator.wikimedia.org/T256749) [12:26:52] (03CR) 10Jcrespo: "recheck" [software/transferpy] - 10https://gerrit.wikimedia.org/r/674577 (https://phabricator.wikimedia.org/T268258) (owner: 10Palak199) [13:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [14:00:39] (NodeTextfileStale) firing: Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [14:05:39] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [16:04:38] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:28:20] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:40:12] (03PS1) 10JHathaway: WIP vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/770099 [17:06:22] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:05:54] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org [18:07:08] PROBLEM - SSH on thumbor2003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:21:18] (03PS1) 10Zabe: wikitech_private: stop writing to wmf* constants [puppet] - 10https://gerrit.wikimedia.org/r/770102 (https://phabricator.wikimedia.org/T45956) [18:21:58] (03PS2) 10Zabe: wikitech_private: stop writing to wmf* constants [puppet] - 10https://gerrit.wikimedia.org/r/770102 (https://phabricator.wikimedia.org/T45956) [18:31:52] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:10:22] RECOVERY - SSH on thumbor2003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:40:23] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:05:54] (NodeTextfileStale) firing: (2) Stale textfile for cloudnet2002-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org