[00:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[00:35:37] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[00:49:03] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:00:13] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:02:29] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:05:39] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:50:23] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:12:07] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[02:12:43] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:14:29] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[02:19:47] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:43:33] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:27] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:22:13] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:23:13] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:25:19] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 1.487 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:23:19] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[04:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[04:37:07] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:55:37] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:01:29] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:02:33] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:04:51] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:39:51] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:48:33] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:57:15] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:59:27] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:17:54] <wikibugs>	 (03PS7) 10Ayounsi: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263
[06:17:56] <wikibugs>	 (03PS16) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261
[06:21:04] <wikibugs>	 (03PS1) 10Stang: ukwikibooks: Add NS102 (Рецепт) to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806847 (https://phabricator.wikimedia.org/T310940)
[06:41:13] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:41:54] <Tamzin>	 what time is it? it's "upstream connect error or disconnect/reset before headers. reset reason: overflow" time. fun!
[06:44:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[06:45:45] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[06:48:09] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[06:49:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[06:57:37] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:58:11] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220620T0700)
[07:02:55] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:14:33] <SandraEbele>	 !log killed Oozie wikidata-articleplaceholder_metrics-coord, wikidata-reliability_metrics-coord, and wikidata-specialentitydata_metrics-coord jobs.
[07:14:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:52] <SandraEbele>	 !log Started Airflow 3 Wikidata metrics jobs (Articleplaceholder, Reliability and SpecialEntityData metrics).
[07:14:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:51] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:41:01] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:51:05] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:51:24] <wikibugs>	 (03CR) 10DCausse: [cirrus] Fix typo in config var (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801792 (owner: 10DCausse)
[07:54:11] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: add ptwiki draftquality isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/806394 (https://phabricator.wikimedia.org/T310704) (owner: 10Kevin Bazira)
[07:55:12] <wikibugs>	 (03CR) 10Awight: "Almost there!  Still waiting for the train deployment for extensions." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804609 (https://phabricator.wikimedia.org/T310684) (owner: 10Awight)
[07:57:08] <wikibugs>	 (03CR) 10WMDE-Fisch: [C: 03+1] [DNM] Drop deprecated feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804609 (https://phabricator.wikimedia.org/T310684) (owner: 10Awight)
[07:58:53] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:05:21] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:09:46] <wikibugs>	 (03PS1) 10Ayounsi: Fix typoes found by Junoser [homer/public] - 10https://gerrit.wikimedia.org/r/806857
[08:11:53] <wikibugs>	 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10elukey)
[08:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:23:07] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[08:35:53] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[08:42:05] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:42:59] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:47] <wikibugs>	 (03CR) 10DCausse: [cirrus] Add a custom profile for the wikibase language selector (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[08:52:27] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:58:41] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:13:47] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_mlserve:prod.service,swift_dispersion_stats.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:47] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:29:31] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:34:39] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:36:55] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:37:41] <vgutierrez>	 hmmm that smells like the Apache issue...
[09:54:19] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:55:56] <wikibugs>	 (03PS1) 10Jelto: DHCP: remove gitlab2001 [puppet] - 10https://gerrit.wikimedia.org/r/806862 (https://phabricator.wikimedia.org/T307142)
[09:55:58] <wikibugs>	 (03PS1) 10Jelto: gitlab/acme_chief: remove gitlab2001 from list of (passive) hosts [puppet] - 10https://gerrit.wikimedia.org/r/806863 (https://phabricator.wikimedia.org/T307142)
[09:56:00] <wikibugs>	 (03PS1) 10Jelto: site: remove gitlab2001 [puppet] - 10https://gerrit.wikimedia.org/r/806864 (https://phabricator.wikimedia.org/T307142)
[10:14:07] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [cirrus] Fix typo in config var (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801792 (owner: 10DCausse)
[10:15:41] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:16:57] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:18:13] <urbanecm>	 jouncebot: nowandnext
[10:18:13] <jouncebot>	 For the next 20 hour(s) and 41 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220620T0700)
[10:18:14] <jouncebot>	 In 20 hour(s) and 41 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T0700)
[10:18:47] <urbanecm>	 okay, i should've realized monday is a holiday. well, rescheduling my "deploy throttle rule" reminder for tomorrow
[10:20:13] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:21:25] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:30:10] <wikibugs>	 (03CR) 10Base: [C: 03+1] ukwikibooks: Add NS102 (Рецепт) to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806847 (https://phabricator.wikimedia.org/T310940) (owner: 10Stang)
[10:32:54] <wikibugs>	 (03PS1) 10JMeybohm: Add helm-state-metrics helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/806870 (https://phabricator.wikimedia.org/T310714)
[10:32:56] <wikibugs>	 (03PS1) 10JMeybohm: Deploy helm-state-metrics to staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714)
[10:35:13] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Only set WikibaseCirrusSearch settings if wmg globals are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806872
[10:35:15] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Directly set WikibaseCirrusSearch settings in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806873
[10:35:17] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Remove unused assignments from SearchSettingsForWikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806874
[10:35:39] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [cirrus] Add a custom profile for the wikibase language selector (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[10:36:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Deploy helm-state-metrics to staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm)
[10:43:54] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Enable Lexeme Lua access everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806877 (https://phabricator.wikimedia.org/T309593)
[10:45:23] <wikibugs>	 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10MoritzMuehlenhoff) > Due to a dependency on python-yaml in https://gerrit.wikimedia.org/r/admin/repos/operations/debs/cassandra-tools-wmf, we'd need to create a new package that depends on python...
[10:58:44] <wikibugs>	 (03PS1) 10JMeybohm: Add helm-state-metrics image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/806879 (https://phabricator.wikimedia.org/T310714)
[11:04:10] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:04:32] <wikibugs>	 (03Restored) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie)
[11:10:18] <wikibugs>	 (03PS3) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711)
[11:11:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie)
[11:14:38] <wikibugs>	 (03PS1) 10JMeybohm: Initial commit of helm-state-metrics [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714)
[11:14:41] <wikibugs>	 (03PS1) 10JMeybohm: Add vendor dir [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806889 (https://phabricator.wikimedia.org/T310714)
[11:16:21] <wikibugs>	 (03PS4) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711)
[11:17:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie)
[11:18:31] <wikibugs>	 (03PS5) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711)
[11:19:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie)
[11:29:52] <wikibugs>	 (03PS6) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711)
[11:31:52] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:34:10] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:45:40] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:00:19] <wikibugs>	 (03CR) 10Seddon: [C: 03+1] Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie)
[12:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:15:12] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:17:30] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[12:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[12:46:48] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:56:22] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:56:29] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Thanks! That's awesome!" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[12:57:16] <icinga-wm>	 PROBLEM - Apache HTTP on mw1393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:57:16] <icinga-wm>	 PROBLEM - Apache HTTP on mw1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:57:16] <icinga-wm>	 PROBLEM - Apache HTTP on mw1451 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:57:18] <jinxer-wm>	 (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:57:26] <icinga-wm>	 PROBLEM - Apache HTTP on mw1414 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:57:26] <icinga-wm>	 PROBLEM - Apache HTTP on mw1387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:57:26] <icinga-wm>	 PROBLEM - Apache HTTP on mw1456 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:57:28] <icinga-wm>	 PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:57:30] <icinga-wm>	 PROBLEM - Apache HTTP on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:57:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:57:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:57:34] <icinga-wm>	 PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:57:58] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[12:58:12] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[12:58:18] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:58:35] <jinxer-wm>	 (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[12:58:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[12:59:24] <icinga-wm>	 RECOVERY - Apache HTTP on mw1393 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:59:26] <icinga-wm>	 RECOVERY - Apache HTTP on mw1401 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:59:26] <icinga-wm>	 RECOVERY - Apache HTTP on mw1451 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:59:32] <Amir1>	 here
[12:59:36] <icinga-wm>	 RECOVERY - Apache HTTP on mw1414 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:59:36] <icinga-wm>	 RECOVERY - Apache HTTP on mw1456 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:59:36] <icinga-wm>	 RECOVERY - Apache HTTP on mw1387 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:59:38] <icinga-wm>	 RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:59:40] <icinga-wm>	 RECOVERY - Apache HTTP on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:59:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:59:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:59:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:00:16] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[13:00:32] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[13:01:08] <jayme>	 o/
[13:02:18] <jinxer-wm>	 (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:02:56] <slyngs>	 You fixed it already?
[13:03:18] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:03:35] <jinxer-wm>	 (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[13:03:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[13:03:52] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:04:08] <Emperor>	 it's a holiday, but I got a page, are we OK?
[13:04:33] <apergos>	 see the other channel
[13:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:32:13] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Scheduled for tomorrow per announcement: https://www.wikidata.org/w/index.php?title=Wikidata_talk:Lexicographical_data&oldid=1663922370#Yo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806877 (https://phabricator.wikimedia.org/T309593) (owner: 10Lucas Werkmeister (WMDE))
[13:38:55] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "That looks great!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans)
[13:43:02] <wikibugs>	 (03PS1) 10Ayounsi: Delete requirements.txt [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806908 (https://phabricator.wikimedia.org/T310745)
[14:05:06] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:58:05] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): [cirrus] Add a custom profile for the wikibase language selector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[14:58:48] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:07:48] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:08:23] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Should be ready to deploy at any time, as far as I understand, with no user-visible change yet – though starting with wmf.17, I think you " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[15:09:00] <wikibugs>	 (03PS2) 10Itamar Givon: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803494 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE))
[15:09:18] <wikibugs>	 (03PS2) 10Itamar Givon: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803495 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE))
[15:10:24] <wikibugs>	 (03PS2) 10Itamar Givon: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803496 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE))
[15:11:25] <wikibugs>	 (03PS2) 10Itamar Givon: Separate wmgWikibaseTermboxEnabled and wmgWikibaseSSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803497 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE))
[15:12:24] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:14:43] <wikibugs>	 (03PS1) 10Stang: fawiktionary: Enable SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806921 (https://phabricator.wikimedia.org/T308505)
[15:18:58] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:21:16] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:24:13] <wikibugs>	 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10JoKalliauer)
[15:24:47] <wikibugs>	 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10JoKalliauer) 05Open→03Stalled
[15:26:26] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/806870 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm)
[15:43:40] <wikibugs>	 (03CR) 10Itamar Givon: [C: 03+1] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803494 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE))
[15:43:47] <wikibugs>	 (03CR) 10Itamar Givon: [C: 03+1] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803495 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE))
[15:43:53] <wikibugs>	 (03CR) 10Itamar Givon: [C: 03+1] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803496 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE))
[15:44:09] <wikibugs>	 (03CR) 10Itamar Givon: [C: 03+1] Separate wmgWikibaseTermboxEnabled and wmgWikibaseSSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803497 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE))
[16:07:34] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10Majavah) 05Stalled→03Open
[16:09:38] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Configure wbsearchentities profile parameter on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806930 (https://phabricator.wikimedia.org/T307869)
[16:09:40] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Configure wbsearchentities profile parameter on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806931 (https://phabricator.wikimedia.org/T307869)
[16:09:42] <wikibugs>	 (03CR) 10Ori: "I ran this through puppet compiler and verified it is a no-op on prod. Then I cherry-picked this on the Beta Cluster and verified it works" [puppet] - 10https://gerrit.wikimedia.org/r/806488 (owner: 10Ori)
[16:09:50] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:15:06] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:48] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:24:34] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 22.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[16:24:54] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 43.36 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[16:25:28] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 31.45 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[16:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[16:29:30] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 97.45 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[16:30:06] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 102.8 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[16:31:30] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 91.14 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[16:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[16:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:56:07] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] varnish: sort query parameters on the Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/806488 (owner: 10Ori)
[17:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:12:18] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:14:36] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:17:09] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10TheresNoTime)
[17:42:26] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[17:53:50] <icinga-wm>	 PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 870708 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[18:07:54] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[18:40:44] <wikibugs>	 (03PS1) 10Majavah: api: support sha256 checksums [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806939
[18:40:46] <wikibugs>	 (03PS1) 10Majavah: api: Offer JSON for metadata if requested [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806940
[18:44:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] api: support sha256 checksums [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806939 (owner: 10Majavah)
[18:45:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] api: Offer JSON for metadata if requested [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806940 (owner: 10Majavah)
[19:37:52] <wikibugs>	 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10Papaul)
[19:41:44] <wikibugs>	 (03PS1) 10Stang: zhwikiquote: Disable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806941 (https://phabricator.wikimedia.org/T311017)
[19:52:22] <wikibugs>	 (03CR) 10Ori: api: support sha256 checksums (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806939 (owner: 10Majavah)
[20:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:25:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:26:41] <jinxer-wm>	 (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager  - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[20:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[20:54:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:05:46] <icinga-wm>	 PROBLEM - NFS on clouddumps1002 is CRITICAL: connect to address 208.80.154.71 and port 2049: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore
[21:05:46] <icinga-wm>	 PROBLEM - HTTP on clouddumps1002 is CRITICAL: connect to address 208.80.154.71 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29
[21:06:32] <icinga-wm>	 PROBLEM - HTTP on clouddumps1001 is CRITICAL: connect to address 208.80.154.142 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29
[21:08:03] <icinga-wm>	 ACKNOWLEDGEMENT - HTTP on clouddumps1001 is CRITICAL: connect to address 208.80.154.142 and port 80: Connection refused Andrew Bogott Downtime expired, these are a work in progress https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29
[21:08:07] <icinga-wm>	 ACKNOWLEDGEMENT - HTTP on clouddumps1002 is CRITICAL: connect to address 208.80.154.71 and port 80: Connection refused Andrew Bogott Downtime expired, these are a work in progress https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29
[21:08:08] <icinga-wm>	 ACKNOWLEDGEMENT - NFS on clouddumps1002 is CRITICAL: connect to address 208.80.154.71 and port 2049: Connection refused Andrew Bogott Downtime expired, these are a work in progress https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore
[21:16:42] <wikibugs>	 (03PS9) 10Stang: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692)
[21:16:44] <wikibugs>	 (03PS1) 10Stang: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806944 (https://phabricator.wikimedia.org/T305692)
[21:17:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806944 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang)
[21:19:29] <wikibugs>	 (03PS2) 10Stang: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806944 (https://phabricator.wikimedia.org/T305692)
[21:21:06] <wikibugs>	 (03CR) 10Stang: Add language fallback support for wmgSiteLogoVariants (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang)
[21:27:12] <wikibugs>	 (03PS10) 10Stang: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692)
[21:35:50] <wikibugs>	 (03Abandoned) 10Stang: zhwiki: Use wmgSiteLogoVariants to simplify logo variant settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802133 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[21:44:32] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:52:06] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:17:00] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:31:41] <wikibugs>	 (03PS1) 10Stang: zhwikibooks: Add zh-hant variant logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806947 (https://phabricator.wikimedia.org/T308620)
[23:47:02] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook