[00:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:26:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [00:35:37] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [00:49:03] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:00:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:02:29] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:05:39] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:50:23] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:12:07] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [02:12:43] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:14:29] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [02:19:47] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:43:33] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:02:27] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:22:13] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:23:13] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:25:19] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 1.487 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:23:19] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:26:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [04:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [04:37:07] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:55:37] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:01:29] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:02:33] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 5 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:04:51] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:39:51] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:48:33] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:57:15] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:59:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:17:54] (03PS7) 10Ayounsi: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 [06:17:56] (03PS16) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [06:21:04] (03PS1) 10Stang: ukwikibooks: Add NS102 (Рецепт) to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806847 (https://phabricator.wikimedia.org/T310940) [06:41:13] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:41:54] what time is it? it's "upstream connect error or disconnect/reset before headers. reset reason: overflow" time. fun! [06:44:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [06:45:45] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:48:09] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:49:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [06:57:37] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:58:11] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220620T0700) [07:02:55] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:14:33] !log killed Oozie wikidata-articleplaceholder_metrics-coord, wikidata-reliability_metrics-coord, and wikidata-specialentitydata_metrics-coord jobs. [07:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:52] !log Started Airflow 3 Wikidata metrics jobs (Articleplaceholder, Reliability and SpecialEntityData metrics). [07:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:51] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:41:01] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:51:05] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:51:24] (03CR) 10DCausse: [cirrus] Fix typo in config var (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801792 (owner: 10DCausse) [07:54:11] (03CR) 10Elukey: [C: 03+2] ml-services: add ptwiki draftquality isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/806394 (https://phabricator.wikimedia.org/T310704) (owner: 10Kevin Bazira) [07:55:12] (03CR) 10Awight: "Almost there! Still waiting for the train deployment for extensions." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804609 (https://phabricator.wikimedia.org/T310684) (owner: 10Awight) [07:57:08] (03CR) 10WMDE-Fisch: [C: 03+1] [DNM] Drop deprecated feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804609 (https://phabricator.wikimedia.org/T310684) (owner: 10Awight) [07:58:53] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:05:21] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:09:46] (03PS1) 10Ayounsi: Fix typoes found by Junoser [homer/public] - 10https://gerrit.wikimedia.org/r/806857 [08:11:53] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10elukey) [08:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:23:07] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:26:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [08:35:53] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [08:42:05] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:42:59] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:47] (03CR) 10DCausse: [cirrus] Add a custom profile for the wikibase language selector (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [08:52:27] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:58:41] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:13:47] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_mlserve:prod.service,swift_dispersion_stats.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:47] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:29:31] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:34:39] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:36:55] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:37:41] hmmm that smells like the Apache issue... [09:54:19] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:55:56] (03PS1) 10Jelto: DHCP: remove gitlab2001 [puppet] - 10https://gerrit.wikimedia.org/r/806862 (https://phabricator.wikimedia.org/T307142) [09:55:58] (03PS1) 10Jelto: gitlab/acme_chief: remove gitlab2001 from list of (passive) hosts [puppet] - 10https://gerrit.wikimedia.org/r/806863 (https://phabricator.wikimedia.org/T307142) [09:56:00] (03PS1) 10Jelto: site: remove gitlab2001 [puppet] - 10https://gerrit.wikimedia.org/r/806864 (https://phabricator.wikimedia.org/T307142) [10:14:07] (03CR) 10Lucas Werkmeister (WMDE): [cirrus] Fix typo in config var (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801792 (owner: 10DCausse) [10:15:41] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:16:57] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:18:13] jouncebot: nowandnext [10:18:13] For the next 20 hour(s) and 41 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220620T0700) [10:18:14] In 20 hour(s) and 41 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220621T0700) [10:18:47] okay, i should've realized monday is a holiday. well, rescheduling my "deploy throttle rule" reminder for tomorrow [10:20:13] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:21:25] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:10] (03CR) 10Base: [C: 03+1] ukwikibooks: Add NS102 (Рецепт) to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806847 (https://phabricator.wikimedia.org/T310940) (owner: 10Stang) [10:32:54] (03PS1) 10JMeybohm: Add helm-state-metrics helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/806870 (https://phabricator.wikimedia.org/T310714) [10:32:56] (03PS1) 10JMeybohm: Deploy helm-state-metrics to staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) [10:35:13] (03PS1) 10Lucas Werkmeister (WMDE): Only set WikibaseCirrusSearch settings if wmg globals are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806872 [10:35:15] (03PS1) 10Lucas Werkmeister (WMDE): Directly set WikibaseCirrusSearch settings in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806873 [10:35:17] (03PS1) 10Lucas Werkmeister (WMDE): Remove unused assignments from SearchSettingsForWikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806874 [10:35:39] (03CR) 10Lucas Werkmeister (WMDE): [cirrus] Add a custom profile for the wikibase language selector (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [10:36:34] (03CR) 10CI reject: [V: 04-1] Deploy helm-state-metrics to staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/806871 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [10:43:54] (03PS1) 10Lucas Werkmeister (WMDE): Enable Lexeme Lua access everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806877 (https://phabricator.wikimedia.org/T309593) [10:45:23] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10MoritzMuehlenhoff) > Due to a dependency on python-yaml in https://gerrit.wikimedia.org/r/admin/repos/operations/debs/cassandra-tools-wmf, we'd need to create a new package that depends on python... [10:58:44] (03PS1) 10JMeybohm: Add helm-state-metrics image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/806879 (https://phabricator.wikimedia.org/T310714) [11:04:10] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:32] (03Restored) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie) [11:10:18] (03PS3) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) [11:11:07] (03CR) 10CI reject: [V: 04-1] Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie) [11:14:38] (03PS1) 10JMeybohm: Initial commit of helm-state-metrics [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806888 (https://phabricator.wikimedia.org/T310714) [11:14:41] (03PS1) 10JMeybohm: Add vendor dir [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/806889 (https://phabricator.wikimedia.org/T310714) [11:16:21] (03PS4) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) [11:17:19] (03CR) 10CI reject: [V: 04-1] Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie) [11:18:31] (03PS5) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) [11:19:29] (03CR) 10CI reject: [V: 04-1] Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie) [11:29:52] (03PS6) 10Matthias Mullie: Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) [11:31:52] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:34:10] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:45:40] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:00:19] (03CR) 10Seddon: [C: 03+1] Add ImageSuggestions to extension-list and config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766615 (https://phabricator.wikimedia.org/T302711) (owner: 10Matthias Mullie) [12:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:15:12] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:17:30] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:26:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [12:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [12:46:48] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:56:22] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:56:29] (03CR) 10Ayounsi: [C: 03+1] "Thanks! That's awesome!" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [12:57:16] PROBLEM - Apache HTTP on mw1393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:57:16] PROBLEM - Apache HTTP on mw1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:57:16] PROBLEM - Apache HTTP on mw1451 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:57:18] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:57:26] PROBLEM - Apache HTTP on mw1414 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:57:26] PROBLEM - Apache HTTP on mw1387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:57:26] PROBLEM - Apache HTTP on mw1456 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:57:28] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:57:30] PROBLEM - Apache HTTP on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:57:34] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:57:34] PROBLEM - Apache HTTP on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:57:34] PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:57:58] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:58:12] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [12:58:18] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:58:35] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [12:58:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [12:59:24] RECOVERY - Apache HTTP on mw1393 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:59:26] RECOVERY - Apache HTTP on mw1401 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:59:26] RECOVERY - Apache HTTP on mw1451 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:59:32] here [12:59:36] RECOVERY - Apache HTTP on mw1414 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:59:36] RECOVERY - Apache HTTP on mw1456 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:59:36] RECOVERY - Apache HTTP on mw1387 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:59:38] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:59:40] RECOVERY - Apache HTTP on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:59:44] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:59:44] RECOVERY - Apache HTTP on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:59:44] RECOVERY - Apache HTTP on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:00:16] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:00:32] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [13:01:08] o/ [13:02:18] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:02:56] You fixed it already? [13:03:18] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:03:35] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [13:03:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [13:03:52] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:04:08] it's a holiday, but I got a page, are we OK? [13:04:33] see the other channel [13:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:32:13] (03CR) 10Lucas Werkmeister (WMDE): "Scheduled for tomorrow per announcement: https://www.wikidata.org/w/index.php?title=Wikidata_talk:Lexicographical_data&oldid=1663922370#Yo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806877 (https://phabricator.wikimedia.org/T309593) (owner: 10Lucas Werkmeister (WMDE)) [13:38:55] (03CR) 10Ayounsi: [C: 03+1] "That looks great!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [13:43:02] (03PS1) 10Ayounsi: Delete requirements.txt [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806908 (https://phabricator.wikimedia.org/T310745) [14:05:06] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:58:05] (03PS3) 10Lucas Werkmeister (WMDE): [cirrus] Add a custom profile for the wikibase language selector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [14:58:48] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:07:48] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:08:23] (03CR) 10Lucas Werkmeister (WMDE): "Should be ready to deploy at any time, as far as I understand, with no user-visible change yet – though starting with wmf.17, I think you " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse) [15:09:00] (03PS2) 10Itamar Givon: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803494 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [15:09:18] (03PS2) 10Itamar Givon: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803495 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [15:10:24] (03PS2) 10Itamar Givon: Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803496 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [15:11:25] (03PS2) 10Itamar Givon: Separate wmgWikibaseTermboxEnabled and wmgWikibaseSSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803497 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [15:12:24] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:14:43] (03PS1) 10Stang: fawiktionary: Enable SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806921 (https://phabricator.wikimedia.org/T308505) [15:18:58] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:21:16] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:24:13] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10JoKalliauer) [15:24:47] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10JoKalliauer) 05Open→03Stalled [15:26:26] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/806870 (https://phabricator.wikimedia.org/T310714) (owner: 10JMeybohm) [15:43:40] (03CR) 10Itamar Givon: [C: 03+1] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803494 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [15:43:47] (03CR) 10Itamar Givon: [C: 03+1] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803495 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [15:43:53] (03CR) 10Itamar Givon: [C: 03+1] Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803496 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [15:44:09] (03CR) 10Itamar Givon: [C: 03+1] Separate wmgWikibaseTermboxEnabled and wmgWikibaseSSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803497 (https://phabricator.wikimedia.org/T304328) (owner: 10Lucas Werkmeister (WMDE)) [16:07:34] 10SRE, 10Infrastructure-Foundations, 10netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10Majavah) 05Stalled→03Open [16:09:38] (03PS1) 10Lucas Werkmeister (WMDE): Configure wbsearchentities profile parameter on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806930 (https://phabricator.wikimedia.org/T307869) [16:09:40] (03PS1) 10Lucas Werkmeister (WMDE): Configure wbsearchentities profile parameter on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806931 (https://phabricator.wikimedia.org/T307869) [16:09:42] (03CR) 10Ori: "I ran this through puppet compiler and verified it is a no-op on prod. Then I cherry-picked this on the Beta Cluster and verified it works" [puppet] - 10https://gerrit.wikimedia.org/r/806488 (owner: 10Ori) [16:09:50] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:15:06] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:24:34] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 22.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [16:24:54] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 43.36 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [16:25:28] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 31.45 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [16:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:26:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [16:29:30] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 97.45 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [16:30:06] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 102.8 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [16:31:30] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 91.14 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [16:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [16:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:56:07] (03CR) 10CDanis: [C: 03+1] varnish: sort query parameters on the Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/806488 (owner: 10Ori) [17:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:12:18] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 4 day(s) (Sat 25 Jun 2022 07:55:09 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:14:36] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 24 Aug 2022 07:48:40 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:17:09] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10TheresNoTime) [17:42:26] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:53:50] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 870708 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops [18:07:54] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:40:44] (03PS1) 10Majavah: api: support sha256 checksums [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806939 [18:40:46] (03PS1) 10Majavah: api: Offer JSON for metadata if requested [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806940 [18:44:48] (03CR) 10CI reject: [V: 04-1] api: support sha256 checksums [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806939 (owner: 10Majavah) [18:45:54] (03CR) 10CI reject: [V: 04-1] api: Offer JSON for metadata if requested [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806940 (owner: 10Majavah) [19:37:52] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10Papaul) [19:41:44] (03PS1) 10Stang: zhwikiquote: Disable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806941 (https://phabricator.wikimedia.org/T311017) [19:52:22] (03CR) 10Ori: api: support sha256 checksums (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806939 (owner: 10Majavah) [20:12:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:25:42] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:26:41] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [20:36:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [20:54:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:05:42] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:05:46] PROBLEM - NFS on clouddumps1002 is CRITICAL: connect to address 208.80.154.71 and port 2049: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [21:05:46] PROBLEM - HTTP on clouddumps1002 is CRITICAL: connect to address 208.80.154.71 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29 [21:06:32] PROBLEM - HTTP on clouddumps1001 is CRITICAL: connect to address 208.80.154.142 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29 [21:08:03] ACKNOWLEDGEMENT - HTTP on clouddumps1001 is CRITICAL: connect to address 208.80.154.142 and port 80: Connection refused Andrew Bogott Downtime expired, these are a work in progress https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29 [21:08:07] ACKNOWLEDGEMENT - HTTP on clouddumps1002 is CRITICAL: connect to address 208.80.154.71 and port 80: Connection refused Andrew Bogott Downtime expired, these are a work in progress https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29 [21:08:08] ACKNOWLEDGEMENT - NFS on clouddumps1002 is CRITICAL: connect to address 208.80.154.71 and port 2049: Connection refused Andrew Bogott Downtime expired, these are a work in progress https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [21:16:42] (03PS9) 10Stang: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) [21:16:44] (03PS1) 10Stang: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806944 (https://phabricator.wikimedia.org/T305692) [21:17:40] (03CR) 10CI reject: [V: 04-1] Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806944 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang) [21:19:29] (03PS2) 10Stang: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806944 (https://phabricator.wikimedia.org/T305692) [21:21:06] (03CR) 10Stang: Add language fallback support for wmgSiteLogoVariants (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) (owner: 10Stang) [21:27:12] (03PS10) 10Stang: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799415 (https://phabricator.wikimedia.org/T305692) [21:35:50] (03Abandoned) 10Stang: zhwiki: Use wmgSiteLogoVariants to simplify logo variant settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802133 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [21:44:32] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:52:06] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:00] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:41] (03PS1) 10Stang: zhwikibooks: Add zh-hant variant logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806947 (https://phabricator.wikimedia.org/T308620) [23:47:02] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook