[00:00:13] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:03:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T321312)', diff saved to https://phabricator.wikimedia.org/P35965 and previous config saved to /var/cache/conftool/dbconfig/20221022-000300-ladsgroup.json
[00:03:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[00:03:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[00:14:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[00:19:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[00:49:03] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:55:12] (03Abandoned) 10Brian Wolff: Customize article.html for better bidi support [puppet] - 10https://gerrit.wikimedia.org/r/543252 (https://phabricator.wikimedia.org/T235458) (owner: 10Brian Wolff)
[01:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[01:37:45] (JobUnavailable) firing: (5) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (6) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:55:55] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[01:57:55] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:44:09] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:56:05] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:13:50] upstream connect error or disconnect/reset before headers. reset reason: overflow
[03:14:18] (ProbeDown) firing: (3) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:14:23] I got a plain 503 Service Unavailable
[03:14:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:14:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:15:18] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:15:34] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:16:17] (PHPFPMTooBusy) firing: (2) Not enough idle php7.4-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:16:21] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[03:16:33] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[03:16:51] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[03:16:55] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb6_443: Servers cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:17:05] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:17:09] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2313.codfw.wmnet, mw2271.codfw.wmnet, mw2338.codfw.wmnet, mw2409.codfw.wmnet, mw2415.codfw.wmnet, mw2273.codfw.wmnet, mw2274.codfw.wmnet, mw2392.codfw.wmnet, mw2333.codfw.wmnet, mw2379.codfw.wmnet, mw2311.codfw.wmnet, mw2312.codfw.wmnet, mw2375.codfw.wmnet, mw2371.codfw.wmnet, mw2378.codfw.wmnet, mw2413.codfw.wmnet, mw
[03:17:09] fw.wmnet, mw2329.codfw.wmnet, mw2325.codfw.wmnet, mw2393.codfw.wmnet, mw2389.codfw.wmnet, mw2314.codfw.wmnet, mw2386.codfw.wmnet, mw2275.codfw.wmnet, mw2361.codfw.wmnet, mw2369.codfw.wmnet, mw2269.codfw.wmnet, mw2406.codfw.wmnet, mw2408.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, mw2414.codfw.wmnet, mw2373.codfw.wmnet, mw2270.codfw.wmnet, mw2335.codfw.wmnet, mw2339.codfw.wmnet, mw2272.codfw.wmnet, mw2377.codfw.wmnet, mw2385.codfw
[03:17:09] mw2331.codfw.wmnet, mw2277.codfw.wmnet, mw2384.codfw.wmnet, mw2388.codfw.wmnet, mw2337.codfw.wmnet, mw2310.codfw.wmnet, mw2307.codfw.wmnet, mw2380.codfw.wmnet, mw2383.codfw.wmnet, mw233 https://wikitech.wikimedia.org/wiki/PyBal
[03:17:19] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.6575 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[03:17:37] oof
[03:18:23] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is CRITICAL: 0.5152 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[03:18:37] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[03:19:09] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:19:18] (ProbeDown) resolved: (23) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:19:23] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[03:19:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:20:01] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:20:09] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:20:10] There were some HTTP 503 errors, right?
[03:20:18] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:20:27] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[03:20:31] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[03:20:34] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:21:17] (PHPFPMTooBusy) resolved: (2) Not enough idle php7.4-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:21:23] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=
[03:21:35] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:21:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:22:29] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=
[03:22:40] (PHPFPMTooBusy) firing: (2) Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:23:05] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[03:23:25] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:23:31] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[03:23:33] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[03:24:23] PROBLEM - SSH on db1116.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:25:05] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[03:25:17] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:25:18] (ProbeDown) resolved: (15) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:25:39] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[03:25:49] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:26:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[03:26:41] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[03:27:27] (PHPFPMTooBusy) resolved: (2) Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:33:25] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICAL: Test Suggest a source title to use for translation returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[03:35:27] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[03:39:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[04:00:13] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:32:31] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[04:34:37] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:57:23] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[04:59:29] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[05:05:51] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:14:33] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:16:31] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[05:26:23] RECOVERY - SSH on db1116.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221022T0700)
[07:56:28] (03PS1) 10Stang: plwikimedia: Enable VisualEditor by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845860 (https://phabricator.wikimedia.org/T321308)
[08:00:13] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:08:55] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:24:45] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[08:28:53] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[09:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:30:43] PROBLEM - SSH on db1116.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:00:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:33] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:19:41] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:19:53] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:27:31] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:27:41] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:27:51] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:31:37] RECOVERY - SSH on db1116.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:04:23] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:20:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:20:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:25:01] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.809 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:26:13] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 6.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:46:33] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:48:37] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 9.381 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:00:13] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:01:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:01:43] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: mailman3: First/Last name should not be mandatory fields - https://phabricator.wikimedia.org/T312020 (10MarcoAurelio) Filed upstream at: https://gitlab.com/mailman/mailman/-/issues/1034
[12:02:03] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:04:33] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[12:09:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:10:21] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.895 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:10:47] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[13:05:59] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:12:07] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.26 ms
[13:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[13:41:57] (03PS1) 10Sohom Datta: Fix floating footer and wikieditor UI issue. [extensions/ProofreadPage] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845751 (https://phabricator.wikimedia.org/T321344)
[14:00:35] (03PS3) 10Majavah: openstack: encapi: reformat with black [puppet] - 10https://gerrit.wikimedia.org/r/845623
[14:00:37] (03PS1) 10Majavah: openstack: wmcs-enc-cli: explicitely set accept/content-type as yaml [puppet] - 10https://gerrit.wikimedia.org/r/845868 (https://phabricator.wikimedia.org/T318503)
[14:00:39] (03PS1) 10Majavah: openstack: encapi: support returning data in JSON [puppet] - 10https://gerrit.wikimedia.org/r/845869 (https://phabricator.wikimedia.org/T318503)
[14:17:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:17:39] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[14:18:13] PROBLEM - SSH on mw1332.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:18:29] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:19:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.881 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:19:45] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[14:20:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 6.634 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:42:31] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:19:07] RECOVERY - SSH on mw1332.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:19:53] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms
[15:40:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[15:58:39] (03PS4) 10Sohom Datta: Enable source links on Translation ns on enwikisource and thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843593 (https://phabricator.wikimedia.org/T53980)
[16:00:13] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:05:25] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[16:07:31] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[17:02:17] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:08:25] RECOVERY - Host parse1001.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 1.77 ms
[17:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[17:56:05] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:56:29] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[18:10:51] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:15:07] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:17:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:18:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:23:13] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms
[18:26:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.420 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:27:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 7.106 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:33:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:35:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.193 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:50:07] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:57:31] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms
[19:04:55] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:05:33] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:06:37] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:16:03] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:17:33] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:18:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.432 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:19:27] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:19:57] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:40:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [19:48:41] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [20:00:13] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:01:05] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:07:11] RECOVERY - Host parse1001.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 1.81 ms [20:15:07] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:16:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:17:33] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:19:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:19:31] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:20:13] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 1.591 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:16:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:27:33] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:30:47] PROBLEM - SSH on mw1330.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:33:41] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [21:39:02] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [21:42:09] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:44:21] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [21:48:31] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [21:54:33] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [22:31:45] RECOVERY - SSH on mw1330.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:03:57] PROBLEM - SSH on mw1328.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:40:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph 
instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly