[00:04:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:22:10] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:44:54] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:02:54] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [01:05:24] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 11 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [01:23:38] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:17:20] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:44:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:22:50] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:23:38] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:25:12] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:40:47] (CR) Andrew Bogott: [C: +2] P:wmcs::nfsclient: remove ref to secondary_nfs_servers [puppet] - https://gerrit.wikimedia.org/r/810423 (owner: Majavah) [03:44:39] (CR) Andrew Bogott: [C: +2] prometheus: openstack stale certs: ignore non-host certs [puppet] - https://gerrit.wikimedia.org/r/810425 (owner: Majavah) [03:51:36] (CR) Andrew Bogott: [C: +2] dynamicproxy: add zones endpoint [puppet] - https://gerrit.wikimedia.org/r/800775 (owner: Majavah) [04:04:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:24:58] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:21:54] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:56:50] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:09:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency 
[06:23:18] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:34:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:48:56] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220703T0700) [08:32:06] PROBLEM - MD RAID on elastic2049 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:32:07] ACKNOWLEDGEMENT - MD RAID on elastic2049 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T311939 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:32:13] SRE, ops-codfw: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (ops-monitoring-bot) [08:34:16] PROBLEM - Check systemd state on elastic2049 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service,elasticsearch_6@production-search-psi-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:42:41] SRE, ops-codfw, Discovery-Search, Elasticsearch: Degraded 
RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (Marostegui) [08:53:04] PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [09:22:42] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:59:08] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:02:34] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:24:06] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:34:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:06:32] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:21:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:22:54] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 1.212e+04 
gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:23:16] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.16.121:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.16.121:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [11:23:17] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:23:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:24:16] PROBLEM - Apache HTTP on wtp1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:24:18] PROBLEM - Apache HTTP on wtp1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:24:22] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers wtp1048.eqiad.wmnet, wtp1042.eqiad.wmnet, wtp1044.eqiad.wmnet, wtp1028.eqiad.wmnet, wtp1033.eqiad.wmnet, wtp1025.eqiad.wmnet, wtp1027.eqiad.wmnet, wtp1039.eqiad.wmnet, wtp1040.eqiad.wmnet, wtp1036.eqiad.wmnet, wtp1034.eqiad.wmnet, wtp1032.eqiad.wmnet, wtp1045.eqiad.wmnet, wtp1029.eqiad.wmnet, wtp1037.eqiad.wmnet, wtp1031.eqia [11:24:22] wtp1038.eqiad.wmnet, wtp1046.eqiad.wmnet, wtp1035.eqiad.wmnet, wtp1043.eqiad.wmnet, wtp1041.eqiad.wmnet, wtp1047.eqiad.wmnet, wtp1026.eqiad.wmnet, 
wtp1030.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1426.eqiad.wmnet, mw1380.eqiad.wmnet, mw1448.eqiad.wmnet, mw1344.eqiad.wmnet, mw1428.eqiad.wmnet, mw1450.eqiad.wmnet, mw1386.eqiad.wmnet, mw1378.eqiad.wmnet, mw1390.eqiad.wmnet, mw1388.eqiad.wmnet, mw1449.eqiad.wmnet, mw1 [11:24:22] d.wmnet, mw1424.eqiad.wmnet, mw1444.eqiad.wmnet, mw1398.eqiad.wmnet, mw1357.eqiad.wmnet, mw1423.eqiad.wmnet, mw1317.eqiad.wmnet, mw1425.eqiad.wmnet, mw1316.eqiad.wmnet, mw1312.eqiad.wmn https://wikitech.wikimedia.org/wiki/PyBal [11:24:34] PROBLEM - Apache HTTP on wtp1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:24:46] PROBLEM - Apache HTTP on wtp1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:25:02] PROBLEM - Apache HTTP on wtp1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:25:04] PROBLEM - Apache HTTP on wtp1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:25:10] PROBLEM - Apache HTTP on wtp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:25:17] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:25:22] PROBLEM - Apache HTTP on wtp1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:25:24] PROBLEM - Apache HTTP on wtp1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [11:25:38] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [11:25:40] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [11:25:44] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:25:54] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POS [11:25:58] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:26:02] PROBLEM - Apache HTTP on wtp1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:26:06] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: 
cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:26:06] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:26:06] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [11:26:08] PROBLEM - Apache HTTP on wtp1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:26:08] PROBLEM - Apache HTTP on wtp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:26:10] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:26:12] PROBLEM - Apache HTTP on wtp1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:26:14] PROBLEM - Apache HTTP on wtp1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:26:14] PROBLEM - Apache HTTP on wtp1044 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:26:14] PROBLEM - Apache HTTP on wtp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:26:20] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:26:20] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:26:23] (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:26:24] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.48.141:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.48.141:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_ [11:26:24] 9%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:26:26] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected 
status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [11:26:26] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.5 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [11:26:38] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [11:26:42] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.48.125:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.48.125:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [11:26:42] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:26:42] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.0.147:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.0.147:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28W [11:26:42] MCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:26:42] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url 
http://10.192.16.80:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.16.80:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [11:26:43] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:06] PROBLEM - Apache HTTP on wtp1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:27:14] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:20] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:20] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:20] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:20] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test 
Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:24] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.0.100:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.0.100:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28W [11:27:24] MCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:24] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.32.118:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.32.118:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_ [11:27:24] 9%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:24] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.48.120:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.48.120:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_ [11:27:25] 9%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:25] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could 
not fetch url http://10.192.16.97:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.16.97:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [11:27:26] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:26] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.32.17:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.32.17:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [11:27:27] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:27] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:32] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:32] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:32] PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: 
/en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:34] PROBLEM - Apache HTTP on wtp1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:27:34] PROBLEM - Apache HTTP on wtp1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:27:36] <_joe_> anyone else around? [11:27:40] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:41] <_joe_> I'm on the phone rn [11:27:42] PROBLEM - Apache HTTP on wtp1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:27:42] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers wtp1029.eqiad.wmnet, wtp1048.eqiad.wmnet, wtp1042.eqiad.wmnet, wtp1043.eqiad.wmnet, wtp1044.eqiad.wmnet, wtp1038.eqiad.wmnet, wtp1028.eqiad.wmnet, wtp1033.eqiad.wmnet, wtp1027.eqiad.wmnet, wtp1039.eqiad.wmnet, wtp1036.eqiad.wmnet, wtp1046.eqiad.wmnet, wtp1035.eqiad.wmnet, wtp1037.eqiad.wmnet, wtp1047.eqiad.wmnet, wtp1026.eqia [11:27:42] wtp1030.eqiad.wmnet, wtp1040.eqiad.wmnet, wtp1032.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:27:44] PROBLEM - Apache HTTP on wtp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:28:04] PROBLEM - PHP7 rendering on wtp1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:28:06] PROBLEM - PHP7 rendering on wtp1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:28:10] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:28:14] PROBLEM - Apache HTTP on wtp1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:28:18] (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:28:24] PROBLEM - Apache HTTP on wtp1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:28:26] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [11:28:32] PROBLEM - PHP7 rendering on wtp1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:28:36] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:28:40] PROBLEM - PHP7 rendering on wtp1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:28:42] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:28:46] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.48.97:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.48.97:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28W [11:28:46] MCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:28:50] PROBLEM - Apache HTTP on wtp1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:28:54] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.48.169:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein [11:28:54] t on connection while downloading 
http://10.192.48.169:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:28:56] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.01613 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [11:29:32] PROBLEM - PHP7 rendering on wtp1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:29:36] PROBLEM - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:29:38] PROBLEM - PHP7 rendering on wtp1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:29:38] PROBLEM - PHP7 rendering on wtp1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:29:38] PROBLEM - PHP7 rendering on wtp1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:29:38] PROBLEM - PHP7 rendering on wtp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:29:38] PROBLEM - PHP7 rendering on wtp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:29:44] PROBLEM - PHP7 rendering on wtp1031 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:29:44] PROBLEM - PHP7 rendering on wtp1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:29:52] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:29:56] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.48.179:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.64.48.179:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [11:29:56] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:29:56] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.192.16.81:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Timeout on connection while downloading http://10.192.16.81:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%2 [11:29:56] 2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:29:58] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:30:06] PROBLEM - PHP7 rendering on wtp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:30:14] PROBLEM - PHP7 rendering on wtp1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:30:17] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:31:02] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbas [11:31:04] PROBLEM - PHP7 rendering on wtp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:31:23] (ProbeDown) firing: (3) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:31:28] PROBLEM - PHP7 rendering on wtp1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:31:44] PROBLEM - PHP7 
rendering on wtp1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:31:44] PROBLEM - PHP7 rendering on wtp1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:31:44] PROBLEM - PHP7 rendering on wtp1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:31:44] PROBLEM - PHP7 rendering on wtp1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:31:44] PROBLEM - PHP7 rendering on wtp1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:31:46] PROBLEM - PHP7 rendering on wtp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:31:54] PROBLEM - PHP7 rendering on wtp1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:31:54] PROBLEM - PHP7 rendering on wtp1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:32:04] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/media/ [11:32:04] ck/{type} (Mathoid - check test formula) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/RESTBase [11:33:00] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 9 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:33:18] (ProbeDown) firing: (4) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:33:26] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:33:32] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v1/page/{language}/{title} (Fetch enwiki protected page) timed out before a response was received: /v1/page/{language}/{title}/{revision} (Fetch enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetl [11:33:32] /{title}/{revision} (Translate enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [11:33:48] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wik [11:33:48] 
rg/wiki/RESTBase [11:34:18] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for Apr [11:34:18] 016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was r [11:34:18] https://wikitech.wikimedia.org/wiki/Wikifeeds [11:34:40] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [11:34:45] (JobUnavailable) firing: (3) Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:34:54] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 
2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/ [11:34:54] d/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [11:35:16] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain [11:35:16] e/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [11:35:18] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [11:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:32] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [11:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:40] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2018.codfw.wmnet, restbase2024.codfw.wmnet, restbase2014.codfw.wmnet, restbase2019.codfw.wmnet, restbase2012.codfw.wmnet, restbase2017.codfw.wmnet, restbase2025.codfw.wmnet, 
restbase2013.codfw.wmnet, restbase2021.codfw.wmnet, restbase2023.codfw.wmnet, restbase2026.codfw.wmnet, restbase2020.codfw.wmnet, restbase201 [11:35:40] wmnet, restbase2022.codfw.wmnet, restbase2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:35:56] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expectin [11:35:57] https://wikitech.wikimedia.org/wiki/RESTBase [11:36:23] <_joe_> !log temporarily raised replicas for shellbox to 24 [11:36:23] (ProbeDown) firing: (4) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:38] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [11:37:06] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: 
/api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech [11:37:07] ia.org/wiki/RESTBase [11:37:28] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title}/{revision} (Fetch enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}/{revision} (Translate enwiki protected page) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections t [11:37:28] ate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [11:38:00] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase2018.codfw.wmnet, restbase2014.codfw.wmnet, restbase2019.codfw.wmnet, restbase2012.codfw.wmnet, restbase2021.codfw.wmnet, restbase2026.codfw.wmnet, restbase2020.codfw.wmnet, restbase2015.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:38:18] (ProbeDown) firing: (4) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:38:28] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/media/ [11:38:28] ck/{type} (Mathoid - check test formula) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/RESTBase [11:39:06] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-l [11:39:06] le} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page fo [11:39:06] Salt article) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:39:06] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [11:41:18] (ProbeDown) firing: (4) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:42:04] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: 
/api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/media/ [11:42:04] ck/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:42:26] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [11:44:24] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en. [11:44:24] a.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resourc [11:44:24] e} (Get offline resource links to accompany page content HTML for test page) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) [11:44:24] ut before a response was received: /en.wikipedia.org/v1/media/math/check/{type} 
(Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:44:26] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from s [11:44:26] timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out b [11:44:27] response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:44:45] (JobUnavailable) firing: (5) Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:46:22] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/RESTBase [11:46:42] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/talk [11:46:42] (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [11:48:30] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/media/ [11:48:30] ck/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:48:30] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/media/ [11:48:30] ck/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:49:45] (JobUnavailable) firing: (5) Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:50:04] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title} (Fetch enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [11:51:18] (ProbeDown) firing: (3) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:08] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/media/ [11:52:08] ck/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:52:14] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=restbase.svc.eqiad.wmnet, port=7443): Read timed out. 
(read timeout=15)): /en.wikipedia.org/v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [11:52:30] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [11:54:48] PROBLEM - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:56:00] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wik [11:56:00] rg/wiki/RESTBase [11:56:16] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200): /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia [11:56:16] i/RESTBase [11:56:52] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for 
enwiki Salt article) i [11:56:52] AL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBa [11:57:02] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=restbase.svc.codfw.wmnet, port=7443): Read timed out. (read timeout=15)): /en.wikipedia.org/v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [11:59:45] (JobUnavailable) firing: (4) Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:00:06] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title} (Fetch enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [12:01:56] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was receiv [12:01:56] wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/RESTBase [12:02:30] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [12:04:34] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/media/ [12:04:34] ck/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:04:45] (JobUnavailable) firing: (4) Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:05:20] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: regen-zoom-level-tilerator-regen.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:27] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: regen-zoom-level-tilerator-regen.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:54] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/media/ 
[12:05:54] ck/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:06:12] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.ulsfo.wikimedia.org, port=443): Read timed out. (read timeout=15)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [12:10:44] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/media/ [12:10:44] ck/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:11:18] (ProbeDown) firing: (3) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:11:42] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=restbase.svc.codfw.wmnet, port=7443): Read timed out. 
(read timeout=15)): /en.wikipedia.org/v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [12:12:00] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.codfw.wikimedia.org, port=443): Read timed out. (read timeout=15)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [12:14:45] (JobUnavailable) firing: (4) Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:16:30] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a test page on enwiki returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/summary/{title} ( [12:16:30] ary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:16:40] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki 
Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is [12:16:40] L: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200): /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:17:08] PROBLEM - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.drmrs.wikimedia.org, port=443): Read timed out. (read timeout=15)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [12:18:02] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/media/ [12:18:02] ck/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:18:04] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:18:58] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get 
mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-sections for a te [12:18:58] on enwiki returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed [12:18:58] ore a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (M [12:18:58] check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:19:40] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}/{revision} (Translate enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [12:19:45] (JobUnavailable) firing: (4) Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:20:48] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [12:21:29] (ProbeDown) firing: (3) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown 
- https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:22:26] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:23:08] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/media/ [12:23:08] ck/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:23:08] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/media/ [12:23:08] ck/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:23:18] (ProbeDown) firing: (3) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:23:58] PROBLEM - Restbase LVS codfw on 
restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en. [12:23:58] a.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany pa [12:23:58] nt HTML for test page) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wiki [12:23:58] imedia.org/wiki/RESTBase [12:24:16] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech [12:24:16] ia.org/wiki/RESTBase [12:24:45] (JobUnavailable) firing: (4) Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:26:18] (ProbeDown) firing: (3) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:26:34] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structur [12:26:34] page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:26:54] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [12:27:38] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:27:50] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:28:20] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v1/page/{language}/{title}/{revision} (Fetch enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/CX [12:28:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:29:32] Afternoon [12:29:45] (JobUnavailable) firing: (4) Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:29:48] Were you running a 'train' at the moment? [12:30:41] I was noticing that the Linter service a script I use on Wikisource accesses was giving a lot of 503 errors [12:33:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [12:33:56] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wik [12:33:56] rg/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt
article) timed out before a response was received: /en.wikipedia.org/v1/feed/an [12:33:56] nts (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:34:14] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [12:34:45] (JobUnavailable) firing: (3) Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:36:18] (ProbeDown) firing: (3) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:37:06] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [12:38:14] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.eqsin.wikimedia.org, port=443): Read timed out. 
(read timeout=15)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [12:39:45] (JobUnavailable) resolved: Reduced availability for job swagger_check_restbase_cluster_codfw in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:42:28] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:42:42] RECOVERY - Apache HTTP on wtp1030 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 5.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:42:42] RECOVERY - Apache HTTP on wtp1037 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 5.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:42:46] RECOVERY - PHP7 rendering on wtp1025 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:42:46] RECOVERY - PHP7 rendering on wtp1047 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:42:58] RECOVERY - Apache HTTP on wtp1029 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 1.305 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:43:06] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:43:16] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [12:43:36] RECOVERY - Apache HTTP on wtp1028 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 7.130 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:43:38] 
RECOVERY - PHP7 rendering on wtp1033 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:43:46] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:43:50] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:44:00] RECOVERY - PHP7 rendering on wtp1040 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 7.840 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:44:06] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [12:44:10] RECOVERY - PHP7 rendering on wtp1032 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 6.252 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:44:32] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [12:44:38] RECOVERY - Apache HTTP on wtp1027 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 8.389 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:44:44] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:44:46] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:44:46] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:44:57] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:45:04] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:45:04] RECOVERY - Apache HTTP on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 5.702 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:45:04] RECOVERY - Apache HTTP on wtp1042 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 5.988 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:45:40] RECOVERY - Apache HTTP on wtp1040 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:45:46] RECOVERY - PHP7 rendering on wtp1027 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:45:46] RECOVERY - Apache HTTP on wtp1025 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:45:50] RECOVERY - PHP7 rendering on wtp1044 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 3.459 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:45:58] RECOVERY - Apache HTTP on wtp1046 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 7.787 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:46:00] RECOVERY - Apache HTTP on wtp1044 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 0.306 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:46:06] RECOVERY - Apache HTTP on wtp1032 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:46:06] RECOVERY - Apache HTTP on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 7.012 second 
response time https://wikitech.wikimedia.org/wiki/Application_servers [12:46:07] RECOVERY - Apache HTTP on wtp1035 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 7.373 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:46:10] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:10] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:10] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:12] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:12] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:23] (ProbeDown) firing: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:46:24] RECOVERY - PHP7 rendering on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:46:24] RECOVERY - PHP7 rendering on wtp1042 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:46:24] RECOVERY - PHP7 rendering on wtp1034 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.050 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:46:26] RECOVERY - PHP7 rendering on wtp1041 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 1.740 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:46:26] RECOVERY - PHP7 rendering on wtp1039 is OK: HTTP OK: HTTP/1.1 302 Found - 562 bytes in 2.458 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:46:30] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:46:34] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:46:34] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:34] RECOVERY - Apache HTTP on wtp1033 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:46:34] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:34] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:36] RECOVERY - PHP7 rendering on wtp1045 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 0.531 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:46:37] RECOVERY - Apache HTTP on wtp1047 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 0.622 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:46:42] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:46:44] 
RECOVERY - PHP7 rendering on wtp1028 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:46:48] RECOVERY - PHP7 rendering on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:46:50] RECOVERY - Apache HTTP on wtp1038 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:46:50] RECOVERY - PHP7 rendering on wtp1030 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:46:50] RECOVERY - PHP7 rendering on wtp1038 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:46:50] RECOVERY - PHP7 rendering on wtp1046 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:46:50] RECOVERY - PHP7 rendering on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:46:52] RECOVERY - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:46:52] RECOVERY - Apache HTTP on wtp1031 is OK: HTTP OK: HTTP/1.1 302 Found - 548 bytes in 2.193 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:46:54] RECOVERY - PHP7 rendering on wtp1031 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 0.886 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:47:00] RECOVERY - PHP7 rendering on wtp1029 is OK: HTTP 
OK: HTTP/1.1 302 Found - 560 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:47:12] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:16] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:16] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:16] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:16] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:16] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:17] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:17] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:18] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:18] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:19] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:19] RECOVERY - restbase endpoints health on restbase2020 
is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:20] RECOVERY - PHP7 rendering on wtp1035 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:47:20] RECOVERY - Apache HTTP on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:47:21] RECOVERY - Apache HTTP on wtp1039 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:47:22] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:47:24] RECOVERY - Apache HTTP on wtp1041 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:47:24] RECOVERY - Apache HTTP on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:47:30] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:30] RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:30] RECOVERY - PHP7 rendering on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:47:30] RECOVERY - Apache HTTP on wtp1034 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:47:58] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/RESTBase [12:48:04] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:48:10] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:12] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:18] (ProbeDown) resolved: Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:48:20] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:22] RECOVERY - PHP7 rendering on wtp1037 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:48:22] RECOVERY - Apache HTTP on wtp1045 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:48:32] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:32] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:48:36] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:36] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:50] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:51:18] (ProbeDown) resolved: (2) Service parsoid-php:443 has failed probes (http_parsoid-php_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:42:36] SRE-swift-storage, Data-Persistence, Maps, Product-Infrastructure-Team-Backlog, Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (Aklapper) [13:46:56] SRE, Beta-Cluster-Infrastructure, Quality-and-Test-Engineering-Team (QTE), Traffic, and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (Aklapper) [13:47:32] SRE, Beta-Cluster-Infrastructure, Quality-and-Test-Engineering-Team (QTE), Traffic, and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (Aklapper) [13:47:47] SRE, Beta-Cluster-Infrastructure, Quality-and-Test-Engineering-Team (QTE), Traffic, and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (Aklapper) [13:54:13] (PS6) David Caro: cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810368 [14:01:05] (CR) CI reject: [V: -1] cloudnet: add show, reboot_node and roll_reboot_cloudnets [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810368 (owner: David Caro) [14:34:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance
cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:06:30] (PS2) David Caro: Add mypy, black and isort tests [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810295 [15:12:20] (CR) CI reject: [V: -1] Add mypy, black and isort tests [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810295 (owner: David Caro) [15:31:43] (PS3) David Caro: Add mypy, black and isort tests [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/810295 [16:13:26] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:14:14] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:16:44] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:18:26] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:33:10] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:39:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory -
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:39:56] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:44:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:49:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:52:32] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 26.35 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [16:52:50] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 33.81 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [16:53:00] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 36.87 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [16:54:17] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running 
the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:55:20] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 95.36 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [16:55:30] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 100.8 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [16:57:32] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 90.82 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [17:20:11] (PS1) Andrew Bogott: Cloud metrics: split out prometheus config based on deployment. [puppet] - https://gerrit.wikimedia.org/r/810542 (https://phabricator.wikimedia.org/T311811) [17:21:30] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:22:24] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:25:30] (CR) Majavah: [C: +1] Cloud metrics: split out prometheus config based on deployment.
[puppet] - https://gerrit.wikimedia.org/r/810542 (https://phabricator.wikimedia.org/T311811) (owner: Andrew Bogott) [17:33:40] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:34:34] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:34:38] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:02:54] (CR) Andrew Bogott: [C: +2] Cloud metrics: split out prometheus config based on deployment. [puppet] - https://gerrit.wikimedia.org/r/810542 (https://phabricator.wikimedia.org/T311811) (owner: Andrew Bogott) [18:16:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:21:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:01:20] (PS1) Andrew Bogott: head.conf: duplicate www_authenticate_url as www_authenticate_uri [puppet] - https://gerrit.wikimedia.org/r/810543 [19:02:19] (CR) Andrew Bogott: [C: +2] head.conf: duplicate www_authenticate_url as www_authenticate_uri [puppet] - https://gerrit.wikimedia.org/r/810543 (owner: Andrew Bogott) [19:41:01] SRE-swift-storage, Commons: New broken files (premature end of file) that were cross-wiki uploaded to Commons - https://phabricator.wikimedia.org/T284188 (Aklapper)
p:High→Triage 13 months later, is this still an issue? Test case got deleted in the meantime... [20:02:56] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:54:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:38:36] PROBLEM - puppet last run on ms-be2037 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:42:22] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:43:24] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:09:55] (PS1) Andrew Bogott: magnum.conf: change the domain admin name [puppet] - https://gerrit.wikimedia.org/r/810549 [22:11:09] (CR) Andrew Bogott: [C: +2] magnum.conf: change the domain admin name [puppet] - https://gerrit.wikimedia.org/r/810549 (owner: Andrew Bogott) [22:14:10] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:29:25] (PS1) Andrew Bogott: Revert "magnum.conf: change the domain
admin name" [puppet] - https://gerrit.wikimedia.org/r/810516 [22:31:47] (CR) Andrew Bogott: [C: +2] Revert "magnum.conf: change the domain admin name" [puppet] - https://gerrit.wikimedia.org/r/810516 (owner: Andrew Bogott) [22:49:40] (PS1) Stang: trwiki: Change old and new vector logos for 500k articles [mediawiki-config] - https://gerrit.wikimedia.org/r/810550 (https://phabricator.wikimedia.org/T311946) [23:00:06] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:33:06] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:34:36] PROBLEM - MegaRAID on an-worker1082 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:42:54] (CR) Ori: "This change is ready for review." [software/varnish/libvmod-querysort] - https://gerrit.wikimedia.org/r/810551 (https://phabricator.wikimedia.org/T138093) (owner: Ori) [23:44:29] (PS4) Ori: Initial Debian packaging [software/varnish/libvmod-querysort] - https://gerrit.wikimedia.org/r/810551 (https://phabricator.wikimedia.org/T138093) [23:46:06] RECOVERY - MegaRAID on an-worker1082 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring