[01:03:50] (PS1) Zabe: Fix erroneous en-gb translations [core] (wmf/1.38.0-wmf.1) - https://gerrit.wikimedia.org/r/723729 [01:05:47] (PS2) Zabe: Fix erroneous en-gb translations [core] (wmf/1.38.0-wmf.1) - https://gerrit.wikimedia.org/r/723729 [01:05:56] (PS3) Zabe: Fix erroneous en-gb translations [core] (wmf/1.38.0-wmf.1) - https://gerrit.wikimedia.org/r/723729 (https://phabricator.wikimedia.org/T291717) [01:18:14] (CR) DannyS712: "The tests are going to fail because https://gerrit.wikimedia.org/r/c/mediawiki/core/+/723688 wasn't backported to the wmf branches" [core] (wmf/1.38.0-wmf.1) - https://gerrit.wikimedia.org/r/723729 (https://phabricator.wikimedia.org/T291717) (owner: Zabe) [01:24:27] (CR) jerkins-bot: [V: -1] Fix erroneous en-gb translations [core] (wmf/1.38.0-wmf.1) - https://gerrit.wikimedia.org/r/723729 (https://phabricator.wikimedia.org/T291717) (owner: Zabe) [01:36:24] (PS1) Zabe: PHPUnit: enable convertDeprecationsToExceptions [core] (wmf/1.38.0-wmf.1) - https://gerrit.wikimedia.org/r/723703 (https://phabricator.wikimedia.org/T291731) [01:36:57] (PS4) Zabe: Fix erroneous en-gb translations [core] (wmf/1.38.0-wmf.1) - https://gerrit.wikimedia.org/r/723729 (https://phabricator.wikimedia.org/T291717) [01:57:56] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1005 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:59:06] PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 8.19e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [01:59:54] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [02:38:48] PROBLEM - PHD should be
supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [02:40:54] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [03:02:22] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 1424 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:02:30] PROBLEM - PHP7 rendering on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:31] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [03:02:32] PROBLEM - Apache HTTP on mw1442 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:02:34] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 1887 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:03:10] PROBLEM - Apache HTTP on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:18] PROBLEM - Apache HTTP on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:20] PROBLEM - Apache HTTP on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:28] PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:34] PROBLEM - PHP7 rendering on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:34] PROBLEM - Apache HTTP on mw1430 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:34] PROBLEM - Apache HTTP on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:34] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:36] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.005025 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [03:03:38] PROBLEM - PHP7 rendering on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:38] PROBLEM - PHP7 rendering on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:42] PROBLEM - Apache HTTP on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:42] PROBLEM - PHP7 rendering on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:42] PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:42] PROBLEM - PHP7 rendering on mw1425 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:42] PROBLEM - PHP7 rendering on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:42] PROBLEM - Apache HTTP on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:48] PROBLEM - PHP7 rendering on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:50] PROBLEM - PHP7 rendering on mw1412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:50] PROBLEM - Apache HTTP on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:50] PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:50] PROBLEM - Apache HTTP on mw1392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:50] PROBLEM - Apache HTTP on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:50] PROBLEM - PHP7 rendering on mw1392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:51] PROBLEM - Apache HTTP on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:51] PROBLEM - PHP7 rendering on mw1386 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:56] * legoktm is logging on [03:03:58] PROBLEM - PHP7 rendering on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:58] PROBLEM - PHP7 rendering on mw1428 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:58] PROBLEM - Apache HTTP on mw1406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:58] PROBLEM - PHP7 rendering on mw1381 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:58] PROBLEM - Apache HTTP on mw1359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:03:59] PROBLEM - PHP7 rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:59] PROBLEM - PHP7 rendering on mw1357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:02] PROBLEM - PHP7 rendering on mw1443 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:04] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POS [03:04:06] PROBLEM - PHP7 rendering on mw1406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:14] PROBLEM - PHP7 rendering on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:14] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:04:18] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [03:04:18] PROBLEM - PHP7 rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:20] PROBLEM - PHP7 rendering on mw1402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:20] PROBLEM - Apache HTTP on mw1423 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:20] PROBLEM - Apache HTTP on mw1386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:21] 
PROBLEM - Apache HTTP on mw1435 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:21] PROBLEM - Apache HTTP on mw1428 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:21] PROBLEM - Apache HTTP on mw1424 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:21] PROBLEM - PHP7 rendering on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:24] PROBLEM - PHP7 rendering on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:24] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for Apr [03:04:24] 016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{dom [03:04:24] page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [03:04:26] PROBLEM - PHP7 rendering on mw1444 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:26] PROBLEM - Apache HTTP on mw1402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:26] PROBLEM - Apache HTTP on mw1447 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:26] PROBLEM - PHP7 rendering on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:26] PROBLEM - PHP7 rendering on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:27] PROBLEM - PHP7 rendering on mw1390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:27] PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:28] PROBLEM - Apache HTTP on mw1452 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:28] PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:29] PROBLEM - Apache HTTP on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:29] PROBLEM - PHP7 rendering on mw1407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:30] PROBLEM - PHP7 rendering on mw1426 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 
[03:04:30] PROBLEM - PHP7 rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:31] PROBLEM - Apache HTTP on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:31] PROBLEM - PHP7 rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:32] PROBLEM - Apache HTTP on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:32] PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:33] PROBLEM - PHP7 rendering on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:33] PROBLEM - PHP7 rendering on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:34] PROBLEM - Apache HTTP on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:34] PROBLEM - Apache HTTP on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:35] PROBLEM - Apache HTTP on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:35] PROBLEM - Apache HTTP on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:36] PROBLEM - PHP7 rendering on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 
[03:04:36] PROBLEM - Apache HTTP on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:04:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [03:07:02] PROBLEM - PHP7 rendering on mw1418 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:07:02] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:07:02] PROBLEM - Varnish has reduced HTTP availability #page on alert1001 is CRITICAL: job=varnish-text https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [03:07:04] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:07:18] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:09:00] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid 
alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [03:09:52] PROBLEM - Apache HTTP on wtp1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:09:52] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [03:10:18] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/suggest/source/{title}/{to} (Suggest a source title to use for transla [03:10:18] med out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [03:10:46] PROBLEM - LVS eventgate-analytics-external eqiad port 4692/tcp - EventGate analytics external endpoint- eventgate-analytics-external.svc.eqiad.wmnet and intake-analytics.wikimedia.org IPv4 on eventgate-analytics-external.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.52 and port 4692: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:11:20] PROBLEM - ircecho is not relaying messages - codfw on irc2001 is CRITICAL: 0.1167 lt 1 https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho [03:11:50] PROBLEM - ircecho is not relaying messages - eqiad on irc1001 is CRITICAL: 0.08333 lt 1 https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho [03:11:50] RECOVERY - Apache HTTP on 
wtp1042 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:11:58] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:13:12] SRE, Traffic, Wikimedia-production-error: 2021-09-25 Wikimedia sites down - https://phabricator.wikimedia.org/T291765 (DannyS712) [03:13:20] SRE, Traffic, Wikimedia-production-error: 2021-09-25 Wikimedia sites down - https://phabricator.wikimedia.org/T291765 (DannyS712) p:Triage→Unbreak! [03:14:02] !log killing queries on db1105 [03:14:02] PROBLEM - LVS eventgate-analytics-external codfw port 4692/tcp - EventGate analytics external endpoint- eventgate-analytics-external.svc.codfw.wmnet and intake-analytics.wikimedia.org IPv4 on eventgate-analytics-external.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.52 and port 4692: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:14:24] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics-external_4692: Servers kubernetes2015.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/ [03:14:24] al [03:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:14:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [03:15:00] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL:
PYBAL CRITICAL - CRITICAL - eventgate-analytics-external_4692: Servers kubernetes2002.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/ [03:15:00] al [03:15:54] RECOVERY - PHP7 rendering on mw1449 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.691 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:15:56] RECOVERY - Apache HTTP on mw1415 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.768 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:00] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:16:00] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:16:06] !log killed queries on db1099 [03:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:16:10] RECOVERY - Apache HTTP on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.395 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:10] RECOVERY - PHP7 rendering on mw1416 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.220 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:16] RECOVERY - Apache HTTP on mw1448 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.557 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:16] RECOVERY - PHP7 rendering on mw1405 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.737 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 
[03:16:16] RECOVERY - PHP7 rendering on mw1455 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.848 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:17] RECOVERY - PHP7 rendering on mw1345 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.518 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:18] RECOVERY - PHP7 rendering on mw1454 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.081 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:18] RECOVERY - PHP7 rendering on mw1397 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.558 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:18] RECOVERY - PHP7 rendering on mw1435 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.648 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:20] RECOVERY - Apache HTTP on mw1385 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.805 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:22] RECOVERY - PHP7 rendering on mw1436 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.538 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:22] RECOVERY - PHP7 rendering on mw1414 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:22] RECOVERY - Apache HTTP on mw1426 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.325 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:22] RECOVERY - Apache HTTP on mw1322 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.541 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:24] RECOVERY - Apache HTTP 
on mw1405 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.853 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:24] RECOVERY - Apache HTTP on mw1430 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.705 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:24] RECOVERY - Apache HTTP on mw1444 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.825 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:24] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:26] RECOVERY - Apache HTTP on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.950 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:26] RECOVERY - Apache HTTP on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.447 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:26] RECOVERY - Apache HTTP on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.551 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:26] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [03:16:27] RECOVERY - Apache HTTP on mw1421 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.547 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:28] RECOVERY - PHP7 rendering on mw1429 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:28] RECOVERY - Apache HTTP on mw1420 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.017 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:30] RECOVERY - PHP7 rendering on mw1341 is OK: HTTP OK: HTTP/1.1 302 Found - 650 
bytes in 8.199 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:32] RECOVERY - PHP7 rendering on mw1323 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.453 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:32] RECOVERY - Apache HTTP on mw1395 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.331 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:34] RECOVERY - PHP7 rendering on mw1333 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:34] RECOVERY - PHP7 rendering on mw1393 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:34] RECOVERY - PHP7 rendering on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:34] RECOVERY - Apache HTTP on mw1332 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:34] RECOVERY - PHP7 rendering on mw1322 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:35] RECOVERY - PHP7 rendering on mw1425 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.455 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:36] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:36] RECOVERY - Apache HTTP on mw1367 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.082 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [03:16:36] RECOVERY - Apache HTTP on mw1394 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.224 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:37] RECOVERY - PHP7 rendering on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:37] RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.692 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:38] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.812 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:40] RECOVERY - Apache HTTP on mw1432 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:40] RECOVERY - PHP7 rendering on mw1442 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:40] RECOVERY - PHP7 rendering on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:40] RECOVERY - Apache HTTP on mw1418 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:40] RECOVERY - PHP7 rendering on mw1320 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:41] RECOVERY - PHP7 rendering on mw1327 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:41] RECOVERY - PHP7 
rendering on mw1412 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.513 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:41] RECOVERY - PHP7 rendering on mw1331 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:42] RECOVERY - Apache HTTP on mw1414 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:42] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:16:43] RECOVERY - PHP7 rendering on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.816 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:43] RECOVERY - PHP7 rendering on mw1389 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.519 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:44] RECOVERY - PHP7 rendering on mw1386 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.243 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:44] RECOVERY - Apache HTTP on mw1343 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.384 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:45] RECOVERY - PHP7 rendering on mw1456 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:45] RECOVERY - Apache HTTP on mw1403 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:46] RECOVERY - PHP7 rendering on mw1364 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 
0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:46] RECOVERY - PHP7 rendering on mw1451 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:47] RECOVERY - PHP7 rendering on mw1368 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:47] RECOVERY - Apache HTTP on mw1392 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.465 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:48] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.974 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:48] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:16:49] RECOVERY - PHP7 rendering on mw1392 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.806 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:49] RECOVERY - LVS api eqiad port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 24865 bytes in 7.126 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:16:50] RECOVERY - PHP7 rendering on mw1381 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.682 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:50] RECOVERY - PHP7 rendering on mw1428 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.971 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:51] RECOVERY - PHP7 rendering on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 
650 bytes in 7.541 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:51] RECOVERY - PHP7 rendering on mw1443 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.756 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:52] RECOVERY - PHP7 rendering on mw1316 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.174 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:16:52] RECOVERY - Apache HTTP on mw1406 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.394 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:16:56] RECOVERY - PHP7 rendering on mw1406 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:17:00] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:17:02] RECOVERY - PHP7 rendering on mw1452 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:17:02] RECOVERY - Apache HTTP on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:17:02] RECOVERY - Apache HTTP on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:17:02] RECOVERY - Apache HTTP on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:17:02] RECOVERY - PHP7 rendering on mw1391 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.055 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:17:04] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:17:06] RECOVERY - PHP7 rendering on mw1319 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:17:06] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:17:06] RECOVERY - Apache HTTP on mw1435 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:17:07] RECOVERY - Apache HTTP on mw1386 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:17:07] RECOVERY - Apache HTTP on mw1424 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:17:07] RECOVERY - PHP7 rendering on mw1415 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:17:07] RECOVERY - Apache HTTP on mw1416 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:17:07] RECOVERY - PHP7 rendering on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:17:08] RECOVERY - Apache HTTP on mw1431 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:17:08] RECOVERY - PHP7 rendering on mw1434 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes 
in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:17:09] RECOVERY - Apache HTTP on mw1417 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:17:09] RECOVERY - Apache HTTP on mw1436 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:17:10] RECOVERY - PHP7 rendering on mw1346 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 2.290 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:17:10] RECOVERY - Apache HTTP on mw1428 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.359 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:17:11] RECOVERY - LVS eventgate-analytics-external eqiad port 4692/tcp - EventGate analytics external endpoint- eventgate-analytics-external.svc.eqiad.wmnet and intake-analytics.wikimedia.org IPv4 on eventgate-analytics-external.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 668 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:17:11] RECOVERY - PHP7 rendering on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.866 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:17:12] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:17:40] gg icigna [03:18:05] i think that's how you know you killed the right queries [03:18:09] mhm [03:18:50] RECOVERY - PHP7 rendering on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.258 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:18:50] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools 
are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:18:50] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:18:50] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:18:50] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:18:50] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:18:52] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:18:52] RECOVERY - Apache HTTP on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.238 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:18:54] RECOVERY - Apache HTTP on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.982 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:18:54] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:18:56] RECOVERY - PHP7 rendering on mw1357 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:18:58] RECOVERY - Apache HTTP on mw1359 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.239 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:19:22] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:19:40] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints 
are healthy https://wikitech.wikimedia.org/wiki/Citoid [03:19:42] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [03:19:46] RECOVERY - phpfpm_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 0.9544 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:19:50] RECOVERY - Varnish has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [03:19:51] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:20:02] PROBLEM - MariaDB Replica SQL: s1 #page on db1105 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:20:09] PROBLEM - MariaDB Replica IO: s1 #page on db1105 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:20:12] RECOVERY - Apache HTTP on mw1429 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:20:28] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [03:20:32] incident response is 
happening in #mediawiki_security [03:20:42] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:21:07] (aw wishful thinking) [03:21:16] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:21:17] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [03:21:52] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [03:22:00] SRE, Traffic: 2021-09-25 Wikimedia sites down - https://phabricator.wikimedia.org/T291765 (Peachey88) [03:22:09] RECOVERY - MariaDB Replica SQL: s1 #page on db1105 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:22:17] RECOVERY - MariaDB Replica IO: s1 #page on db1105 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:22:32] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [03:22:35] SRE, Traffic: 2021-09-26 (UTC) Wikimedia sites down - https://phabricator.wikimedia.org/T291765 (Peachey88) [03:22:59] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6347 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [03:23:03] SRE, Traffic, Wikimedia-Incident: 2021-09-26 (UTC) Wikimedia sites down - https://phabricator.wikimedia.org/T291765 (AntiCompositeNumber) [03:23:04] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [03:23:50] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:24:01] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.7796 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [03:24:02] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:24:14] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM 
workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.04839 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [03:24:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [03:25:02] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [03:25:42] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:26:16] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [03:26:30] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [03:29:24] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [03:29:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [03:32:07] SRE, Traffic, Wikimedia-Incident: 2021-09-26 (UTC) Wikimedia sites down - https://phabricator.wikimedia.org/T291765 (ItsPugle) The production (live) sites seem to be back up now, albeit site loading seems to be a bit slower than usual (presumably due to an influx of requests as the site came back onl... [03:37:47] SRE, Traffic, Wikimedia-Incident: 2021-09-26 (UTC) Wikimedia sites down - https://phabricator.wikimedia.org/T291765 (PorkchopGMX) This error appeared as HTTP 503 when trying to log into Phabricator via MediaWiki. For a brief moment, the standard “our servers are undergoing maintenance” error appeared... 
[03:38:55] y'all must love the site is down phab tasks :') [03:39:43] as long as there's only one [03:39:53] (public task, at least) [03:44:16] PROBLEM - snapshot of s3 in codfw on alert1001 is CRITICAL: snapshot for s3 at codfw taken more than 3 days ago: Most recent backup 2021-09-23 03:33:51 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [03:47:06] SRE, Traffic, Wikimedia-Incident: 2021-09-26 (UTC) Wikimedia sites down - https://phabricator.wikimedia.org/T291765 (Andrew) Open→Resolved a:Andrew SREs are investigating and responding to this issue; it should be largely resolved by now. As a DOS-related issue the specifics will not be... [04:05:36] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:12:20] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210926T0700) [08:38:34] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [08:40:30] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 20 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [12:33:18] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [12:33:24] RECOVERY - WDQS high update lag on wdqs1005 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.143e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:33:24] SRE, Wikimedia-Mailing-lists: Mailing List for the Wikimedians of United Arab Emirates User Group - https://phabricator.wikimedia.org/T291769 (Vikoula5) [12:39:20] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [13:32:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [13:37:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [13:58:06] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [13:58:16] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:42] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [14:10:52] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:15] (PS9) Majavah: Make `webservice shell` scriptable [software/tools-webservice] - https://gerrit.wikimedia.org/r/621776 (https://phabricator.wikimedia.org/T169695) (owner: BryanDavis) [14:40:44] (CR) Majavah: [C: +2] "Fixed merge conflicts in PS9. I tested this again properly and it works fine, let's ship it!" 
[software/tools-webservice] - https://gerrit.wikimedia.org/r/621776 (https://phabricator.wikimedia.org/T169695) (owner: BryanDavis) [14:41:23] (Merged) jenkins-bot: Make `webservice shell` scriptable [software/tools-webservice] - https://gerrit.wikimedia.org/r/621776 (https://phabricator.wikimedia.org/T169695) (owner: BryanDavis) [14:51:16] !log volker-e@deploy1002 Started deploy [design/style-guide@aac0ae9]: Deploy design/style-guide: aac0ae9 “Apps”: Fix image path (#490) [14:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:22] !log volker-e@deploy1002 Finished deploy [design/style-guide@aac0ae9]: Deploy design/style-guide: aac0ae9 “Apps”: Fix image path (#490) (duration: 00m 06s) [14:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:37] SRE, Commons, Thumbor: Image thumbnail not displayed - https://phabricator.wikimedia.org/T291763 (Multichill) [15:30:01] SRE, Thumbor: Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests" - https://phabricator.wikimedia.org/T285875 (AntiCompositeNumber) [16:04:17] (PS1) Majavah: P::toolforge: Use composer package on buster [puppet] - https://gerrit.wikimedia.org/r/723760 (https://phabricator.wikimedia.org/T287900) [17:33:26] PROBLEM - WDQS high update lag on wdqs1013 is CRITICAL: 5.065e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:18:16] (PS1) Majavah: Identify when venvs are for wrong Python versions [software/tools-webservice] - https://gerrit.wikimedia.org/r/723761 (https://phabricator.wikimedia.org/T276626) [19:00:54] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[21:17:00] RECOVERY - WDQS high update lag on wdqs1013 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.142e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:38:36] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [22:44:42] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.99 ms