[00:22:15] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 180 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:23:07] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 544 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:24:24] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2069 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [00:24:38] looking [00:25:31] high db errors from s2 [00:26:15] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 1 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [00:27:05] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [00:27:25] PROBLEM - Apache HTTP on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:27:36] PROBLEM - PHP7 rendering on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:27:36] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.1836 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [00:27:45] Hey I'll be there in 3 [00:27:49] PROBLEM - PHP7 rendering on mw1392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:27:50] PROBLEM - PHP7 rendering on mw1408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:27:50] PROBLEM - PHP7 rendering on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:27:51] PROBLEM - Apache HTTP on mw1360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:27:51] PROBLEM - Apache HTTP on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:27:51] PROBLEM - Apache HTTP on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:27:51] PROBLEM - PHP7 rendering on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:05] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [00:28:11] PROBLEM - Apache HTTP on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:19] PROBLEM - PHP7 rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:23] PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:23] PROBLEM - Apache HTTP on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:23] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POS [00:28:29] PROBLEM - Apache HTTP on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:29] PROBLEM - Apache HTTP on mw1410 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:29] PROBLEM - PHP7 rendering on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:33] PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:33] PROBLEM - PHP7 rendering on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:33] PROBLEM - PHP7 rendering on mw1359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:33] PROBLEM - PHP7 rendering on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:35] PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:35] PROBLEM - PHP7 rendering on mw1423 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:35] PROBLEM - Apache HTTP on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:35] PROBLEM - Apache HTTP on mw1425 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:35] PROBLEM - PHP7 rendering on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:39] PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:39] PROBLEM - PHP7 rendering on mw1421 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:39] PROBLEM - PHP7 rendering on mw1428 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:39] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw1422.eqiad.wmnet, mw1426.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1404.eqiad.wmnet, mw1392.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1375.eqiad.wmnet, mw1358.eqiad.wmnet, mw1386.eqiad.wmnet, mw1342.eqiad.wmnet, mw1378.eqiad.wmnet, mw1390.eqiad.wmnet, mw1381.eqiad.wmnet, mw1343.eqiad.wmnet, mw1421.eqiad.wmn [00:28:39] 77.eqiad.wmnet, mw1345.eqiad.wmnet, mw1396.eqiad.wmnet, mw1314.eqiad.wmnet, mw1424.eqiad.wmnet, mw1412.eqiad.wmnet, mw1408.eqiad.wmnet, mw1398.eqiad.wmnet, mw1363.eqiad.wmnet, mw1357.eqiad.wmnet, mw1423.eqiad.wmnet, mw1317.eqiad.wmnet, mw1400.eqiad.wmnet, mw1383.eqiad.wmnet, mw1427.eqiad.wmnet, mw1406.eqiad.wmnet, mw1443.eqiad.wmnet, mw1394.eqiad.wmnet, mw1382.eqiad.wmnet, mw1315.eqiad.wmnet, mw1341.eqiad.wmnet, mw1402.eqiad.wmnet, mw1356 [00:28:40] mnet, mw1313.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1422.eqiad.wmnet, mw1426.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1348.eqiad.wmnet, mw1404.eq https://wikitech.wikimedia.org/wiki/PyBal [00:28:43] PROBLEM - Apache HTTP on mw1386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:43] PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:43] PROBLEM - PHP7 rendering on mw1386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:45] PROBLEM - Apache HTTP on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:45] PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:45] PROBLEM - Apache HTTP on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:45] PROBLEM - Apache HTTP on mw1402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:45] PROBLEM - Apache HTTP on mw1390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:47] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:28:47] PROBLEM - Apache HTTP on mw1357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:47] PROBLEM - PHP7 rendering on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:48] PROBLEM - Apache HTTP on mw1421 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:48] PROBLEM - Apache HTTP on mw1404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:48] PROBLEM - Apache HTTP on mw1406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:49] PROBLEM - PHP7 rendering on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:49] PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:51] PROBLEM - PHP7 rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:53] PROBLEM - Apache HTTP on mw1392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:54] PROBLEM - PHP7 rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:54] PROBLEM - Apache HTTP on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:55] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou [00:28:55] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [00:28:55] PROBLEM - PHP7 rendering on mw1406 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:55] PROBLEM - Apache HTTP on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:55] PROBLEM - Apache HTTP on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:56] PROBLEM - PHP7 rendering on mw1425 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:56] PROBLEM - PHP7 rendering on mw1360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:57] PROBLEM - Apache HTTP on mw1383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:57] PROBLEM - Apache HTTP on mw1378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:58] PROBLEM - Apache HTTP on mw1381 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:28:58] PROBLEM - PHP7 rendering on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:59] PROBLEM - PHP7 rendering on mw1381 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:28:59] PROBLEM - PHP7 rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:00] PROBLEM - PHP7 rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:00] PROBLEM - Apache HTTP on mw1359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:01] PROBLEM - PHP7 rendering on mw1378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:03] PROBLEM - PHP7 rendering on mw1422 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:03] PROBLEM - Apache HTTP on mw1428 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:13] PROBLEM - PHP7 rendering on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:13] PROBLEM - Apache HTTP on mw1444 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:13] PROBLEM - Apache HTTP on mw1424 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:15] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [00:29:17] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:29:21] PROBLEM - PHP7 rendering on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:21] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:23] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou [00:29:23] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [00:29:23] PROBLEM - PHP7 rendering on mw1402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:23] PROBLEM - PHP7 rendering on mw1443 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:25] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.8904 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [00:29:25] PROBLEM - Apache HTTP on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:27] PROBLEM - PHP7 rendering on mw1398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:28] PROBLEM - PHP7 rendering on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:28] PROBLEM - Apache HTTP on mw1377 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:28] PROBLEM - PHP7 rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:29] PROBLEM - Apache HTTP on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:30] PROBLEM - Apache HTTP on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:33] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:33] PROBLEM - PHP7 rendering on mw1424 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:33] PROBLEM - Apache HTTP on mw1443 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:33] PROBLEM - PHP7 rendering on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:35] PROBLEM - PHP7 rendering on mw1356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:35] PROBLEM - Apache HTTP on mw1374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:39] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [00:29:41] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:29:41] PROBLEM - PHP7 rendering on mw1412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:42] PROBLEM - PHP7 rendering on mw1410 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:42] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:29:42] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:29:45] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:29:45] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:29:45] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read arti [00:29:45] January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [00:29:45] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:29:46] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:29:47] PROBLEM - PHP7 rendering on mw1362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:47] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:29:47] PROBLEM - PHP7 rendering on mw1444 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:48] PROBLEM - Apache HTTP on mw1412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:48] PROBLEM - PHP7 rendering on mw1390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:49] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [00:29:49] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:29:50] PROBLEM - Apache HTTP on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:50] PROBLEM - PHP7 rendering on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:51] PROBLEM - PHP7 rendering on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:51] PROBLEM - PHP7 rendering on mw1357 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:29:52] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:29:52] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve [00:29:53] metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sectio [00:29:53] ned the unexpected status 503 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [00:29:54] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:29:54] PROBLEM - Apache HTTP on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:55] PROBLEM - Apache HTTP on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:29:55] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:29:56] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was [00:29:56] : /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received: /{domain}/v1/page/random/title (retr [00:29:57] andom article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [00:29:57] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response [00:29:58] ived: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-secti [00:29:58] CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected statu [00:29:59] xpecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [00:29:59] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [00:30:00] PROBLEM - Apache HTTP on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:30:00] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:30:03] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [00:30:03] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw1422.eqiad.wmnet, mw1426.eqiad.wmnet, mw1346.eqiad.wmnet, mw1448.eqiad.wmnet, mw1392.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1348.eqiad.wmnet, mw1358.eqiad.wmnet, mw1386.eqiad.wmnet, mw1342.eqiad.wmnet, mw1378.eqiad.wmnet, mw1390.eqiad.wmnet, mw1362.eqiad.wmnet, mw1381.eqiad.wmnet, mw1388.eqiad.wmnet, mw1449.eqiad.wmn [00:30:03] 43.eqiad.wmnet, mw1421.eqiad.wmnet, mw1377.eqiad.wmnet, mw1345.eqiad.wmnet, mw1375.eqiad.wmnet, mw1314.eqiad.wmnet, mw1424.eqiad.wmnet, mw1412.eqiad.wmnet, mw1408.eqiad.wmnet, mw1398.eqiad.wmnet, mw1363.eqiad.wmnet, mw1357.eqiad.wmnet, mw1423.eqiad.wmnet, mw1396.eqiad.wmnet, mw1400.eqiad.wmnet, mw1383.eqiad.wmnet, mw1427.eqiad.wmnet, mw1406.eqiad.wmnet, mw1443.eqiad.wmnet, mw1394.eqiad.wmnet, mw1382.eqiad.wmnet, mw1315.eqiad.wmnet, mw1341 [00:30:03] mnet, mw1402.eqiad.wmnet, mw1313.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1426.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1348.eqiad.wmnet, mw1390.eq https://wikitech.wikimedia.org/wiki/PyBal [00:30:03] PROBLEM - PHP7 rendering on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:30:04] PROBLEM - PHP7 rendering on mw1374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:30:04] PROBLEM - PHP7 rendering on mw1358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:30:05] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [00:30:07] PROBLEM - Apache HTTP on mw1422 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:30:07] PROBLEM - PHP7 rendering on mw1376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:30:07] PROBLEM - PHP7 rendering on mw1404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:30:11] PROBLEM - Apache HTTP on mw1408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:30:11] PROBLEM - PHP7 rendering on mw1363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:30:11] PROBLEM - PHP7 rendering on mw1382 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:30:11] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:30:13] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [00:30:13] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [00:30:14] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:30:14] PROBLEM - PHP7 rendering on mw1426 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:30:15] PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:30:15] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:30:17] PROBLEM - Apache HTTP on mw1423 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:30:19] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:30:23] PROBLEM - Apache HTTP on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:30:27] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:30:31] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:30:31] PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:30:33] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:30:35] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [00:30:35] PROBLEM - Apache HTTP on mw1394 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:30:35] PROBLEM - PHP7 rendering on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:30:39] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [00:30:40] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:30:43] PROBLEM - Apache HTTP on mw1379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:30:43] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:30:45] PROBLEM - Apache HTTP on mw1426 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:30:49] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:31:05] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:31:42] PROBLEM - LVS api eqiad port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:31:43] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:31:47] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [00:31:59] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 121 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:32:14] PROBLEM - MariaDB Replica IO: s2 #page on db1105 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:32:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [00:33:35] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:33:43] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:33:43] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:33:43] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:33:45] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:33:49] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:33:51] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:33:56] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [00:34:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:21] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICAL: Test Suggest a source title to use for translation returned the unexpected status 503 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [00:34:27] PROBLEM - PHP7 rendering on mw1427 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:34:35] PROBLEM - Apache HTTP on mw1427 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:35:35] PROBLEM - PHP7 rendering on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:35:37] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [00:35:41] PROBLEM - Apache HTTP on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:35:53] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 39 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:36:01] PROBLEM - PHP7 rendering on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:36:08] RECOVERY - MariaDB Replica IO: s2 #page on db1105 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:36:08] PROBLEM - Apache HTTP on mw1399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:36:11] PROBLEM - Apache HTTP on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:36:21] PROBLEM - PHP7 rendering on mw1399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:36:27] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [00:36:41] PROBLEM - Apache HTTP on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:36:57] PROBLEM - phpfpm_up reduced availability on alert1001 is CRITICAL: 0.7493 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:37:37] PROBLEM - PHP7 rendering on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:37:38] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:37:59] PROBLEM - Apache HTTP on mw1450 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:38:07] PROBLEM - PHP7 rendering on mw1447 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:38:15] PROBLEM - Apache HTTP on mw1434 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:38:15] PROBLEM - Apache HTTP on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:38:19] PROBLEM - PHP7 rendering on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:38:21] PROBLEM - PHP7 rendering on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:38:25] PROBLEM - PHP7 rendering on mw1433 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:38:29] PROBLEM - PHP7 rendering on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:38:31] PROBLEM - Apache HTTP on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:38:31] PROBLEM - PHP7 rendering on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:38:33] PROBLEM - PHP7 rendering on mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:38:35] PROBLEM - PHP7 rendering on mw1455 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:38:37] PROBLEM - Apache HTTP on mw1455 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:38:41] PROBLEM - Apache HTTP on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:38:41] PROBLEM - Apache HTTP on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:38:43] PROBLEM - Apache HTTP on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:38:47] PROBLEM - Apache HTTP on mw1452 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:38:53] PROBLEM - Apache HTTP on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:38:59] PROBLEM - PHP7 rendering on mw1449 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:39:07] PROBLEM - Apache HTTP on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:39:09] PROBLEM - PHP7 rendering on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:39:13] PROBLEM - Apache HTTP on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:39:19] PROBLEM - Apache HTTP on mw1409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:39:21] PROBLEM - PHP7 rendering on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:39:27] PROBLEM - PHP7 rendering on mw1454 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:39:27] PROBLEM - PHP7 rendering on mw1409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:39:27] PROBLEM - PHP7 rendering on mw1430 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:39:31] PROBLEM - Apache HTTP on mw1431 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:39:34] PROBLEM - MariaDB Replica IO: s2 #page on db1146 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:39:34] PROBLEM - PHP7 rendering on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:39:34] PROBLEM - Apache HTTP on mw1420 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:39:37] PROBLEM - Apache HTTP on mw1430 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:39:47] PROBLEM - Apache HTTP on mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:39:49] PROBLEM - PHP7 rendering on mw1329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:39:57] PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:39:57] PROBLEM - PHP7 rendering on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:39:57] PROBLEM - Apache HTTP on mw1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:39:59] PROBLEM - PHP7 rendering on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:39:59] PROBLEM - Apache HTTP on mw1397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:01] PROBLEM - Apache HTTP on mw1454 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:01] PROBLEM - PHP7 rendering on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:05] PROBLEM - PHP7 rendering on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:05] PROBLEM - PHP7 rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:05] PROBLEM - PHP7 rendering on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:05] PROBLEM - PHP7 rendering on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,php} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:40:07] PROBLEM - PHP7 rendering on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:11] PROBLEM - Apache HTTP on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:11] PROBLEM - PHP7 rendering on mw1441 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:13] PROBLEM - PHP7 rendering on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:17] PROBLEM - Apache HTTP on mw1391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:18] PROBLEM - PHP7 rendering on mw1397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:19] PROBLEM - PHP7 rendering on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:20] PROBLEM - PHP7 rendering on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:21] PROBLEM - PHP7 rendering on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:22] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:27] PROBLEM - Apache HTTP on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:28] PROBLEM - PHP7 rendering on mw1435 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:29] PROBLEM - Apache HTTP on mw1411 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:29] PROBLEM - PHP7 rendering on mw1405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:29] PROBLEM - PHP7 rendering on mw1434 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:29] PROBLEM - PHP7 rendering on mw1420 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:31] PROBLEM - PHP7 rendering on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:31] PROBLEM - Apache HTTP on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:31] PROBLEM - PHP7 rendering on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:31] PROBLEM - Apache HTTP on mw1419 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:31] PROBLEM - Apache HTTP on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:32] PROBLEM - PHP7 rendering on mw1431 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:32] PROBLEM - Apache HTTP on mw1456 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:33] PROBLEM - PHP7 rendering on mw1456 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:33] PROBLEM - PHP7 rendering on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:34] PROBLEM - PHP7 rendering on mw1452 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:34] PROBLEM - PHP7 rendering on mw1419 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:35] PROBLEM - PHP7 rendering on mw1453 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:35] PROBLEM - PHP7 rendering on mw1411 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:36] PROBLEM - Apache HTTP on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:37] PROBLEM - PHP7 rendering on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:37] PROBLEM - PHP7 rendering on mw1451 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:37] PROBLEM - Apache HTTP on mw1407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:38] PROBLEM - PHP7 rendering on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:38] PROBLEM - PHP7 rendering on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:39] PROBLEM - Apache HTTP on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:39] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:40] PROBLEM - PHP7 rendering on mw1429 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:40] PROBLEM - Apache HTTP on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:41] PROBLEM - Apache HTTP on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:41] PROBLEM - Apache HTTP on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:42] PROBLEM - Apache HTTP on mw1429 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:42] PROBLEM - PHP7 rendering on mw1436 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:43] PROBLEM - Apache HTTP on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:43] PROBLEM - Apache HTTP on mw1453 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:44] PROBLEM - Apache HTTP on mw1385 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:44] PROBLEM - PHP7 rendering on mw1391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:45] PROBLEM - Apache HTTP on mw1436 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:45] PROBLEM - Apache HTTP on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:49] PROBLEM - Apache HTTP on mw1389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:51] PROBLEM - PHP7 rendering on mw1389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:51] PROBLEM - PHP7 rendering on mw1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:51] PROBLEM - PHP7 rendering on mw1432 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:51] PROBLEM - PHP7 rendering on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:51] PROBLEM - Apache HTTP on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:52] PROBLEM - Apache HTTP on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:54] PROBLEM - MariaDB Replica SQL: s2 #page on db1146 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:40:55] PROBLEM - Apache HTTP on mw1441 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:55] PROBLEM - Apache HTTP on mw1433 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:55] RECOVERY - PHP7 rendering on mw1449 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.519 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:55] PROBLEM - PHP7 rendering on mw1387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:40:56] PROBLEM - Apache HTTP on mw1432 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:57] PROBLEM - Apache HTTP on mw1435 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:40:59] PROBLEM - PHP7 rendering on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:41:07] PROBLEM - Apache HTTP on mw1387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:41:07] PROBLEM - Apache HTTP on mw1405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:41:17] PROBLEM - Apache HTTP on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:41:19] PROBLEM - Apache HTTP on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:41:21] PROBLEM - Apache HTTP on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:41:27] PROBLEM - PHP7 rendering on mw1442 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:41:32] RECOVERY - MariaDB Replica IO: s2 #page on db1146 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:41:33] PROBLEM - PHP7 rendering on mw1393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:41:33] PROBLEM - Apache HTTP on mw1451 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:41:33] PROBLEM - Apache HTTP on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:41:33] PROBLEM - Apache HTTP on mw1442 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:41:53] PROBLEM - PHP7 rendering on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:41:55] PROBLEM - Apache HTTP on mw1393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:41:57] PROBLEM - Apache HTTP on mw1329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:41:59] PROBLEM - PHP7 rendering on mw1407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:42:00] PROBLEM - MariaDB Replica IO: s2 #page on db1105 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:42:01] RECOVERY - PHP7 rendering on mw1447 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.447 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:42:05] PROBLEM - PHP7 rendering on mw1385 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:42:05] PROBLEM - Apache HTTP on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:42:17] PROBLEM - PHP7 rendering on mw1416 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:42:17] PROBLEM - PHP7 rendering on mw1325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:42:17] PROBLEM - PHP7 rendering on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:42:21] PROBLEM - Apache HTTP on mw1325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:42:27] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [00:42:29] PROBLEM - Apache HTTP on mw1415 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:42:29] PROBLEM - PHP7 rendering on mw1395 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:42:33] PROBLEM - Apache HTTP on mw1395 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:42:35] PROBLEM - Apache HTTP on mw1417 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:42:39] PROBLEM - ircecho is not relaying messages - eqiad on irc1001 is CRITICAL: 0.8333 lt 1 https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho [00:42:45] PROBLEM - PHP7 rendering on mw1417 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:42:51] RECOVERY - Apache HTTP on mw1444 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.973 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:43:15] PROBLEM - Apache HTTP on mw1416 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:43:30] PROBLEM - LVS apaches eqiad port 80/tcp - Main MediaWiki application server cluster- appservers.svc.eqiad.wmnet IPv4 #page on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:43:31] PROBLEM - PHP7 rendering on mw1418 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:43:31] PROBLEM - PHP7 rendering on mw1414 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:43:49] PROBLEM - ircecho is not relaying messages - codfw on irc2001 is CRITICAL: 0.6667 lt 1 https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho [00:43:51] PROBLEM - PHP7 rendering on mw1415 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:43:53] RECOVERY - Apache HTTP on mw1450 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.081 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:43:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [00:44:15] PROBLEM - Apache HTTP on mw1418 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:44:15] PROBLEM - Apache HTTP on mw1414 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:44:44] PROBLEM - MariaDB read only s2 on db1146 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [00:45:25] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:45:27] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:45:27] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:45:31] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:45:35] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:45:35] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:45:37] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:45:38] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:45:43] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:45:43] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:45:43] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 59.67 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:45:47] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:45:51] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:45:58] RECOVERY - MariaDB Replica IO: s2 #page on db1105 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:45:58] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:46:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:03] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:46:03] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:46:03] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:46:17] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:46:23] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:46:29] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:46:29] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:46:29] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:46:31] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:46:59] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:47:01] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:47:01] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:47:01] PROBLEM - PyBal backends health check on lvs1013 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp1081.eqiad.wmnet, cp1075.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1083.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled https://wi [00:47:01] ikimedia.org/wiki/PyBal [00:47:05] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:47:14] PROBLEM - Varnish has reduced HTTP availability #page on alert1001 is CRITICAL: job=varnish-text https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [00:47:15] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:47:19] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:47:27] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:47:49] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:47:49] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:47:49] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:47:55] PROBLEM - Apache HTTP on mw1444 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [00:47:57] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:47:58] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:47:59] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:47:59] PROBLEM - LVS eventgate-analytics-external eqiad port 4692/tcp - EventGate analytics external endpoint- eventgate-analytics-external.svc.eqiad.wmnet and intake-analytics.wikimedia.org IPv4 on eventgate-analytics-external.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.52 and port 4692: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:48:19] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:48:19] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [00:48:27] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:48:29] RECOVERY - Apache HTTP on mw1417 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.784 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:29] RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.321 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:31] RECOVERY - PHP7 rendering on mw1453 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.356 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:31] RECOVERY - PHP7 rendering on mw1391 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.946 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:31] RECOVERY - Apache HTTP on mw1453 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.015 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:31] RECOVERY - Apache HTTP on mw1385 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:31] RECOVERY - Apache HTTP on mw1407 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.272 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:32] RECOVERY - Apache HTTP on mw1436 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.499 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:32] RECOVERY - PHP7 rendering on mw1436 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.788 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:33] RECOVERY - PHP7 rendering on mw1429 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.163 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:33] RECOVERY - Apache HTTP on mw1429 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.167 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:34] RECOVERY - PHP7 rendering on mw1346 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.540 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:34] RECOVERY - Apache HTTP on mw1392 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.771 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:35] RECOVERY - PHP7 rendering on mw1326 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.829 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:35] RECOVERY - Apache HTTP on mw1326 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.834 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:36] RECOVERY - Apache HTTP on mw1321 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.880 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:36] RECOVERY - Apache HTTP on mw1316 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:37] RECOVERY - PHP7 rendering on mw1315 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.127 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:37] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:38] RECOVERY - Apache HTTP on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.948 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:38] RECOVERY - Apache HTTP on mw1327 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.972 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:39] RECOVERY - Apache HTTP on mw1322 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.947 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:39] RECOVERY - Apache HTTP on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.749 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:40] RECOVERY - PHP7 rendering on mw1406 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.805 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:40] RECOVERY - PHP7 rendering on mw1425 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.916 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:41] RECOVERY - PHP7 rendering on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.655 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:41] RECOVERY - PHP7 rendering on mw1417 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.705 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:42] RECOVERY - PHP7 rendering on mw1347 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.333 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:42] RECOVERY - Apache HTTP on mw1452 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.408 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:42] 10SRE, 10Traffic: Wikimedia sites down 18 Sept 2021 - https://phabricator.wikimedia.org/T291311 (10Ivi104) can confirm, all projects, all languages. [00:48:43] RECOVERY - PHP7 rendering on mw1312 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:43] RECOVERY - Apache HTTP on mw1389 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.941 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:44] RECOVERY - PHP7 rendering on mw1389 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.324 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:45] RECOVERY - Apache HTTP on mw1428 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:45] RECOVERY - PHP7 rendering on mw1422 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.124 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:45] RECOVERY - PHP7 rendering on mw1401 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.786 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:46] RECOVERY - PHP7 rendering on mw1432 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.821 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:46] RECOVERY - PHP7 rendering on mw1323 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.914 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:47] RECOVERY - PHP7 rendering on mw1387 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.433 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:47] RECOVERY - Apache HTTP on mw1441 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.175 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:48] PROBLEM - MariaDB Replica SQL: s2 #page on db1146 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:48:48] RECOVERY - Apache HTTP on mw1435 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.884 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:49] RECOVERY - Apache HTTP on mw1433 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.386 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:49] RECOVERY - Apache HTTP on mw1432 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.403 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:50] RECOVERY - Apache HTTP on mw1424 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:50] RECOVERY - PHP7 rendering on mw1400 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.057 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:51] RECOVERY - PHP7 rendering on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.409 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:53] RECOVERY - Apache HTTP on mw1405 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:53] RECOVERY - Apache HTTP on mw1387 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:53] RECOVERY - Apache HTTP on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.826 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:57] RECOVERY - Apache HTTP on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:48:58] RECOVERY - PHP7 rendering on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:48:59] PROBLEM - eventgate-analytics-external LVS eqiad on eventgate-analytics-external.svc.eqiad.wmnet is CRITICAL: /robots.txt (robots.txt check) is CRITICAL: Test robots.txt check returned the unexpected status 503 (expecting: 200): / (root with no query params) is CRITICAL: Test root with no query params returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [00:48:59] RECOVERY - PHP7 rendering on mw1443 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 2.477 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:00] RECOVERY - PyBal backends health check on lvs1013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:49:00] RECOVERY - Apache HTTP on mw1345 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.940 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:01] RECOVERY - Apache HTTP on mw1400 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.694 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:01] RECOVERY - PHP7 rendering on mw1402 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.846 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:01] RECOVERY - PHP7 rendering on mw1314 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.976 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:01] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.988 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:01] RECOVERY - Apache HTTP on mw1416 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:03] RECOVERY - Apache HTTP on mw1377 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:03] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:05] RECOVERY - PHP7 rendering on mw1398 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 2.905 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:05] RECOVERY - PHP7 rendering on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.356 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:05] RECOVERY - PHP7 rendering on mw1341 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.405 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:05] RECOVERY - Apache HTTP on mw1382 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.567 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:05] RECOVERY - Apache HTTP on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.577 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:06] RECOVERY - Apache HTTP on mw1323 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:06] RECOVERY - Apache HTTP on mw1409 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:07] RECOVERY - PHP7 rendering on mw1424 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:07] RECOVERY - Apache HTTP on mw1330 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:08] RECOVERY - PHP7 rendering on mw1345 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:08] RECOVERY - Apache HTTP on mw1443 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:09] RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.798 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:09] RECOVERY - Apache HTTP on mw1331 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:11] RECOVERY - PHP7 rendering on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:11] RECOVERY - PHP7 rendering on mw1340 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.588 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:11] RECOVERY - PHP7 rendering on mw1409 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:12] RECOVERY - PHP7 rendering on mw1430 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:12] RECOVERY - PHP7 rendering on mw1454 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:13] RECOVERY - Apache HTTP on mw1374 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.820 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:13] RECOVERY - PHP7 rendering on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 2.976 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:16] RECOVERY - LVS apaches eqiad port 80/tcp - Main MediaWiki application server cluster- appservers.svc.eqiad.wmnet IPv4 #page on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 302 Found - 656 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:49:16] RECOVERY - PHP7 rendering on mw1442 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:16] RECOVERY - PHP7 rendering on mw1410 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:16] RECOVERY - PHP7 rendering on mw1412 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:17] RECOVERY - PHP7 rendering on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:18] 10SRE, 10Traffic: Wikimedia sites down 18 Sept 2021 - https://phabricator.wikimedia.org/T291311 (10JJMC89) [00:49:19] RECOVERY - Apache HTTP on mw1431 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:19] RECOVERY - PHP7 rendering on mw1418 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:19] RECOVERY - PHP7 rendering on mw1414 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:20] RECOVERY - LVS api eqiad port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 24863 bytes in 1.717 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [00:49:20] RECOVERY - Apache HTTP on mw1451 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:20] RECOVERY - PHP7 rendering on mw1393 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:20] RECOVERY - PHP7 rendering on mw1362 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:21] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [00:49:21] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [00:49:22] RECOVERY - PHP7 rendering on mw1331 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:22] RECOVERY - Apache HTTP on mw1442 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:23] RECOVERY - Apache HTTP on mw1420 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:23] RECOVERY - PHP7 rendering on mw1390 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:24] RECOVERY - Apache HTTP on mw1319 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:24] RECOVERY - PHP7 rendering on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:25] RECOVERY - PHP7 rendering on mw1444 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:25] RECOVERY - Apache HTTP on mw1412 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.874 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:26] RECOVERY - PHP7 rendering on mw1394 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:26] RECOVERY - PHP7 rendering on mw1392 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:27] RECOVERY - PHP7 rendering on mw1408 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:27] RECOVERY - Apache HTTP on mw1342 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:28] RECOVERY - PHP7 rendering on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:28] RECOVERY - PHP7 rendering on mw1357 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:29] RECOVERY - Apache HTTP on mw1360 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.377 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:29] RECOVERY - PHP7 rendering on mw1379 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.402 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:30] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [00:49:30] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:31] 10SRE, 10Traffic: Wikimedia sites down 18 Sept 2021 - https://phabricator.wikimedia.org/T291311 (10Urbanecm) This is known and actively being investigated. Please stand by. [00:49:31] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:31] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:32] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:32] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:33] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:33] RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.208 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:34] RECOVERY - Apache HTTP on mw1430 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:34] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:35] RECOVERY - Apache HTTP on mw1363 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:35] RECOVERY - Apache HTTP on mw1376 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.246 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:36] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:49:36] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:49:37] RECOVERY - PHP7 rendering on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.298 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:37] RECOVERY - Apache HTTP on mw1356 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.381 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:38] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [00:49:38] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [00:49:39] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:39] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:40] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:40] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:41] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [00:49:41] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [00:49:42] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:49:42] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:49:43] RECOVERY - Apache HTTP on mw1344 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.703 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:43] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:44] RECOVERY - Apache HTTP on mw1403 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:49:44] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:45] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:49:45] RECOVERY - PHP7 rendering on mw1358 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:46] RECOVERY - PHP7 rendering on mw1342 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:46] RECOVERY - PHP7 rendering on mw1415 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:47] RECOVERY - PHP7 rendering on mw1329 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:49:47] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:50:33] 10SRE, 10Traffic: Wikimedia sites down 18 Sept 2021 - https://phabricator.wikimedia.org/T291311 (10Sakura_emad) Confirm on Meta, ckb, en [https://isup.me/] [00:50:41] 10SRE, 10Traffic: Wikimedia sites down 18 Sept 2021 - https://phabricator.wikimedia.org/T291311 (10Tol) From T291312, some other errors I have experienced: * `upstream connect error or disconnect/reset before headers. reset reason: connection failure` * `upstream connect error or disconnect/reset before header... [00:51:04] RECOVERY - eventgate-analytics-external LVS eqiad on eventgate-analytics-external.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [00:51:04] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:51:09] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [00:51:09] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:51:13] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:51:26] 10SRE, 10Traffic: Wikimedia sites down 18 Sept 2021 - https://phabricator.wikimedia.org/T291311 (10MSG17) Some stats (thanks TNT on the unofficial Wikimedia Discord: https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1631915084782&to=1631925884782&var-site=All&var-cache_type=text&var-cache... [00:51:29] 10SRE, 10Traffic: Wikimedia sites down 18 Sept 2021 - https://phabricator.wikimedia.org/T291311 (10Urbanecm) p:05Triage→03Unbreak! [00:51:41] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:51:41] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1085 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:51:41] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:51:44] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [00:51:45] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 83.89 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [00:51:49] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1079 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:51:49] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:51:49] RECOVERY - ircecho is not relaying messages - codfw on irc2001 is OK: (C)1 lt (W)2 lt 9.75 https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho [00:51:51] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:52:05] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [00:52:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:52:05] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [00:52:09] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1081 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [00:52:11] 10SRE, 10Traffic, 10Wikimedia-Incident: Wikimedia sites down 18 Sept 2021 - https://phabricator.wikimedia.org/T291311 (10AntiCompositeNumber) [00:52:17] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [00:52:21] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:52:32] 10SRE, 10Traffic, 10Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (10Peachey88) [00:52:49] RECOVERY - phpfpm_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 0.9858 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:53:16] RECOVERY - Varnish has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [00:53:29] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:43] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:53:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [00:55:28] PROBLEM - MariaDB Replica IO: s2 #page on db1146 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:55:33] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [00:55:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:56:27] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:56:42] RECOVERY - MariaDB Replica SQL: s2 #page on db1146 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:57:24] RECOVERY - MariaDB Replica IO: s2 #page on db1146 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:00:51] 10SRE, 10Traffic, 10Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (10Tol) Enwiki, wikidata, and meta are all loading normally for me now (frontend & API). [01:01:05] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.23/includes/libs/rdbms/database/Database.php: (no justification provided) (duration: 01m 03s) [01:01:05] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [01:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:03:10] PROBLEM - MariaDB Replica IO: s2 #page on db1146 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:05:06] RECOVERY - MariaDB Replica IO: s2 #page on db1146 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:06:17] RECOVERY - MariaDB read only s2 on db1146 is OK: Version 10.4.15-MariaDB-log, Uptime 13451356s, read_only: True, event_scheduler: True, 3860.22 QPS, connection latency: 0.004360s, query latency: 0.000515s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [01:06:33] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.0274 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [01:06:43] 10SRE, 10Traffic, 10Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (10Thryduulf) en.wp (at least) was agonisingly slow shortly before it all went down, and even after en.wp came back up I wasn't able to allow Oauth access to login here. [01:06:44] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.8238 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [01:07:07] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:07:17] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.03226 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [01:07:22] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.8141 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [01:07:57] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:08:05] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [01:10:21] 10SRE, 10Traffic, 10Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (10Ivi104) >>! In T291311#7363415, @Thryduulf wrote: > en.wp (at least) was agonisingly slow shortly before it all went down, and even after en.wp came back up I wasn't able to allow Oauth acce... [01:11:28] 10SRE, 10Traffic, 10Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (10LD) Can confirm : fr.wp went slow around 02:10 AM CEST, came back up around 03:00 AM CEST. No abnormal filtered detections from fr.wp's AbuseFilter. [01:12:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [01:45:13] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:47:45] 10SRE, 10Traffic, 10Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (10RLazarus) 05Open→03Resolved a:03RLazarus This should be fully resolved as of about 1:10 UTC, sorry for the trouble and thanks for all the reports. Because the root cause was a DoS vec... [01:47:49] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.23/includes/libs/rdbms/database/Database.php: (no justification provided) (duration: 00m 57s) [01:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:50:43] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:05:07] 10SRE, 10Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10Krinkle) [02:06:13] 10SRE, 10Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10Krinkle) [02:07:17] 10SRE, 10Traffic, 10MW-1.35-notes (1.35.0-wmf.40; 2020-07-07), 10Patch-For-Review, and 2 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Krinkle) [02:07:43] 10SRE, 10Traffic, 10MW-1.35-notes (1.35.0-wmf.40; 2020-07-07), 10Patch-For-Review, and 2 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Krinkle) [03:58:51] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [04:00:07] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [04:01:49] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [04:04:47] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 243.98 ms [04:12:49] (03PS1) 10Krinkle: mediawiki: Add request ID to php-wmerrors error page [puppet] - 10https://gerrit.wikimedia.org/r/721923 (https://phabricator.wikimedia.org/T291192) [04:12:51] (03PS1) 10Krinkle: mediawiki: Set php-wmerrors reqId to "unknown" for Logstash [puppet] - 10https://gerrit.wikimedia.org/r/721924 (https://phabricator.wikimedia.org/T291192) [04:12:53] (03PS1) 10Krinkle: mediawiki: Move statsd call from php-wmerrors page to end of script [puppet] - 10https://gerrit.wikimedia.org/r/721925 [06:16:10] 10SRE, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, 10Wikimedia-Fundraising, and 6 others: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (10Nikerabbit) Looking at Logstash, we h... [08:28:52] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [08:30:47] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 16 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [09:27:27] 10SRE, 10Traffic, 10Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (10Nehaoua) since 23:00 UTC i can't Save any modification and always this messages, i check abusefilter = 0 result {F34646785} {F34646789} [11:15:51] 10SRE, 10DNS, 10Software-Licensing: Add LICENSE to operations/dns scripts - https://phabricator.wikimedia.org/T291323 (10Majavah) [13:15:23] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:17:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:31:15] (03PS1) 10Lucas Werkmeister: Perform rolling restarts on kubernetes [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/721989 (https://phabricator.wikimedia.org/T290833) [15:32:00] (03CR) 10jerkins-bot: [V: 04-1] Perform rolling restarts on kubernetes [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/721989 (https://phabricator.wikimedia.org/T290833) (owner: 10Lucas Werkmeister) [15:34:25] (03PS2) 10Lucas Werkmeister: Perform rolling restarts on kubernetes [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/721989 (https://phabricator.wikimedia.org/T290833) [15:35:16] (03CR) 10Lucas Werkmeister: "I’ve tested this in the lexeme-forms tool; it works and produces no nonzero exit codes with the following test command:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/721989 (https://phabricator.wikimedia.org/T290833) (owner: 10Lucas Werkmeister) [15:36:35] (03CR) 10Lucas Werkmeister: Perform rolling restarts on kubernetes (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/721989 (https://phabricator.wikimedia.org/T290833) (owner: 10Lucas Werkmeister) [15:40:04] (03CR) 10Majavah: "My main worry is that this leaves empty replicaset objects laying around:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/721989 (https://phabricator.wikimedia.org/T290833) (owner: 10Lucas Werkmeister) [16:01:18] (03CR) 10Lucas Werkmeister: Perform rolling restarts on kubernetes (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/721989 (https://phabricator.wikimedia.org/T290833) (owner: 10Lucas Werkmeister) [16:28:54] (03PS3) 10Lucas Werkmeister: Perform rolling restarts on kubernetes [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/721989 (https://phabricator.wikimedia.org/T290833) [16:29:53] (03CR) 10Lucas Werkmeister: Perform rolling restarts on kubernetes (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/721989 (https://phabricator.wikimedia.org/T290833) (owner: 10Lucas Werkmeister) [16:31:05] (03CR) 10Lucas Werkmeister: Perform rolling restarts on kubernetes (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/721989 (https://phabricator.wikimedia.org/T290833) (owner: 10Lucas Werkmeister) [16:38:29] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:42:31] (03CR) 10SDineshKumar: [C: 03+1] "+1 Volunteers outside org are now able to clone from gitlab as well." [puppet] - 10https://gerrit.wikimedia.org/r/719363 (https://phabricator.wikimedia.org/T290259) (owner: 10Brennen Bearnes) [16:44:27] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [19:22:05] (03PS1) 10Ladsgroup: snapshot: Drop flaggedimages from dumps [puppet] - 10https://gerrit.wikimedia.org/r/722026 (https://phabricator.wikimedia.org/T290340) [19:33:36] (03CR) 10Ladsgroup: [C: 03+1] rancid: convert crons to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/721854 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:18:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:19:41] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:43:41] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:44:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down