[00:18:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:47] SRE, ops-codfw, DC-Ops: Netbox Errors - https://phabricator.wikimedia.org/T290362 (Papaul) @wiki_willy thanks will work on this next week [00:23:31] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:26:21] (PS1) Krinkle: ci: Fix profile::ci to be compatible with new empheral storage [puppet] - https://gerrit.wikimedia.org/r/717732 (https://phabricator.wikimedia.org/T284774) [00:32:39] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:44:05] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:55] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:11:05] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:16:51] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:36:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:41:47] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:01:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:49] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:22:01] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:27:51] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:34:17] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 1890 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:34:45] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 1314 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:34:53] PROBLEM - Apache HTTP on mw2319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:34:55] PROBLEM - PHP7 rendering on mw2262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:34:57] PROBLEM - Apache HTTP on mw2366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:03] PROBLEM - Apache HTTP on mw2332 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:05] PROBLEM - Apache HTTP on mw2288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:05] PROBLEM - PHP7 rendering on mw2289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:11] PROBLEM - Apache HTTP on mw2396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:11] PROBLEM - PHP7 rendering on mw2362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:11] PROBLEM - PHP7 rendering on mw2295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:13] PROBLEM - PHP7 rendering on mw2404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:19] PROBLEM - PHP7 rendering on mw2293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:19] PROBLEM - PHP7 rendering on mw2298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:19] PROBLEM - Apache HTTP on mw2401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:21] PROBLEM - Apache HTTP on mw2328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:21] PROBLEM - Apache HTTP on mw2298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:21] PROBLEM - PHP7 rendering on mw2323 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:23] PROBLEM - PHP7 rendering on mw2332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:23] PROBLEM - Apache HTTP on mw2304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:23] PROBLEM - PHP7 rendering on mw2320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:23] PROBLEM - Apache HTTP on mw2403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:23] PROBLEM - PHP7 rendering on mw2354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:24] PROBLEM - PHP7 rendering on mw2396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:24] PROBLEM - PHP7 rendering on mw2288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:25] PROBLEM - Apache HTTP on mw2253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:25] PROBLEM - PHP7 rendering on mw2261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:27] PROBLEM - PHP7 rendering on mw2291 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:27] PROBLEM - Apache HTTP on mw2317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [02:35:29] PROBLEM - Apache HTTP on mw2284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:29] PROBLEM - Apache HTTP on mw2370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:29] PROBLEM - PHP7 rendering on mw2287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:31] PROBLEM - PHP7 rendering on mw2319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:31] PROBLEM - PHP7 rendering on mw2297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:33] PROBLEM - PHP7 rendering on mw2292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:33] PROBLEM - Apache HTTP on mw2400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:33] PROBLEM - PHP7 rendering on mw2400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:33] PROBLEM - PHP7 rendering on mw2317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:33] PROBLEM - PHP7 rendering on mw2302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:34] PROBLEM - PHP7 rendering on mw2360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:34] 
PROBLEM - PHP7 rendering on mw2405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:35] PROBLEM - Apache HTTP on mw2290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:35] PROBLEM - PHP7 rendering on mw2366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:37] PROBLEM - PHP7 rendering on mw2284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:37] PROBLEM - Apache HTTP on mw2324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:38] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at codfw #page on alert1001 is CRITICAL: 0.141 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver [02:35:38] PROBLEM - PHP7 rendering on mw2322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:38] PROBLEM - PHP7 rendering on mw2370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:38] PROBLEM - Apache HTTP on mw2286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:39] PROBLEM - PHP7 rendering on mw2402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:42] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at codfw #page on 
alert1001 is CRITICAL: 0.04167 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [02:35:42] PROBLEM - Apache HTTP on mw2321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:47] PROBLEM - Apache HTTP on mw2350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:51] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:35:51] PROBLEM - Apache HTTP on mw2404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:51] PROBLEM - Apache HTTP on mw2320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:51] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:35:51] PROBLEM - Apache HTTP on mw2354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:53] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [02:35:55] PROBLEM - Apache HTTP on mw2334 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [02:35:55] PROBLEM - PHP7 rendering on mw2352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:55] PROBLEM - PHP7 rendering on mw2401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:55] PROBLEM - Apache HTTP on mw2308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:55] PROBLEM - PHP7 rendering on mw2328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:57] PROBLEM - PHP7 rendering on mw2286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:57] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:35:57] PROBLEM - Apache HTTP on mw2261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:57] PROBLEM - Apache HTTP on mw2356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:57] PROBLEM - Apache HTTP on mw2352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:59] PROBLEM - Apache HTTP on mw2294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:59] PROBLEM - Apache HTTP on mw2360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [02:35:59] PROBLEM - PHP7 rendering on mw2296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:59] PROBLEM - Apache HTTP on mw2330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:35:59] PROBLEM - Apache HTTP on mw2402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:00] PROBLEM - Apache HTTP on mw2405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:01] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [02:36:01] PROBLEM - PHP7 rendering on mw2403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:01] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:36:03] PROBLEM - PHP7 rendering on mw2252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:07] PROBLEM - PHP7 rendering on mw2318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:07] PROBLEM - PHP7 rendering on mw2350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:07] PROBLEM - PHP7 rendering on mw2368 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:09] PROBLEM - PHP7 rendering on mw2372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:09] PROBLEM - Apache HTTP on mw2285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:11] PROBLEM - PHP7 rendering on mw2306 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:11] PROBLEM - Apache HTTP on mw2368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:21] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:36:23] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:36:24] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [02:36:25] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:36:25] PROBLEM - restbase endpoints health on 
restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:36:25] PROBLEM - Apache HTTP on mw2297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:25] PROBLEM - Apache HTTP on mw2399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:25] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:36:26] oh shit [02:36:27] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:36:27] PROBLEM - Apache HTTP on mw2300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:27] PROBLEM - PHP7 rendering on mw2294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:29] PROBLEM - PHP7 rendering on mw2308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:29] PROBLEM - Apache HTTP on mw2323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:29] PROBLEM - PHP7 rendering on mw2398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:29] PROBLEM - Apache HTTP on 
mw2293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:31] PROBLEM - PHP7 rendering on mw2356 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:31] PROBLEM - PHP7 rendering on mw2285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:31] PROBLEM - Apache HTTP on mw2306 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:31] PROBLEM - Apache HTTP on mw2326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:31] PROBLEM - PHP7 rendering on mw2290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:32] PROBLEM - PHP7 rendering on mw2321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:32] PROBLEM - PHP7 rendering on mw2283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:33] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:36:33] PROBLEM - PHP7 rendering on mw2358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:34] PROBLEM - Apache HTTP on mw2287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers 
[02:36:35] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:36:37] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw2284.codfw.wmnet, mw2376.codfw.wmnet, mw2396.codfw.wmnet, mw2286.codfw.wmnet, mw2328.codfw.wmnet, mw2334.codfw.wmnet, mw2360.codfw.wmnet, mw2326.codfw.wmnet, mw2300.codfw.wmnet, mw2288.codfw.wmnet, mw2252.codfw.wmnet, mw2261.codfw.wmnet, mw2322.codfw.wmnet, mw2294.codfw.wmnet, mw2253.codfw.wmnet, mw2324.codfw.wmnet, mw2358.codfw.wmn [02:36:37] 98.codfw.wmnet, mw2356.codfw.wmnet, mw2368.codfw.wmnet, mw2330.codfw.wmnet, mw2364.codfw.wmnet, mw2289.codfw.wmnet, mw2306.codfw.wmnet, mw2352.codfw.wmnet, mw2400.codfw.wmnet, mw2405.codfw.wmnet, mw2297.codfw.wmnet, mw2304.codfw.wmnet, mw2295.codfw.wmnet, mw2370.codfw.wmnet, mw2399.codfw.wmnet, mw2293.codfw.wmnet, mw2291.codfw.wmnet, mw2362.codfw.wmnet, mw2332.codfw.wmnet, mw2366.codfw.wmnet, mw2319.codfw.wmnet, mw2285.codfw.wmnet, mw2290 [02:36:37] mnet, mw2374.codfw.wmnet, mw2404.codfw.wmnet, mw2318.codfw.wmnet, mw2292.codfw.wmnet, mw2323.codfw.wmnet, mw2354.codfw.wmnet, mw2350.codfw.wmnet, mw2299.codfw.wmnet, mw2262.codfw.wmnet, https://wikitech.wikimedia.org/wiki/PyBal [02:36:39] PROBLEM - Apache HTTP on mw2291 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:41] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was 
received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not fou [02:36:41] nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [02:36:41] PROBLEM - Apache HTTP on mw2322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:43] PROBLEM - PHP7 rendering on mw2330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:43] PROBLEM - PHP7 rendering on mw2326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:45] PROBLEM - Apache HTTP on mw2398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:45] PROBLEM - Apache HTTP on mw2358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:47] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve [02:36:47] metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/summary/{title} (Get [02:36:47] for test page) timed out before a response was received: 
/{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:36:47] PROBLEM - Apache HTTP on mw2283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:49] PROBLEM - Apache HTTP on mw2295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:49] PROBLEM - Apache HTTP on mw2302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:49] PROBLEM - PHP7 rendering on mw2300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:51] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POS [02:36:53] PROBLEM - PHP7 rendering on mw2299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:53] PROBLEM - Apache HTTP on mw2296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:57] PROBLEM - Apache HTTP on mw2289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:57] PROBLEM - PHP7 rendering on mw2334 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:36:57] PROBLEM - Apache HTTP on mw2292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:57] PROBLEM - Apache HTTP on mw2362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:36:59] PROBLEM - Apache HTTP on mw2318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:37:03] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is CRITICAL: 0.9242 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [02:37:03] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [02:37:03] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is CRITICAL: 0.9844 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [02:37:03] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:05] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: 
/{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for Apr [02:37:05] 016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was r [02:37:05] /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [02:37:05] PROBLEM - PHP7 rendering on mw2324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:37:07] PROBLEM - PHP7 rendering on mw2397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:37:11] PROBLEM - Apache HTTP on mw2397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:37:13] PROBLEM - Apache HTTP on mw2364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:37:13] PROBLEM - PHP7 rendering on mw2364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:37:13] PROBLEM - Apache HTTP on mw2372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:37:14] PROBLEM - PHP7 rendering on mw2304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:37:14] PROBLEM - PHP7 rendering on mw2253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:37:14] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:14] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:15] PROBLEM - Apache HTTP on mw2299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:37:19] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:19] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:19] PROBLEM - Apache HTTP on mw2376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:37:19] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw2284.codfw.wmnet, mw2376.codfw.wmnet, mw2396.codfw.wmnet, mw2286.codfw.wmnet, mw2328.codfw.wmnet, 
mw2334.codfw.wmnet, mw2364.codfw.wmnet, mw2326.codfw.wmnet, mw2298.codfw.wmnet, mw2372.codfw.wmnet, mw2252.codfw.wmnet, mw2261.codfw.wmnet, mw2322.codfw.wmnet, mw2321.codfw.wmnet, mw2294.codfw.wmnet, mw2253.codfw.wmnet, mw2324.codfw.wmn [02:37:19] 96.codfw.wmnet, mw2283.codfw.wmnet, mw2358.codfw.wmnet, mw2302.codfw.wmnet, mw2320.codfw.wmnet, mw2370.codfw.wmnet, mw2368.codfw.wmnet, mw2330.codfw.wmnet, mw2308.codfw.wmnet, mw2289.codfw.wmnet, mw2352.codfw.wmnet, mw2405.codfw.wmnet, mw2304.codfw.wmnet, mw2295.codfw.wmnet, mw2399.codfw.wmnet, mw2293.codfw.wmnet, mw2317.codfw.wmnet, mw2402.codfw.wmnet, mw2332.codfw.wmnet, mw2403.codfw.wmnet, mw2366.codfw.wmnet, mw2319.codfw.wmnet, mw2285 [02:37:19] mnet, mw2290.codfw.wmnet, mw2374.codfw.wmnet, mw2404.codfw.wmnet, mw2350.codfw.wmnet, mw2354.codfw.wmnet, mw2398.codfw.wmnet, mw2299.codfw.wmnet, mw2262.codfw.wmnet are marked down but https://wikitech.wikimedia.org/wiki/PyBal [02:37:19] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/description/{title} (Get description for test page) is CRITICAL: Test Get description for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media [02:37:20] m test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/m [02:37:20] ctions/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 
(expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get prev [02:37:21] le HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:37:21] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:22] PROBLEM - Apache HTTP on mw2262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:37:23] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [02:37:23] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:25] PROBLEM - PHP7 rendering on mw2376 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:37:25] PROBLEM - PHP7 rendering on mw2399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:37:27] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:29] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:29] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:29] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:31] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [02:37:35] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:35] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:37] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:37] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:39] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featu [02:37:39] e data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random [02:37:39] title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [02:37:39] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:39] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a respo [02:37:39] received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [02:37:40] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [02:37:40] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:41] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [02:37:41] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:43] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [02:37:45] PROBLEM - Restbase edge codfw on 
text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [02:37:49] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:37:53] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [02:37:53] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [02:37:57] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:38:03] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:38:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [02:38:27] RECOVERY - PHP7 rendering on mw2308 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.388 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:38:35] PROBLEM - LVS api codfw port 80/tcp - 
MediaWiki API cluster- api.svc.codfw.wmnet IPv4 #page on api.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:38:40] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [02:38:42] PROBLEM - Varnish has reduced HTTP availability #page on alert1001 is CRITICAL: job=varnish-text https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [02:38:55] PROBLEM - PHP7 rendering on mw2374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:38:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_citoid_cluster_codfw,swagger_check_eventgate_analytics_external_cluster_codfw} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:39:15] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 504 (expecting: 200) https://wi [02:39:15] ikimedia.org/wiki/CX [02:39:37] PROBLEM - High average POST latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=POST [02:40:35] PROBLEM - LVS eventgate-analytics-external codfw port 4692/tcp - EventGate analytics external endpoint- eventgate-analytics-external.svc.codfw.wmnet and intake-analytics.wikimedia.org IPv4 on eventgate-analytics-external.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.52 and port 4692: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:42:39] PROBLEM - phpfpm_up reduced availability on alert1001 is CRITICAL: 0.6903 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:43:13] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [02:43:19] PROBLEM - Apache HTTP on mw2252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:43:27] PROBLEM - PHP7 rendering on mw2308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:43:29] PROBLEM - Apache HTTP on mw2374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:44:43] PROBLEM - PHP7 rendering on mw2409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:46:29] PROBLEM - phpfpm_up reduced availability on alert1001 is CRITICAL: 0.7642 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:47:21] RECOVERY - Apache HTTP on mw2374 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.235 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:48:13] SRE, Traffic, Wikidata, wdwb-tech, Wikimedia-Incident: Can not access Wikidata - https://phabricator.wikimedia.org/T290373 (Bugreporter) [02:48:31] RECOVERY - PHP7 rendering on mw2409 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.370 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:48:35] PROBLEM - kartotherian endpoints health on maps2010 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [02:50:31] RECOVERY - 
kartotherian endpoints health on maps2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [02:50:35] RECOVERY - PHP7 rendering on mw2374 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.083 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:50:59] RECOVERY - Apache HTTP on mw2376 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.898 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:51:05] RECOVERY - PHP7 rendering on mw2376 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.605 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:52:17] PROBLEM - Apache HTTP on mw2374 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:53:37] RECOVERY - PHP7 rendering on mw2252 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.433 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:53:43] RECOVERY - PHP7 rendering on mw2368 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.471 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:53:45] RECOVERY - PHP7 rendering on mw2318 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.307 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:53:45] RECOVERY - PHP7 rendering on mw2350 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.844 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:53:47] RECOVERY - Apache HTTP on mw2368 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.868 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:53:49] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed 
out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [02:54:07] RECOVERY - Apache HTTP on mw2374 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.983 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:09] RECOVERY - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.codfw.wmnet IPv4 #page on api.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 24867 bytes in 4.981 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:54:10] PROBLEM - MariaDB Replica Lag: s4 #page on db2137 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:54:10] RECOVERY - PHP7 rendering on mw2321 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.203 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:23] RECOVERY - Apache HTTP on mw2398 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.695 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:27] RECOVERY - Apache HTTP on mw2283 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.986 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:31] RECOVERY - Apache HTTP on mw2296 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.582 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:33] RECOVERY - PHP7 rendering on mw2299 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.981 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:33] RECOVERY - Apache HTTP on mw2366 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.757 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:45] RECOVERY - Apache HTTP on mw2288 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.175 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [02:54:45] RECOVERY - Apache HTTP on mw2396 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.411 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:45] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:54:47] RECOVERY - PHP7 rendering on mw2253 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.357 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:49] RECOVERY - PHP7 rendering on mw2295 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.766 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:51] RECOVERY - PHP7 rendering on mw2362 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.057 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:51] RECOVERY - Apache HTTP on mw2252 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:53] RECOVERY - PHP7 rendering on mw2304 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.618 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:53] RECOVERY - PHP7 rendering on mw2404 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.732 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:53] RECOVERY - Apache HTTP on mw2364 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.847 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:53] RECOVERY - PHP7 rendering on mw2364 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.840 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:53] RECOVERY 
- PHP7 rendering on mw2354 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.144 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:54] RECOVERY - Apache HTTP on mw2262 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:54] RECOVERY - PHP7 rendering on mw2298 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.359 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:55] RECOVERY - Apache HTTP on mw2298 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.278 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:55] RECOVERY - PHP7 rendering on mw2396 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:57] RECOVERY - Apache HTTP on mw2403 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.827 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:57] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:54:57] RECOVERY - Apache HTTP on mw2253 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:58] SRE, Traffic, Wikidata, wdwb-tech, Wikimedia-Incident: Can not access Wikidata - https://phabricator.wikimedia.org/T290373 (Sunny00217) It doesn't only show wikidata, it can show at zhwiki ( https://zh.wikipedia.org/w/api.php?action=query&format=json&meta=tokens&type=csrf ) and commons ( http... 
[02:54:59] RECOVERY - Apache HTTP on mw2401 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.342 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:59] RECOVERY - PHP7 rendering on mw2288 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.124 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:59] RECOVERY - Apache HTTP on mw2284 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.161 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:54:59] RECOVERY - PHP7 rendering on mw2323 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:54:59] RECOVERY - PHP7 rendering on mw2293 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.968 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:00] RECOVERY - PHP7 rendering on mw2320 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.426 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:01] RECOVERY - Apache HTTP on mw2328 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.973 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:01] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:55:01] RECOVERY - PHP7 rendering on mw2261 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.540 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:03] RECOVERY - PHP7 rendering on mw2291 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:03] RECOVERY - PHP7 rendering on mw2308 is OK: HTTP OK: HTTP/1.1 
302 Found - 650 bytes in 3.376 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:03] RECOVERY - Apache HTTP on mw2370 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.308 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:05] RECOVERY - PHP7 rendering on mw2399 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.742 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:05] RECOVERY - Apache HTTP on mw2400 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.298 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:05] RECOVERY - PHP7 rendering on mw2400 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.309 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:07] RECOVERY - PHP7 rendering on mw2366 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:07] RECOVERY - PHP7 rendering on mw2287 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.189 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:07] RECOVERY - PHP7 rendering on mw2405 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.849 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:07] RECOVERY - PHP7 rendering on mw2302 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.241 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:07] RECOVERY - PHP7 rendering on mw2284 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.475 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:09] RECOVERY - PHP7 rendering on mw2297 is OK: HTTP OK: HTTP/1.1 302 
Found - 650 bytes in 8.808 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:09] RECOVERY - PHP7 rendering on mw2292 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.481 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:09] RECOVERY - Apache HTTP on mw2286 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.311 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:11] RECOVERY - PHP7 rendering on mw2319 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.508 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:11] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:55:11] RECOVERY - PHP7 rendering on mw2370 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.307 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:11] RECOVERY - High average POST latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=POST [02:55:13] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:55:13] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:55:13] RECOVERY - PHP7 rendering on mw2402 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:13] RECOVERY - Apache HTTP on mw2290 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.230 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:14] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [02:55:15] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:55:15] RECOVERY - Apache HTTP on mw2324 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.310 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:17] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:55:17] RECOVERY - Apache HTTP on mw2321 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.980 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:19] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:55:21] RECOVERY - Apache HTTP on mw2354 is 
OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:25] RECOVERY - Apache HTTP on mw2350 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.999 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:25] RECOVERY - Apache HTTP on mw2404 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.662 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:27] RECOVERY - Apache HTTP on mw2334 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.813 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:27] RECOVERY - PHP7 rendering on mw2401 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.372 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:27] RECOVERY - Apache HTTP on mw2308 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.500 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:27] RECOVERY - PHP7 rendering on mw2352 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:29] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:55:29] RECOVERY - PHP7 rendering on mw2286 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.534 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:29] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:55:31] RECOVERY - Apache HTTP on mw2261 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.900 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:31] RECOVERY - PHP7 rendering on mw2296 is OK: HTTP OK: 
HTTP/1.1 302 Found - 650 bytes in 2.341 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:31] RECOVERY - Apache HTTP on mw2356 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.551 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:31] RECOVERY - Apache HTTP on mw2405 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.563 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:31] RECOVERY - Apache HTTP on mw2352 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.630 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:33] RECOVERY - Apache HTTP on mw2402 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.710 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:33] RECOVERY - Apache HTTP on mw2360 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.382 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:55:34] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:55:34] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:55:41] PROBLEM - eventgate-analytics-external LVS codfw on eventgate-analytics-external.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=eventgate-analytics-external.svc.codfw.wmnet, port=4692): Read timed out. 
(read timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/Event_Platform/E [02:55:45] RECOVERY - PHP7 rendering on mw2306 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.928 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:55:49] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [02:56:01] RECOVERY - Apache HTTP on mw2399 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.375 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:56:01] RECOVERY - Apache HTTP on mw2300 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.493 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:56:01] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:01] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:03] RECOVERY - Apache HTTP on mw2306 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.108 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:56:03] RECOVERY - PHP7 rendering on mw2398 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.158 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:56:03] RECOVERY - Apache HTTP on mw2293 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.481 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:56:03] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:04] RECOVERY - PHP7 rendering on mw2283 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.071 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:56:05] RECOVERY - PHP7 rendering on mw2285 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.709 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:56:05] RECOVERY - Apache HTTP on mw2287 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.399 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:56:07] RECOVERY - Apache HTTP on mw2323 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.354 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:56:07] RECOVERY - PHP7 rendering on mw2356 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.555 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:56:08] PROBLEM - MariaDB Replica SQL: s4 #page on db2137 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:56:08] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:08] RECOVERY - PHP7 rendering on mw2358 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.971 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:56:17] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:17] RECOVERY - PHP7 rendering on mw2326 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.553 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:56:21] RECOVERY - Apache HTTP on mw2358 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.884 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:56:25] RECOVERY - PHP7 rendering on 
mw2300 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.458 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:56:25] RECOVERY - Apache HTTP on mw2295 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.474 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:56:33] RECOVERY - Apache HTTP on mw2289 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.871 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:56:44] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech, 10Wikimedia-Incident: Can not access Wikidata - https://phabricator.wikimedia.org/T290373 (10MBH) I can't even add a page ( https://ru.wikipedia.org/wiki/Воук_%28политика%29 ) to my watchlist. Error 503 or "server doesn't responding". [02:56:45] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:47] RECOVERY - PHP7 rendering on mw2289 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.865 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:56:57] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [02:57:03] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [02:57:12] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at codfw #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5176 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver [02:57:15] RECOVERY - eventgate-analytics-external LVS codfw on eventgate-analytics-external.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [02:57:21] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:57:37] RECOVERY - PHP7 rendering on mw2403 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.624 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:57:43] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:58:09] RECOVERY - LVS eventgate-analytics-external codfw port 4692/tcp - EventGate analytics external endpoint- eventgate-analytics-external.svc.codfw.wmnet and intake-analytics.wikimedia.org IPv4 on eventgate-analytics-external.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 668 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:58:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:58:33] RECOVERY - Apache HTTP on mw2362 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.381 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:58:43] PROBLEM - PHP7 rendering on mw2318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:58:43] PROBLEM - PHP7 rendering on mw2350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:58:55] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:59:15] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:59:25] PROBLEM - Apache HTTP on mw2283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:59:27] RECOVERY - Apache HTTP on mw2320 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.768 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:59:29] PROBLEM - PHP7 rendering on mw2299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:59:35] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:59:43] PROBLEM - Apache HTTP on mw2288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:59:49] PROBLEM - PHP7 rendering on mw2362 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:59:51] PROBLEM - PHP7 rendering on mw2304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:59:51] PROBLEM - PHP7 rendering on mw2404 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:59:51] PROBLEM - PHP7 rendering on mw2364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:59:51] PROBLEM - Apache HTTP on mw2364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:59:55] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:59:57] PROBLEM - Apache HTTP on mw2262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:59:57] PROBLEM - Apache HTTP on mw2401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:59:57] PROBLEM - PHP7 rendering on mw2293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:59:59] PROBLEM - Apache HTTP on mw2403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:59:59] PROBLEM - PHP7 rendering on mw2323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:59:59] PROBLEM - Apache HTTP on mw2328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:00:01] PROBLEM - PHP7 rendering on mw2288 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:00:01] PROBLEM - PHP7 rendering on mw2399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:00:04] PROBLEM - MariaDB Replica SQL: s4 #page on db2137 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:00:04] PROBLEM - Apache HTTP on mw2370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:00:05] PROBLEM - PHP7 rendering on mw2287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:00:05] PROBLEM - PHP7 rendering on mw2297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:00:07] PROBLEM - PHP7 rendering on mw2302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:00:07] PROBLEM - PHP7 rendering on mw2292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:00:07] PROBLEM - PHP7 rendering on mw2319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:00:09] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [03:00:11] RECOVERY - Varnish has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/fe494e83d04fee66c8f0958bfc28451f [03:00:13] PROBLEM - PHP7 rendering on mw2370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:00:13] PROBLEM - Apache HTTP on mw2290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:00:13] PROBLEM - Apache HTTP on mw2324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:00:13] PROBLEM - PHP7 rendering on mw2402 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:00:19] PROBLEM - Apache HTTP on mw2321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:00:27] PROBLEM - Apache HTTP on mw2350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:00:29] PROBLEM - Apache HTTP on mw2334 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:00:29] PROBLEM - PHP7 rendering on mw2401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:00:29] PROBLEM - PHP7 rendering on mw2352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:00:33] PROBLEM - Apache HTTP on mw2352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:00:33] PROBLEM - PHP7 rendering on mw2286 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:00:35] PROBLEM - Apache HTTP on mw2360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:00:47] PROBLEM - PHP7 rendering on mw2306 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:01:03] PROBLEM - Apache HTTP on mw2399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:01:05] PROBLEM - Apache HTTP on mw2293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:01:05] PROBLEM - PHP7 rendering on mw2398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:01:05] PROBLEM - Apache HTTP on mw2323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:01:05] PROBLEM - PHP7 rendering on mw2285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:01:05] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:01:05] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:07] PROBLEM - PHP7 rendering on mw2283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:01:09] PROBLEM - Apache HTTP on mw2287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:01:09] PROBLEM - PHP7 rendering on mw2358 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:01:13] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:01:17] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:01:17] PROBLEM - PHP7 rendering on mw2326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:01:17] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:01:19] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:01:19] PROBLEM - Apache HTTP on mw2398 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:01:19] PROBLEM - Apache HTTP on mw2358 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:01:23] PROBLEM - PHP7 rendering on mw2300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:01:24] PROBLEM - Apache HTTP on mw2295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:01:29] PROBLEM - 
Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [03:01:33] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:01:43] RECOVERY - Apache HTTP on mw2288 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.301 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:01:44] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech, 10Wikimedia-Incident: Can not access Wikidata - https://phabricator.wikimedia.org/T290373 (10Legoktm) We have an ongoing outage right now and are investigating. [03:01:49] RECOVERY - PHP7 rendering on mw2304 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.501 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:01:53] RECOVERY - Apache HTTP on mw2403 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.503 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:01:55] RECOVERY - Apache HTTP on mw2401 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.642 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:01:55] RECOVERY - PHP7 rendering on mw2293 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.803 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:01:55] RECOVERY - PHP7 rendering on mw2323 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.399 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:01:57] PROBLEM - Apache HTTP on mw2298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:01:59] 
RECOVERY - PHP7 rendering on mw2288 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.712 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:01] RECOVERY - Apache HTTP on mw2370 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.189 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:01] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:02:01] RECOVERY - PHP7 rendering on mw2399 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.921 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:04] RECOVERY - MariaDB Replica SQL: s4 #page on db2137 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:02:07] RECOVERY - MariaDB Replica Lag: s4 #page on db2137 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:02:08] RECOVERY - PHP7 rendering on mw2370 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.720 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:09] RECOVERY - PHP7 rendering on mw2402 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.192 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:11] RECOVERY - Apache HTTP on mw2291 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.539 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:12] PROBLEM - LVS api 
codfw port 80/tcp - MediaWiki API cluster- api.svc.codfw.wmnet IPv4 #page on api.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:02:13] RECOVERY - Apache HTTP on mw2322 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.470 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:15] RECOVERY - Apache HTTP on mw2321 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:15] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:02:21] RECOVERY - PHP7 rendering on mw2352 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.665 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:25] RECOVERY - Apache HTTP on mw2319 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.549 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:25] RECOVERY - Apache HTTP on mw2350 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 9.258 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:25] RECOVERY - Apache HTTP on mw2352 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.317 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:27] RECOVERY - PHP7 rendering on mw2286 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 2.587 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:27] RECOVERY - Apache HTTP on mw2334 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.913 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:29] RECOVERY - Apache HTTP on mw2360 is OK: HTTP 
OK: HTTP/1.1 302 Found - 636 bytes in 2.424 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:29] RECOVERY - PHP7 rendering on mw2334 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 7.567 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:29] RECOVERY - PHP7 rendering on mw2401 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 9.707 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:31] RECOVERY - Apache HTTP on mw2318 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.197 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:32] !log rzl@cumin2001 dbctl commit (dc=all): 'Depool db2137:3314', diff saved to https://phabricator.wikimedia.org/P17210 and previous config saved to /var/cache/conftool/dbconfig/20210904-030231-rzl.json [03:02:33] RECOVERY - Apache HTTP on mw2292 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 8.488 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:35] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:02:37] RECOVERY - PHP7 rendering on mw2397 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 2.678 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:39] RECOVERY - PHP7 rendering on mw2318 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.329 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:39] RECOVERY - Apache HTTP on mw2332 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 6.389 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:39] RECOVERY - PHP7 rendering on mw2350 is OK: HTTP OK: 
HTTP/1.1 302 Found - 650 bytes in 6.951 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:39] RECOVERY - Apache HTTP on mw2397 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.986 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:43] RECOVERY - PHP7 rendering on mw2306 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.153 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:45] RECOVERY - Apache HTTP on mw2299 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.575 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:47] RECOVERY - Apache HTTP on mw2372 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.693 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:53] RECOVERY - PHP7 rendering on mw2332 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 3.403 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:55] RECOVERY - Apache HTTP on mw2304 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 3.961 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:57] RECOVERY - Apache HTTP on mw2293 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.667 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:02:57] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:02:59] RECOVERY - PHP7 rendering on mw2398 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 1.575 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:59] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:02:59] RECOVERY - restbase 
endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:02:59] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:02:59] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:02:59] RECOVERY - Apache HTTP on mw2323 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:00] RECOVERY - PHP7 rendering on mw2285 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 2.372 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:01] RECOVERY - Apache HTTP on mw2399 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:01] RECOVERY - Apache HTTP on mw2317 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.334 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:01] RECOVERY - PHP7 rendering on mw2360 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:02] RECOVERY - PHP7 rendering on mw2358 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.679 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:03] RECOVERY - PHP7 rendering on mw2283 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 4.027 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:03] RECOVERY - Apache HTTP on mw2287 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 2.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:03] 
RECOVERY - PHP7 rendering on mw2317 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 2.818 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:05] RECOVERY - PHP7 rendering on mw2322 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:07] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:09] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [03:03:09] RECOVERY - PHP7 rendering on mw2326 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.309 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:11] RECOVERY - Apache HTTP on mw2358 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:11] RECOVERY - Apache HTTP on mw2398 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.205 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:11] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:11] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:13] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [03:03:15] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:15] RECOVERY - Apache HTTP on mw2295 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.107 second 
response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:15] RECOVERY - PHP7 rendering on mw2300 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:17] RECOVERY - Apache HTTP on mw2283 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.658 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:17] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [03:03:19] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [03:03:21] RECOVERY - PHP7 rendering on mw2299 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:21] RECOVERY - PHP7 rendering on mw2328 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 2.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:27] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [03:03:27] RECOVERY - Apache HTTP on mw2294 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.220 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:29] RECOVERY - Apache HTTP on mw2330 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 1.316 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:29] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:29] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:29] 
RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:31] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:33] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [03:03:37] RECOVERY - Apache HTTP on mw2285 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:39] RECOVERY - PHP7 rendering on mw2372 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.415 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:43] RECOVERY - PHP7 rendering on mw2404 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:43] RECOVERY - PHP7 rendering on mw2364 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:43] RECOVERY - PHP7 rendering on mw2362 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:43] RECOVERY - Apache HTTP on mw2364 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:47] RECOVERY - Apache HTTP on mw2262 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:51] RECOVERY - Apache HTTP on mw2328 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.478 second response time https://wikitech.wikimedia.org/wiki/Application_servers 
[03:03:51] RECOVERY - Apache HTTP on mw2298 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:53] RECOVERY - Apache HTTP on mw2297 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:55] RECOVERY - PHP7 rendering on mw2294 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 1.012 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:55] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:57] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:57] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:03:59] RECOVERY - PHP7 rendering on mw2297 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:59] RECOVERY - Apache HTTP on mw2326 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:03:59] RECOVERY - PHP7 rendering on mw2287 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:59] RECOVERY - PHP7 rendering on mw2290 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.809 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:01] RECOVERY - PHP7 rendering on mw2319 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.105 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:01] RECOVERY - PHP7 rendering on mw2292 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:01] RECOVERY - PHP7 rendering on mw2302 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.474 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:03] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [03:04:05] RECOVERY - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.codfw.wmnet IPv4 #page on api.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 24866 bytes in 0.977 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:04:05] RECOVERY - Apache HTTP on mw2290 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.665 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:04:06] RECOVERY - Apache HTTP on mw2324 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.753 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:04:09] RECOVERY - PHP7 rendering on mw2330 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:11] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [03:04:11] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [03:04:11] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:04:15] RECOVERY - phpfpm_up reduced availability on alert1001 is OK: (C)0.8 
le (W)0.9 le 0.9517 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:04:19] RECOVERY - Apache HTTP on mw2302 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.423 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:04:19] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [03:04:23] RECOVERY - PHP7 rendering on mw2262 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:33] RECOVERY - PHP7 rendering on mw2324 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.719 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:33] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is OK: All metrics within thresholds. 
https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [03:04:51] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [03:04:53] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:05:11] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at codfw #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5617 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [03:06:19] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [03:06:31] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [03:06:39] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [03:06:45] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [03:07:41] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:08:07] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:08:27] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.04688 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [03:12:47] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:13:22] SRE, Traffic, Wikidata, wdwb-tech, Wikimedia-Incident: Can not access Wikidata - https://phabricator.wikimedia.org/T290373 (Legoktm) Things should be back to normal now - if it's not, please let us know.
[03:14:07] SRE, Wikimedia-Incident: 2021-09-03 General MediaWiki outage - https://phabricator.wikimedia.org/T290373 (Legoktm) [03:14:45] SRE, Wikimedia-Incident: 2021-09-03 General MediaWiki outage - https://phabricator.wikimedia.org/T290373 (Legoktm) [03:15:03] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:27:39] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:28:46] SRE, Wikimedia-Incident: 2021-09-03 General MediaWiki outage - https://phabricator.wikimedia.org/T290373 (colewhite) Open→Resolved >>! In T290373#7332722, @Legoktm wrote: > Things should be back to normal now - if it's not, please let us know. Resolving for now ^ [03:30:53] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:36:24] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:50:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:55:37] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:01:07] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 10.17 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [04:03:09] RECOVERY - Check systemd state on
alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:08:35] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:10:25] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:11:01] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:15:57] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:16:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:23:41] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:25:07] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:39] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:34:21] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:36:42] (PS1) Bstorm: cloud NFS: tighten up the traffic shaping a little [puppet] - 
https://gerrit.wikimedia.org/r/717948 (https://phabricator.wikimedia.org/T290318) [04:39:21] (CR) Bstorm: [C: +2] cloud NFS: tighten up the traffic shaping a little [puppet] - https://gerrit.wikimedia.org/r/717948 (https://phabricator.wikimedia.org/T290318) (owner: Bstorm) [04:41:39] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:45:19] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:45:21] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:50:51] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:00:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:01:49] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 10.44 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [05:05:33] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:09:15] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:43] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:10:59] RECOVERY - Persistent high iowait on labstore1004 is OK: (C)10 ge (W)5 ge 0.9745 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [05:14:45] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:20:07] (PS1) Bstorm: cloud nfs: disable scanning toolforge tool directories [puppet] - https://gerrit.wikimedia.org/r/717996 (https://phabricator.wikimedia.org/T290375) [05:23:39] (CR) Bstorm: [C: +2] cloud nfs: disable scanning toolforge tool directories [puppet] - https://gerrit.wikimedia.org/r/717996 (https://phabricator.wikimedia.org/T290375) (owner: Bstorm) [05:24:17] (PS1) Majavah: tool: Read name prefix from /etc/wmcs-project [software/tools-manifest] - https://gerrit.wikimedia.org/r/718004 (https://phabricator.wikimedia.org/T290325) [05:45:11] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:52:51] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:12:11] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:21:51] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:22:11] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:33] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:05] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:51] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:10:01] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:15:27] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:21:05] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:33:47] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:34:05] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:39:33] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:49:51] PROBLEM - Check systemd state on ms-be2043 is CRITICAL: CRITICAL - degraded: The following units failed: session-190160.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:17] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:07] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:08:01] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:57] (PS11) Majavah: Route Grid engine web requests via Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) [08:11:53] (CR) Majavah: Route Grid engine web requests via Kubernetes (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) (owner: Majavah) [08:13:34] SRE, Continuous-Integration-Infrastructure, MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), Patch-For-Review, Release-Engineering-Team (Yak Shaving 🐃🪒): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (kostajh) a:kostajh [08:13:49] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:32:23] RECOVERY - Check systemd state on ms-be2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:01] RECOVERY - Check systemd state on alert1001 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:39:11] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:39:33] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:40:51] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:56:02] (CR) Elukey: [C: +1] sre.hosts.decommission: catch unhandled exception [cookbooks] - https://gerrit.wikimedia.org/r/717475 (https://phabricator.wikimedia.org/T290326) (owner: Volans) [09:00:31] !log `systemctl reset-failed ifup@ens6.service` on puppetdb2002 - T273026 [09:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:36] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [09:00:49] RECOVERY - Check systemd state on puppetdb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:02:05] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:59] !log restart wmf_auto_restart_rsyslog.service on puppetdb1002 [09:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:19] RECOVERY - Check systemd state on puppetdb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:51] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:55] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:23:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:31] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:09] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:33:09] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 229 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:37:37] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:40:25] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 216 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:43:07] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:31] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:52:09] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:37] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:33] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 117 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:03:05] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:03] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 12 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:07:09] PROBLEM - Query Service HTTP Port on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [10:08:35] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:57] RECOVERY - Query Service HTTP Port on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [10:14:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:07] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Yann) 2 more files which were not in htt... [10:19:31] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:13] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:24] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) @Sj Sadly, we think we solved (private) backups, but we decided, for the scope of this task, to not solve dumps becaus... [10:36:49] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:07] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:05] 10SRE, 10DBA, 10Wikimedia-Incident: enwiki was down at 10:44 (UTC) - https://phabricator.wikimedia.org/T290379 (10Majavah) There indeed was an alert for text request volume that matches your timing: `lang=irc 10:45:57 (VarnishTrafficDrop) firing: 62% GET drop in text@ during the past 30 minutes... 
[11:03:51] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:01] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:08] 10SRE, 10DBA, 10Traffic, 10Wikimedia-Incident: enwiki was down at 10:44 (UTC) - https://phabricator.wikimedia.org/T290379 (10Marostegui) I'm on my phone but just to mention that the queries dropping is probably a consequence of something else and not the consequence. There's a huge spike and then the drop,... [11:37:31] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:05] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:41:09] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:41] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:09] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:49] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:02] 10SRE, 10DBA, 10Traffic, 10Wikimedia-Incident: enwiki was down at 10:44 (UTC) - https://phabricator.wikimedia.org/T290379 
(10Marostegui) From what I can see all the API enwiki hosts got this query around the reported time which is pretty crazy (hiding the query for obvious reasons): {P17211} Pasting the e... [12:17:53] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2137:3314 (re)pooling @ 5%: Slowly repool T290374', diff saved to https://phabricator.wikimedia.org/P17212 and previous config saved to /var/cache/conftool/dbconfig/20210904-122014-root.json [12:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:01] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:25:21] PROBLEM - Check systemd state on ms-be2048 is CRITICAL: CRITICAL - degraded: The following units failed: session-27362.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:29] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:35:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2137:3314 (re)pooling @ 10%: Slowly repool T290374', diff saved to https://phabricator.wikimedia.org/P17213 and previous config saved to /var/cache/conftool/dbconfig/20210904-123518-root.json [12:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:47] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:39:33] PROBLEM - Check 
systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:04] RECOVERY - Check systemd state on ms-be2048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:50:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2137:3314 (re)pooling @ 25%: Slowly repool T290374', diff saved to https://phabricator.wikimedia.org/P17214 and previous config saved to /var/cache/conftool/dbconfig/20210904-125021-root.json [12:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:03] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:54:07] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:33] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:11] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:41] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:05:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2137:3314 (re)pooling @ 50%: Slowly repool T290374', diff saved to https://phabricator.wikimedia.org/P17215 and previous config saved to /var/cache/conftool/dbconfig/20210904-130525-root.json [13:05:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:31] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:09] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:39] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:20:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2137:3314 (re)pooling @ 75%: Slowly repool T290374', diff saved to https://phabricator.wikimedia.org/P17216 and previous config saved to /var/cache/conftool/dbconfig/20210904-132029-root.json [13:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:05] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:35:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2137:3314 (re)pooling @ 100%: Slowly repool T290374', diff saved to https://phabricator.wikimedia.org/P17217 and previous config saved to /var/cache/conftool/dbconfig/20210904-133532-root.json [13:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:59] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:07] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:45:09] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system 
is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:43] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:50:35] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: statograph_post.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:22] I'll silence this until monday ^ [13:56:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:25] Ty godog [13:57:26] RhinosF1: cheers [13:59:02] Np [14:21:13] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:15:11] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:59] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:59] PROBLEM - Disk space on maps2009 is CRITICAL: DISK CRITICAL - free space: / 2722 MB (3% inode=98%): /tmp 2722 MB (3% inode=98%): /var/tmp 2722 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2009&var-datasource=codfw+prometheus/ops [18:56:45] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:31] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The 
following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:21] PROBLEM - Check systemd state on an-worker1140 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:23] PROBLEM - Hadoop NodeManager on an-worker1140 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:58:29] RECOVERY - Check systemd state on an-worker1140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:59:25] RECOVERY - Hadoop NodeManager on an-worker1140 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:07:11] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:39] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:42:33] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:48:13] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:50:35] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 
0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:51:05] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down