[00:01:01] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-media' for release 'main' . [00:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:38] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-media' for release 'main' . [00:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:55] (PS1) Ryan Kemper: elasticsearch: it's ExecStartPre, not ExecPreStart [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) [00:07:21] (PS2) Ryan Kemper: elasticsearch: it's ExecStartPre, not ExecPreStart [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) [00:08:30] (CR) jerkins-bot: [V: -1] elasticsearch: it's ExecStartPre, not ExecPreStart [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) (owner: Ryan Kemper) [00:08:43] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) (owner: Ryan Kemper) [00:09:39] (PS3) Ryan Kemper: elasticsearch: it's ExecStartPre, not ExecPreStart [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) [00:14:02] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:29] (PS10) Ryan Kemper: wdqs: Prepare streaming updater settings [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse) [00:16:33] (PS11) Ryan Kemper: wdqs: Prepare streaming updater settings [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse) [00:17:31] (CR) Ryan Kemper: "check experimental" [puppet] - 
https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) (owner: Ryan Kemper) [00:19:38] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse) [00:35:34] (CR) Ryan Kemper: "This should be a no-op, but PCC is failing on wdqs1003:" [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse) [00:38:38] (PS12) Ryan Kemper: wdqs: Prepare streaming updater settings [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse) [00:42:39] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:43:05] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse) [00:52:07] (CR) Ryan Kemper: "Figured it out, just a small syntax error that had been introduced into hieradata/role/eqiad/wdqs/internal.yaml (now fixed)" [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse) [00:52:12] (CR) Ryan Kemper: [C: +2] wdqs: Prepare streaming updater settings [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse) [01:08:17] (PS1) Ryan Kemper: query_service: fix newly broken gc-log-cleanup [puppet] - https://gerrit.wikimedia.org/r/721646 [01:11:22] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/721646 (owner: Ryan Kemper) [01:20:19] PROBLEM - Host cloudvirt1030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [01:34:57] (PS1) Ryan Kemper: elasticsearch: cleanup absented cron resources [puppet] - https://gerrit.wikimedia.org/r/721647 
(https://phabricator.wikimedia.org/T273673) [01:35:42] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/721647 (https://phabricator.wikimedia.org/T273673) (owner: Ryan Kemper) [01:40:50] (CR) Ryan Kemper: "@Ebernhardson - tagging you for just the 'default' vs default stuff since it was introduced in this commit of yours https://github.com/wik" [puppet] - https://gerrit.wikimedia.org/r/721647 (https://phabricator.wikimedia.org/T273673) (owner: Ryan Kemper) [01:42:25] (CR) Ryan Kemper: wdqs: remove codfw hourly restarts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/720102 (https://phabricator.wikimedia.org/T290330) (owner: Ryan Kemper) [01:42:38] (CR) Ryan Kemper: [C: +2] wdqs: remove codfw hourly restarts [puppet] - https://gerrit.wikimedia.org/r/720102 (https://phabricator.wikimedia.org/T290330) (owner: Ryan Kemper) [01:43:08] (CR) RLazarus: [C: +1] "Belatedly - looks great, thank you!" [puppet] - https://gerrit.wikimedia.org/r/720817 (https://phabricator.wikimedia.org/T285298) (owner: Ahmon Dancy) [01:48:00] !log T290330 [Remove WDQS codfw ~hourly restarts] `sudo cumin 'C:query_service::crontasks' 'sudo disable-puppet "Stop doing wdqs codfw ~hourly restarts - T290330"'` [01:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:06] T290330: Wikidata Query Service unstable in codfw - https://phabricator.wikimedia.org/T290330 [01:49:44] PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:55:10] !log T290330 [Remove WDQS codfw ~hourly restarts] Testing on arbitrary codfw host: `ryankemper@wdqs2001:~$ sudo run-puppet-agent` [01:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:16] T290330: Wikidata Query Service unstable in codfw - 
https://phabricator.wikimedia.org/T290330 [01:56:09] The `wdqs2004` alert is because I made the mistake of merging before running the "disable puppet" command, and `wdqs2004` happened to run puppet in that time period [01:57:05] (CR) SDineshKumar: [C: +1] elasticsearch: cleanup absented cron resources (2 comments) [puppet] - https://gerrit.wikimedia.org/r/721647 (https://phabricator.wikimedia.org/T273673) (owner: Ryan Kemper) [01:59:08] RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:04:58] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:18:12] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:21:50] (CR) Cwhite: [C: +1] logstash: make jmx_ params optional [puppet] - https://gerrit.wikimedia.org/r/721370 (owner: Herron) [02:22:00] !log T290330 [Remove WDQS codfw ~hourly restarts] `wdqs2001` and `wdqs2004` look fine after running `sudo systemctl reset-failed wdqs-restart-hourly-w-random-delay.timer` to clean up dangling timer [02:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:22:07] T290330: Wikidata Query Service unstable in codfw - https://phabricator.wikimedia.org/T290330 [02:25:14] (CR) Cwhite: [C: +1] logstash: add udp output module [puppet] - https://gerrit.wikimedia.org/r/721356 (owner: Herron) [02:25:40] (CR) Cwhite: [C: +1] logstash::input::gelf: add host param [puppet] - https://gerrit.wikimedia.org/r/721346 (owner: Herron) [02:28:39] !log T290330 [Remove WDQS codfw ~hourly restarts] Successfully rolled out to rest of fleet `sudo cumin 'C:query_service::crontasks' 'sudo run-puppet-agent 
--force && sudo systemctl reset-failed wdqs-restart-hourly-w-random-delay.timer'` [02:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:48] T290330: Wikidata Query Service unstable in codfw - https://phabricator.wikimedia.org/T290330 [02:29:15] (CR) Cwhite: profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP (2 comments) [puppet] - https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) (owner: Herron) [02:35:35] (Abandoned) Cwhite: profile, elasticsearch: add option to configure systemd Before= value [puppet] - https://gerrit.wikimedia.org/r/666231 (https://phabricator.wikimedia.org/T275405) (owner: Cwhite) [03:08:22] PROBLEM - Apache HTTP on wtp1037 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1872 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:10:12] RECOVERY - Apache HTTP on wtp1037 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers [04:42:38] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4636168200 and 48875 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:43:58] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1698401776 and 48955 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:44:16] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 665290288 and 276 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:44:52] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1388724288 and 49009 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:45:26] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1172109200 and 493 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:45:52] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 516600 and 87 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:46:10] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 216 and 105 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:46:26] RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 280 and 122 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:46:46] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 392 and 141 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:47:20] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 192 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:10:36] (CR) Marostegui: "This change was done yesterday (pending codfw)" [software/conftool] - https://gerrit.wikimedia.org/r/708632 (https://phabricator.wikimedia.org/T167973) (owner: RhinosF1) [05:12:42] (PS1) Marostegui: install_server: Reimage db2103 to Buster [puppet] - https://gerrit.wikimedia.org/r/721652 (https://phabricator.wikimedia.org/T290865) [05:13:36] (CR) Marostegui: [C: +2] install_server: Reimage db2103 to Buster [puppet] - https://gerrit.wikimedia.org/r/721652 (https://phabricator.wikimedia.org/T290865) (owner: Marostegui) [05:43:46] (PS4) Effie Mouzeli: common_templates: add support for envoy tcp proxy [deployment-charts] - 
https://gerrit.wikimedia.org/r/721337 [06:01:48] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:04:10] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 78029276960 and 1432 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:04:42] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 81727635528 and 1465 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:05:38] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 82965504040 and 1522 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:06:16] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 85065977400 and 1558 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:06:50] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 87124828744 and 1593 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:33:58] ops-eqiad, cloud-services-team (Kanban): cloudvirt1030's mgmt seems unreachable from icinga - https://phabricator.wikimedia.org/T291237 (elukey) [06:34:18] ACKNOWLEDGEMENT - Host cloudvirt1030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Elukey T291237 [06:53:11] (CR) Muehlenhoff: [C: +2] Remove obsolete java::security [puppet] - https://gerrit.wikimedia.org/r/719261 (https://phabricator.wikimedia.org/T282454) (owner: Muehlenhoff) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210917T0700) [07:03:28] (CR) DCausse: [C: +1] "sorry for this typo & thanks for fixing it!" [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) (owner: Ryan Kemper) [07:05:49] (PS1) Elukey: profile::analytics::httpd: disable CGI [puppet] - https://gerrit.wikimedia.org/r/721755 (https://phabricator.wikimedia.org/T285355) [07:06:47] (Traffic bill over quota) firing: (2) Traffic bill over quota - https://alerts.wikimedia.org [07:07:36] (CR) Muehlenhoff: [C: +2] mediawiki: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/721549 (owner: Muehlenhoff) [07:11:41] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/721618 (owner: Legoktm) [07:18:24] (CR) Ema: [C: +2] VarnishTrafficDrop: fix site label in summary [alerts] - https://gerrit.wikimedia.org/r/721507 (https://phabricator.wikimedia.org/T291149) (owner: BBlack) [07:24:39] (CR) Muehlenhoff: [C: +1] elasticsearch: it's ExecStartPre, not ExecPreStart [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) (owner: Ryan Kemper) [07:26:47] (Traffic bill over quota) resolved: (2) Traffic bill over quota - https://alerts.wikimedia.org [07:27:46] PROBLEM - Apache HTTP on wtp1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:27:46] PROBLEM - Apache HTTP on wtp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:27:46] PROBLEM - PHP7 rendering on wtp1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:27:48] PROBLEM - Apache HTTP on wtp1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers 
[07:27:52] PROBLEM - PHP7 rendering on wtp1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:27:58] PROBLEM - PHP7 rendering on wtp1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:28:02] PROBLEM - PHP7 rendering on wtp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:28:06] PROBLEM - PHP7 rendering on wtp1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:28:10] PROBLEM - Apache HTTP on wtp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:28:10] PROBLEM - Apache HTTP on wtp1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:28:18] PROBLEM - PHP7 rendering on wtp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:28:22] PROBLEM - Apache HTTP on wtp1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:28:23] PROBLEM - Apache HTTP on wtp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:28:23] PROBLEM - PHP7 rendering on wtp1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:28:23] PROBLEM - Apache HTTP on wtp1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:28:24] PROBLEM - PHP7 rendering on wtp1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:28:29] PROBLEM - PHP7 rendering on wtp1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:28:29] PROBLEM - Apache HTTP on wtp1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:28:29] PROBLEM - Apache HTTP on wtp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:28:29] PROBLEM - PHP7 rendering on wtp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:28:30] PROBLEM - Apache HTTP on wtp1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:28:40] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [07:28:42] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:28:42] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:28:42] PROBLEM - Apache HTTP on wtp1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:28:46] PROBLEM - 
PHP7 rendering on wtp1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:28:48] PROBLEM - PHP7 rendering on wtp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:28:50] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.48.228:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: [07:28:50] on connection while downloading http://10.64.48.228:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:28:52] PROBLEM - Apache HTTP on wtp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:28:53] PROBLEM - Apache HTTP on wtp1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:28:53] PROBLEM - PHP7 rendering on wtp1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:28:53] PROBLEM - PHP7 rendering on wtp1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:28:53] PROBLEM - Apache HTTP on wtp1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:28:53] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get 
structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:28:53] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:28:58] PROBLEM - Apache HTTP on wtp1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:29:06] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:29:06] PROBLEM - Apache HTTP on wtp1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:29:06] PROBLEM - Apache HTTP on wtp1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:29:07] PROBLEM - PHP7 rendering on wtp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:29:07] PROBLEM - PHP7 rendering on wtp1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:29:07] PROBLEM - PHP7 rendering on wtp1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:29:12] PROBLEM - MediaWiki 
exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 3078 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:29:14] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:29:14] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:29:14] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:29:14] PROBLEM - PHP7 rendering on wtp1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:29:15] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html/{title} (Transform w [07:29:15] to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200): /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/RESTBase [07:29:16] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [07:29:16] PROBLEM - PHP7 rendering on wtp1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:29:20] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:29:22] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:29:24] PROBLEM - Apache HTTP on wtp1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:29:24] PROBLEM - PHP7 rendering on wtp1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:29:24] PROBLEM - Apache HTTP on wtp1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:29:28] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [07:29:32] PROBLEM - Apache HTTP on wtp1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:29:36] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: 
/en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:29:36] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:29:36] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:29:36] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:29:44] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase1022.eqiad.wmnet, restbase1025.eqiad.wmnet, restbase1021.eqiad.wmnet, restbase1027.eqiad.wmnet, restbase1018.eqiad.wmnet, restbase1023.eqiad.wmnet, restbase1024.eqiad.wmnet, restbase1016.eqiad.wmnet, restbase1026.eqiad.wmnet, restbase1029.eqiad.wmnet, restbase1030.eqiad.wmnet, restbase1017.eqiad.wmnet, restbase102 [07:29:44] wmnet are marked down but pooled: parsoid-php_443: Servers wtp1029.eqiad.wmnet, wtp1048.eqiad.wmnet, wtp1039.eqiad.wmnet, wtp1035.eqiad.wmnet, 
wtp1045.eqiad.wmnet, wtp1031.eqiad.wmnet, wtp1038.eqiad.wmnet, wtp1041.eqiad.wmnet, wtp1032.eqiad.wmnet, wtp1026.eqiad.wmnet, wtp1030.eqiad.wmnet, wtp1028.eqiad.wmnet, wtp1034.eqiad.wmnet, wtp1033.eqiad.wmnet, wtp1025.eqiad.wmnet, wtp1044.eqiad.wmnet, wtp1047.eqiad.wmnet are marked down but pooled [07:29:44] wikitech.wikimedia.org/wiki/PyBal [07:29:50] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase1025.eqiad.wmnet, restbase1021.eqiad.wmnet, restbase1027.eqiad.wmnet, restbase1018.eqiad.wmnet, restbase1024.eqiad.wmnet, restbase1016.eqiad.wmnet, restbase1026.eqiad.wmnet, restbase1029.eqiad.wmnet, restbase1019.eqiad.wmnet, restbase1023.eqiad.wmnet, restbase1017.eqiad.wmnet, restbase1028.eqiad.wmnet, restbase102 [07:29:50] wmnet are marked down but pooled: parsoid-php_443: Servers wtp1048.eqiad.wmnet, wtp1044.eqiad.wmnet, wtp1028.eqiad.wmnet, wtp1033.eqiad.wmnet, wtp1025.eqiad.wmnet, wtp1027.eqiad.wmnet, wtp1039.eqiad.wmnet, wtp1040.eqiad.wmnet, wtp1036.eqiad.wmnet, wtp1034.eqiad.wmnet, wtp1032.eqiad.wmnet, wtp1045.eqiad.wmnet, wtp1029.eqiad.wmnet, wtp1037.eqiad.wmnet, wtp1031.eqiad.wmnet, wtp1038.eqiad.wmnet, wtp1046.eqiad.wmnet, wtp1035.eqiad.wmnet, wtp10 [07:29:50] .wmnet, wtp1041.eqiad.wmnet, wtp1047.eqiad.wmnet, wtp1026.eqiad.wmnet, wtp1030.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:29:52] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:29:52] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before 
a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:29:52] PROBLEM - Apache HTTP on wtp1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:30:00] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.16.173:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: [07:30:00] on connection while downloading http://10.64.16.173:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:30:00] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.0.208:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: [07:30:00] on connection while downloading http://10.64.0.208:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:30:04] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:30:06] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: 
/en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:30:06] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [07:30:08] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [07:30:10] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:30:12] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:30:13] whattt [07:30:14] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:30:14] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:30:15] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: 
/en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:30:16] PROBLEM - PHP7 rendering on wtp1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:30:16] PROBLEM - PHP7 rendering on wtp1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:30:40] (03PS3) 10Jelto: services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) [07:30:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [07:30:56] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:31:01] _joe_ jayme akosiaris --^ [07:31:02] what is going on here? 
:o [07:31:04] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [07:31:04] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:31:14] RECOVERY - Apache HTTP on wtp1038 is OK: HTTP OK: HTTP/1.1 302 Found - 637 bytes in 9.066 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:31:16] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [07:31:20] RECOVERY - PHP7 rendering on wtp1032 is OK: HTTP OK: HTTP/1.1 302 Found - 651 bytes in 2.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:31:22] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:31:23] whatt [07:31:24] RECOVERY - Apache HTTP on wtp1039 is OK: HTTP OK: HTTP/1.1 302 Found - 637 bytes in 5.962 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:31:29] RECOVERY - Apache HTTP on wtp1028 is OK: HTTP OK: HTTP/1.1 302 Found - 637 bytes in 7.597 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:31:30] RECOVERY - Apache HTTP on wtp1032 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:31:36] jayme: seemed all wtp-nodes related [07:31:37] you jinxed it elukey [07:31:44] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:31:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_restbase_cluster_eqiad,swagger_check_restbase_codfw,swagger_check_restbase_eqiad,swagger_check_restbase_esams} site={codfw,eqiad,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:31:48] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [07:31:50] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [07:31:52] RECOVERY - Apache HTTP on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 637 bytes in 9.909 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:31:52] RECOVERY - PHP7 rendering on wtp1034 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:31:54] RECOVERY - PHP7 rendering on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:31:54] RECOVERY - PHP7 rendering on wtp1039 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:31:54] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:31:54] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:31:54] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:31:55] RECOVERY - Apache HTTP on wtp1041 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 0.763 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:31:56] RECOVERY - Apache HTTP on wtp1035 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 0.853 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:31:56] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:31:56] RECOVERY - PHP7 rendering on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 651 bytes in 1.965 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:31:57] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:31:57] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:31:58] RECOVERY - PHP7 rendering on wtp1033 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:31:58] RECOVERY - Apache HTTP on wtp1047 is OK: HTTP OK: HTTP/1.1 302 Found - 637 bytes in 2.273 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:32:02] (03CR) 10jerkins-bot: [V: 04-1] services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [07:32:02] RECOVERY - PHP7 rendering on wtp1029 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:32:02] RECOVERY - Apache HTTP on wtp1046 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes 
in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:32:02] RECOVERY - PHP7 rendering on wtp1046 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:32:02] RECOVERY - Apache HTTP on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:32:03] RECOVERY - Apache HTTP on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:32:14] RECOVERY - Apache HTTP on wtp1029 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:32:16] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:32:17] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:32:18] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [07:32:19] RECOVERY - PHP7 rendering on wtp1031 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:32:22] RECOVERY - PHP7 rendering on wtp1044 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:32:24] <_joe_> ok I was about to say please someone else deal with it [07:32:26] RECOVERY - Apache HTTP on wtp1030 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:32:26] RECOVERY - Apache HTTP 
on wtp1034 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:32:26] RECOVERY - Apache HTTP on wtp1025 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:32:26] RECOVERY - PHP7 rendering on wtp1038 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:32:26] RECOVERY - PHP7 rendering on wtp1028 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:32:27] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:32:28] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:32:28] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:32:32] RECOVERY - Apache HTTP on wtp1027 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:32:38] RECOVERY - Apache HTTP on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:32:39] RECOVERY - Apache HTTP on wtp1031 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:32:39] RECOVERY - PHP7 rendering on wtp1035 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:32:39] RECOVERY - PHP7 rendering on 
wtp1027 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:32:39] RECOVERY - PHP7 rendering on wtp1045 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:32:40] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:32:40] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:32:46] RECOVERY - PHP7 rendering on wtp1040 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:32:48] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [07:32:49] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:32:49] RECOVERY - PHP7 rendering on wtp1025 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:32:49] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:32:49] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:32:49] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:32:49] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All 
endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [07:32:50] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [07:32:54] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 23 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:32:56] RECOVERY - PHP7 rendering on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:32:56] RECOVERY - Apache HTTP on wtp1033 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:32:56] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:32:58] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:33:04] RECOVERY - Apache HTTP on wtp1045 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:33:10] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:33:12] RECOVERY - Apache HTTP on wtp1044 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:33:12] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:33:12] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:33:16] RECOVERY - PHP7 rendering on wtp1041 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:33:22] RECOVERY - PHP7 rendering on wtp1030 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:33:24] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:33:26] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:33:26] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:33:28] RECOVERY - PHP7 rendering on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:33:30] RECOVERY - PHP7 rendering on wtp1047 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:33:32] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [07:33:34] RECOVERY - Apache HTTP on wtp1040 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:33:36] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:33:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:33:39] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:34:22] (03CR) 10Jelto: services: deploy services with helm3 (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [07:35:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [07:47:25] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/721755 (https://phabricator.wikimedia.org/T285355) (owner: 10Elukey) [07:49:56] (03PS3) 10Jcrespo: mediabackups: Add mysql grants for mediabackups [puppet] - 10https://gerrit.wikimedia.org/r/712993 (https://phabricator.wikimedia.org/T276442) [07:53:05] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Add mysql grants for mediabackups [puppet] - 10https://gerrit.wikimedia.org/r/712993 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [07:56:02] (03PS1) 10Marostegui: report_users.sh: Change session binlog to ROW [software] - 10https://gerrit.wikimedia.org/r/721760 [07:56:40] (03CR) 10Marostegui: [C: 03+2] report_users.sh: Change session binlog to ROW [software] - 10https://gerrit.wikimedia.org/r/721760 (owner: 10Marostegui) [07:57:12] (03Merged) 10jenkins-bot: report_users.sh: Change session binlog to ROW [software] - 10https://gerrit.wikimedia.org/r/721760 (owner: 10Marostegui) [08:00:08] !log restarting php-fpm on wtp1037 and wtp1030 [08:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:24] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:11:13] (03CR) 10Elukey: [C: 03+2] profile::analytics::httpd: disable CGI [puppet] - 10https://gerrit.wikimedia.org/r/721755 (https://phabricator.wikimedia.org/T285355) (owner: 10Elukey) [08:11:58] jynus: o/ ok to puppet-merge? [08:12:06] ups, sorry [08:12:12] did it get merged? [08:12:16] (03PS1) 10Marostegui: report_users: Add m5 proxy IP [software] - 10https://gerrit.wikimedia.org/r/721763 [08:12:23] but yes, it is a noop [08:12:46] ack proceeding :) [08:12:55] I thought I only +2d it and I was waiting for CI [08:13:02] sorry [08:13:10] (03CR) 10Marostegui: [C: 03+2] report_users: Add m5 proxy IP [software] - 10https://gerrit.wikimedia.org/r/721763 (owner: 10Marostegui) [08:13:16] jynus: no problem at all! [08:14:02] (03Merged) 10jenkins-bot: report_users: Add m5 proxy IP [software] - 10https://gerrit.wikimedia.org/r/721763 (owner: 10Marostegui) [08:50:42] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jijiki: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (10elukey) To keep archives happy - we currently don't deploy `systemd-coredump` on our hosts (because of the reasons highlighted above), so the dumps are not... 
[08:52:10] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jijiki: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (10elukey) [08:56:06] (03PS5) 10Effie Mouzeli: common_templates: add support for envoy tcp proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721337 [08:57:05] (03PS6) 10Effie Mouzeli: common_templates: add support for envoy tcp proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721337 [09:19:00] !log milimetric@deploy1002 Started deploy [analytics/refinery@37e904a]: Only syncing sanitize allowlist [09:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:48] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:29:52] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:29:52] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:30:16] Hi there, I've noticed a big/weird increase in traffic to citoid starting this Tuesday. https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=5m&from=now-7d&to=now [09:30:43] Notably we're getting a lot of requests for the "zotero" format which to my knowledge is not used by us at all. [09:31:21] So I suspect this big increase is probably foreign traffic. Might be worth checking out ahead of time though it hasn't broken anything yet (as far as I know!) 
[09:31:38] In case it is or will be a problem [09:36:43] !log milimetric@deploy1002 Finished deploy [analytics/refinery@37e904a]: Only syncing sanitize allowlist (duration: 17m 43s) [09:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:20] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:37:23] !log milimetric@deploy1002 Started deploy [analytics/refinery@37e904a] (thin): Only syncing sanitize allowlist, deploying THIN for consistency [09:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:26] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:37:30] !log milimetric@deploy1002 Finished deploy [analytics/refinery@37e904a] (thin): Only syncing sanitize allowlist, deploying THIN for consistency (duration: 00m 07s) [09:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:16] (03PS2) 10Tobias Andersson: miscweb: Add CSP headers for query builder [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [09:39:16] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:39:20] 10SRE, 10SRE Observability, 10Traffic: VarnishTrafficDrop IRC alert does not include DC name anymore - https://phabricator.wikimedia.org/T291149 (10ema) 05Open→03Resolved a:03ema >>! In T291149#7361022, @gerritbot wrote: > %%%[operations/alerts@master] VarnishTrafficDrop: fix site label in summary%%% >... 
[09:39:51] (03CR) 10jerkins-bot: [V: 04-1] miscweb: Add CSP headers for query builder [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [09:40:49] 10SRE, 10SRE-Access-Requests: Updating mbinder's keys for phabricator-bulk-manager - https://phabricator.wikimedia.org/T291141 (10cmooney) Thanks @MBinder_WMF, I've reached out on Slack as well to verify this via another channel (can't be too careful). Reply back there whenever you've a moment. Disappointed... [09:45:20] (03PS3) 10Tobias Andersson: miscweb: Add CSP headers for query builder [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [09:45:48] (03PS1) 10Elukey: mtail: add counter for kernel traps [puppet] - 10https://gerrit.wikimedia.org/r/721773 (https://phabricator.wikimedia.org/T246470) [09:47:27] (03CR) 10Tobias Andersson: [C: 03+1] miscweb: Add CSP headers for query builder [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [09:51:32] mvolz: the services dc switchover was on Tuesday, if you look at https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=5m&from=now-7d&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=citoid you can see how the request volume switched over [09:57:44] (03CR) 10Lucas Werkmeister (WMDE): "I’m getting CSP errors when trying this:" [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [09:59:39] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1030's mgmt seems unreachable from icinga - https://phabricator.wikimedia.org/T291237 (10aborrero) [10:00:17] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) My current plan for cluster-wise migration and deploying services with helm3 is: * make sure cluster is depooled * delete helm 
releases for all services * remove tiller compon... [10:00:42] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) [10:01:43] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) [10:06:58] 10SRE, 10SRE Observability, 10Patch-For-Review: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (10ema) >>! In T290870#7351871, @fgiunchedi wrote: > I think either approach will work as a bandaid, intuitively a rsyslog-native solution seems better to me (assumi... [10:07:12] 10SRE, 10SRE Observability, 10Patch-For-Review: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (10ema) p:05Triage→03Medium [10:07:56] (03Abandoned) 10Ema: rsyslog: config sanity check as systemd override [puppet] - 10https://gerrit.wikimedia.org/r/720913 (https://phabricator.wikimedia.org/T290870) (owner: 10Ema) [10:18:37] (03PS7) 10Muehlenhoff: os-updates-report: Adapt to new OS tracking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/707371 [10:27:28] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The patch seems correct, but given how convoluted envoy configs are:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/721337 (owner: 10Effie Mouzeli) [10:27:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/707371 (owner: 10Muehlenhoff) [10:32:58] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [10:36:06] (03CR) 10Tobias Andersson: [C: 04-1] miscweb: Add CSP headers for query builder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [10:39:33] (03PS4) 10Tobias Andersson: miscweb: Add CSP headers for 
query builder [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [10:40:52] (03PS8) 10Muehlenhoff: os-updates-report: Adapt to new OS tracking [puppet] - 10https://gerrit.wikimedia.org/r/707371 [10:44:59] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) We reached at some points (with the dc depooled, during night) peaks of 150 files/s, but it got as low as 6 files/s fo... [10:48:02] (03PS10) 10Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) [10:51:41] (03CR) 10Jgiannelos: [C: 03+2] Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [10:54:30] (03Merged) 10jenkins-bot: Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [11:00:05] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:08:12] (03CR) 10Muehlenhoff: [C: 03+2] os-updates-report: Adapt to new OS tracking [puppet] - 10https://gerrit.wikimedia.org/r/707371 (owner: 10Muehlenhoff) [11:08:57] ACKNOWLEDGEMENT - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 524537401864 and 19629 seconds Hnowlan Replication broken by planet resync - resync required. 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:08:57] ACKNOWLEDGEMENT - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 524394009144 and 19614 seconds Hnowlan Replication broken by planet resync - resync required. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:08:57] ACKNOWLEDGEMENT - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 526308446784 and 19684 seconds Hnowlan Replication broken by planet resync - resync required. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:08:57] ACKNOWLEDGEMENT - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 524394009144 and 19640 seconds Hnowlan Replication broken by planet resync - resync required. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:08:57] ACKNOWLEDGEMENT - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 524494148128 and 19694 seconds Hnowlan Replication broken by planet resync - resync required. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:09:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:01] (03PS1) 10Muehlenhoff: os tracking: Commit missíng data file [puppet] - 10https://gerrit.wikimedia.org/r/721788 [11:13:15] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:14:00] (03CR) 10Muehlenhoff: [C: 03+2] os tracking: Commit missíng data file [puppet] - 10https://gerrit.wikimedia.org/r/721788 (owner: 10Muehlenhoff) [11:14:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:07] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin1001 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [11:28:13] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [11:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:59] (03CR) 10Lucas Werkmeister (WMDE): miscweb: Add CSP headers for query builder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [11:34:27] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] miscweb: Add CSP headers for query builder [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [12:05:48] (03PS1) 10Muehlenhoff: os-reports: Small followup fixes [puppet] - 10https://gerrit.wikimedia.org/r/721804 [12:06:58] 10SRE, 10serviceops: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (10MoritzMuehlenhoff) Ack, I'll upload to apt.wikimedia.org on Monday. 
[12:12:52] (03CR) 10Muehlenhoff: [C: 03+2] os-reports: Small followup fixes [puppet] - 10https://gerrit.wikimedia.org/r/721804 (owner: 10Muehlenhoff) [12:13:55] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2001 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [12:31:19] 10SRE, 10SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137 (10Nikola_Smolenski) If it is of any help, when I try to open the file with Adobe's acroread, it reports "stat buffer overflow" error, while evince opens it ni... [12:32:22] (03PS1) 10Arturo Borrero Gonzalez: openstack: bootstrap manila component [puppet] - 10https://gerrit.wikimedia.org/r/721805 (https://phabricator.wikimedia.org/T291257) [12:44:00] (03PS4) 10Jelto: services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) [12:45:46] (03CR) 10jerkins-bot: [V: 04-1] services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [12:46:01] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:47:35] (03CR) 10Tobias Andersson: [C: 03+1] miscweb: Add CSP headers for query builder [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [12:48:22] (03PS1) 10Muehlenhoff: Install swaks on mail servers [puppet] - 10https://gerrit.wikimedia.org/r/721809 [12:48:36] (03PS3) 10Muehlenhoff: Switch remaining MX records to mx2001 [dns] - 10https://gerrit.wikimedia.org/r/721555 
(https://phabricator.wikimedia.org/T286911) [12:50:10] (03CR) 10Ladsgroup: [C: 03+1] "I will hopefully get it deployed soon" [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup) [12:52:20] (03CR) 10Hashar: [C: 04-1] "That would install the package on the Jenkins agent, however we no more run commands directly from the agent. Instead the execution enviro" [puppet] - 10https://gerrit.wikimedia.org/r/720402 (owner: 10Ebernhardson) [12:57:27] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:06:22] !log installing 4.9.272 kernels on stretch hosts (no reboots yet) [13:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:22] (03PS1) 10Ladsgroup: snapshot: Change URL of xmldatadumps-l from mailman2 to mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/721811 (https://phabricator.wikimedia.org/T282303) [13:48:11] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org - https://phabricator.wikimedia.org/T275234 (10Perryprog) This issue has been reported on Znuny on Ticket#2021091710000804, so it's safe to assume the problem is still pres... [13:59:31] (03CR) 10Dzahn: [C: 03+2] "This also does things like git pulling deployment-charts etc..hmm.. not just a few files.. but testing it on the inactive server." 
[puppet] - 10https://gerrit.wikimedia.org/r/721615 (https://phabricator.wikimedia.org/T288629) (owner: 10Ahmon Dancy) [14:03:37] (03CR) 10Dzahn: "deployed, did not see issues with it" [puppet] - 10https://gerrit.wikimedia.org/r/721615 (https://phabricator.wikimedia.org/T288629) (owner: 10Ahmon Dancy) [14:12:54] (03PS1) 10Dzahn: thumbor: add a system::role describing it [puppet] - 10https://gerrit.wikimedia.org/r/721818 [14:13:46] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/31129/thumbor1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [14:16:53] (03PS3) 10Hnowlan: api-gateway: allow /staging/ testing namespace only in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/715467 (https://phabricator.wikimedia.org/T289583) [14:21:10] 10SRE, 10Performance-Team, 10Thumbor, 10serviceops, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10Krinkle) Moving back for re-triage as it's been dormant 6 months in a columnn for things "this quarter". 
[14:21:17] PROBLEM - Check systemd state on thumbor1003 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor_process_age_statsd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:41] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor_process_age_statsd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:19] ACKNOWLEDGEMENT - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor_process_age_statsd.service daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/719543 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:19] ACKNOWLEDGEMENT - Check systemd state on thumbor1003 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor_process_age_statsd.service daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/719543 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:19] ACKNOWLEDGEMENT - Check systemd state on thumbor2002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor_process_age_statsd.service daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/719543 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:32] (03PS1) 10Dzahn: thumbor: fix location of generate-thumbor-age-metrics-nc.sh [puppet] - 10https://gerrit.wikimedia.org/r/721823 [14:25:37] (03CR) 10Dzahn: [C: 03+2] thumbor: fix location of generate-thumbor-age-metrics-nc.sh [puppet] - 10https://gerrit.wikimedia.org/r/721823 (owner: 10Dzahn) [14:26:09] PROBLEM - Check systemd state on thumbor1004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor_process_age_statsd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:25] PROBLEM - Check systemd state on thumbor1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
thumbor_process_age_statsd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:23] RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:03] RECOVERY - Check systemd state on thumbor1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:21] RECOVERY - Check systemd state on thumbor1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:53] RECOVERY - Check systemd state on thumbor1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:20] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [14:29:21] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 2856 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:47] (03CR) 10Elukey: [C: 04-1] "Had a very nice conversation with Joe about pros and cons of this approach, and a good suggestion that came up was to avoid duplicating hi" [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [14:42:32] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org - https://phabricator.wikimedia.org/T275234 (10cmooney) Recent email exchange on this as they contacted us again: > From: Cathal Mooney [mailto:cmooney@wikimedia.org] > Se... 
[14:43:38] (03CR) 10Dzahn: "needed https://gerrit.wikimedia.org/r/c/operations/puppet/+/721823 but is ok now" [puppet] - 10https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [14:44:07] (03PS2) 10Dzahn: thumbor: remove absented cron code for generate-thumbor-age-metrics [puppet] - 10https://gerrit.wikimedia.org/r/721589 (https://phabricator.wikimedia.org/T273673) [14:47:59] (03CR) 10Dzahn: [C: 03+2] thumbor: remove absented cron code for generate-thumbor-age-metrics [puppet] - 10https://gerrit.wikimedia.org/r/721589 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [14:49:57] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [14:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:21] (03PS6) 10Dzahn: swift: convert dispersion stats cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) [14:50:57] (03CR) 10Ahmon Dancy: "Thanks Daniel Z!" 
[puppet] - 10https://gerrit.wikimedia.org/r/721615 (https://phabricator.wikimedia.org/T288629) (owner: 10Ahmon Dancy) [14:51:09] (03CR) 10Dzahn: "friendly reminder this still needs deployment" [puppet] - 10https://gerrit.wikimedia.org/r/715597 (https://phabricator.wikimedia.org/T288806) (owner: 10Zabe) [14:51:27] (03CR) 10Hnowlan: apt::package_from_component: add update condition for multiple packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721275 (owner: 10Hnowlan) [14:56:54] (03PS2) 10Arturo Borrero Gonzalez: openstack: bootstrap manila component [puppet] - 10https://gerrit.wikimedia.org/r/721805 (https://phabricator.wikimedia.org/T291257) [14:56:56] (03PS1) 10Btullis: Add temporary rsync modules to two Cassandra nodes [puppet] - 10https://gerrit.wikimedia.org/r/721849 (https://phabricator.wikimedia.org/T249755) [14:57:48] (03CR) 10jerkins-bot: [V: 04-1] openstack: bootstrap manila component [puppet] - 10https://gerrit.wikimedia.org/r/721805 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [14:58:40] (03PS1) 10Elukey: Refactor kubernetes tokens and secrets [labs/private] - 10https://gerrit.wikimedia.org/r/721850 [15:01:07] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31130/console" [puppet] - 10https://gerrit.wikimedia.org/r/721849 (https://phabricator.wikimedia.org/T249755) (owner: 10Btullis) [15:06:02] 10SRE, 10SRE-Access-Requests: Updating mbinder's keys for phabricator-bulk-manager - https://phabricator.wikimedia.org/T291141 (10cmooney) p:05Triage→03Medium a:03cmooney [15:09:40] (03CR) 10Giuseppe Lavagetto: [C: 03+1] scaffold: add more options for PHP [deployment-charts] - 10https://gerrit.wikimedia.org/r/719973 (owner: 10Effie Mouzeli) [15:10:48] (03CR) 10Giuseppe Lavagetto: mediawiki::packages: remove libvips (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/720974 (https://phabricator.wikimedia.org/T290759) 
(owner: 10Giuseppe Lavagetto) [15:11:47] (03PS3) 10Arturo Borrero Gonzalez: openstack: bootstrap manila component [puppet] - 10https://gerrit.wikimedia.org/r/721805 (https://phabricator.wikimedia.org/T291257) [15:12:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "the diff in the CI output corresponds to what we expect" [deployment-charts] - 10https://gerrit.wikimedia.org/r/721328 (owner: 10Giuseppe Lavagetto) [15:12:23] (03CR) 10jerkins-bot: [V: 04-1] openstack: bootstrap manila component [puppet] - 10https://gerrit.wikimedia.org/r/721805 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [15:12:34] (03PS1) 10Cathal Mooney: Replacing SSH pub key for mbinder as he rebuilt his laptop. [puppet] - 10https://gerrit.wikimedia.org/r/721853 (https://phabricator.wikimedia.org/T291141) [15:12:36] (03PS1) 10Dzahn: rancid: convert crons to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/721854 (https://phabricator.wikimedia.org/T273673) [15:13:29] (03PS2) 10Elukey: Refactor kubernetes tokens and secrets [labs/private] - 10https://gerrit.wikimedia.org/r/721850 [15:15:17] (03PS1) 10Arturo Borrero Gonzalez: openstack: drop stale README file [puppet] - 10https://gerrit.wikimedia.org/r/721855 [15:16:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: drop stale README file [puppet] - 10https://gerrit.wikimedia.org/r/721855 (owner: 10Arturo Borrero Gonzalez) [15:16:25] (03Merged) 10jenkins-bot: mediawiki: fix longstanding bug with mcrouter cross-dc encryption [deployment-charts] - 10https://gerrit.wikimedia.org/r/721328 (owner: 10Giuseppe Lavagetto) [15:22:12] (03PS2) 10Giuseppe Lavagetto: mediawiki: allow injecting the wmerrors script [deployment-charts] - 10https://gerrit.wikimedia.org/r/721341 (https://phabricator.wikimedia.org/T288851) [15:24:42] (03PS1) 10ZPapierski: Add kafka clusters' brokers to spicerack config [puppet] - 10https://gerrit.wikimedia.org/r/721857 (https://phabricator.wikimedia.org/T276469) [15:24:46] (03PS3) 10Elukey: 
Refactor kubernetes tokens and secrets [labs/private] - 10https://gerrit.wikimedia.org/r/721850 [15:32:00] (03PS22) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [15:32:04] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Output from CI seems correct." [deployment-charts] - 10https://gerrit.wikimedia.org/r/721341 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [15:37:36] (03PS23) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [15:37:45] 10Puppet, 10Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10cmooney) This was completed yesterday evening for all affected hosts, and all are now reporting an LLDP neighbour as exp... 
[15:38:08] (03CR) 10Elukey: [C: 04-1] "Still not ready" [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [15:40:46] (03PS24) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [15:43:14] (03PS25) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [15:44:26] (03PS26) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [15:45:18] (03PS7) 10Effie Mouzeli: common_templates: add support for envoy tcp proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721337 [15:46:19] (03CR) 10Effie Mouzeli: common_templates: add support for envoy tcp proxy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/721337 (owner: 10Effie Mouzeli) [15:47:40] (03CR) 10Elukey: "Joe: not sure if this is what you have in mind, it is still not 100% configurable via hiera (especially deployment_server.pp) but it is su" [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [15:48:32] (03PS3) 10Effie Mouzeli: tegola-vector-tiles: use v0.3 templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/720019 [15:51:54] (03PS4) 10Effie Mouzeli: tegola-vector-tiles: use v0.4 templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/720019 [15:53:45] (03CR) 10Effie Mouzeli: [C: 03+1] thumbor: add a system::role describing it [puppet] - 10https://gerrit.wikimedia.org/r/721818 (owner: 10Dzahn) [15:54:19] (03PS1) 10Ahmon Dancy: releases: Install private data for mwdebug service [puppet] - 10https://gerrit.wikimedia.org/r/721860 (https://phabricator.wikimedia.org/T288629) [15:55:03] (03CR) 10Dzahn: [C: 03+2] 
thumbor: add a system::role describing it [puppet] - 10https://gerrit.wikimedia.org/r/721818 (owner: 10Dzahn) [15:56:12] (03CR) 10Ahmon Dancy: [C: 03+1] releases: Install private data for mwdebug service [puppet] - 10https://gerrit.wikimedia.org/r/721860 (https://phabricator.wikimedia.org/T288629) (owner: 10Ahmon Dancy) [16:02:20] (03PS5) 10Jelto: services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) [16:04:13] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:12] (03CR) 10jerkins-bot: [V: 04-1] services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [16:10:33] (03PS1) 10Effie Mouzeli: fixtures: add fixtures for tcp_services_proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721864 [16:11:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:32] (03CR) 10jerkins-bot: [V: 04-1] fixtures: add fixtures for tcp_services_proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721864 (owner: 10Effie Mouzeli) [16:14:17] (03PS2) 10Effie Mouzeli: fixtures: add fixtures for tcp_services_proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721864 [16:25:16] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:42] 10SRE, 10NavigationTiming, 10Performance-Team, 10Patch-For-Review: Switch to encrypted kafka for coal/navtiming/statsv - https://phabricator.wikimedia.org/T290131 (10dpifke) 
I've enabled the TLS listener in deployment-prep and confirmed the Navtiming patch works. Next up: Coal. [16:42:19] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (10RobH) ` robh@labstore1005:~$ sudo journalctl -S "2021-09-15" | grep "Controller encountered a fatal error and was reset" | cut -d: -f 1 | sort | uniq... [16:45:11] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1022.eqiad.wmnet ` The log can be found in... [16:48:15] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:48:17] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [16:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:15] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1021.eqiad.wmnet ` The log can be found in... 
[16:59:19] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1030's mgmt seems unreachable from icinga - https://phabricator.wikimedia.org/T291237 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson Fixed [17:00:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1022.eqiad.wmnet with reason: REIMAGE [17:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:06] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [17:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:24] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcephosd1022.eqiad.wmnet with reason: REIMAGE [17:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:58] RECOVERY - Host cloudvirt1030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [17:11:32] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1022.eqiad.wmnet'] ` and were **ALL** successful. [17:17:46] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) [17:18:52] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) Moved both of the servers to cloudsw1. cloudcephosd1022 was installed with zero issues and is now set to staged. Cloudcephosd1021 is still having issu... 
[17:33:35] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1021.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudcephosd1021.eqiad.wmnet'] ` [17:53:35] (03CR) 10Dzahn: [C: 03+2] "This is like existing data...so it's fine. Though "profile::kubernetes::deployment_server::services" is not actually a class that exists.." [puppet] - 10https://gerrit.wikimedia.org/r/721860 (https://phabricator.wikimedia.org/T288629) (owner: 10Ahmon Dancy) [18:27:22] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/721853 (https://phabricator.wikimedia.org/T291141) (owner: 10Cathal Mooney) [18:39:07] (03Abandoned) 10Ebernhardson: Include shellcheck on ci slave instances [puppet] - 10https://gerrit.wikimedia.org/r/720402 (owner: 10Ebernhardson) [18:42:47] (03CR) 10Herron: [C: 03+1] Switch remaining MX records to mx2001 [dns] - 10https://gerrit.wikimedia.org/r/721555 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [18:43:09] (03CR) 10Herron: [C: 03+1] Install swaks on mail servers [puppet] - 10https://gerrit.wikimedia.org/r/721809 (owner: 10Muehlenhoff) [18:46:39] (03PS3) 10Herron: slo_dashboard: switch etcd request slo query to recording rule metrics [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/716535 (https://phabricator.wikimedia.org/T289615) [18:48:31] (03CR) 10Herron: slo_dashboard: switch etcd request slo query to recording rule metrics (032 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/716535 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [18:48:35] (03PS4) 10Herron: slo_dashboard: switch etcd request slo query to recording rule metrics [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/716535 (https://phabricator.wikimedia.org/T289615) [18:54:06] (03CR) 10Herron: [V: 03+2 C: 03+2] slo_dashboard: switch etcd request slo 
query to recording rule metrics [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/716535 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [18:58:49] (03PS1) 10Herron: fix missing ' [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/721882 [18:59:39] RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:00:21] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [19:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:22] (03CR) 10Herron: [V: 03+2 C: 03+2] fix missing ' [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/721882 (owner: 10Herron) [19:12:03] (03PS1) 10Herron: Revert "slo_dashboard: switch etcd request slo query to recording rule metrics" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/721836 [19:13:09] (03CR) 10RLazarus: [C: 03+1] Revert "slo_dashboard: switch etcd request slo query to recording rule metrics" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/721836 (owner: 10Herron) [19:14:08] (03PS2) 10Herron: Revert "slo_dashboard: switch etcd request slo query to recording rule metrics" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/721836 [19:14:35] (03PS1) 10Ssingh: durum: add motd ASCII art [puppet] - 10https://gerrit.wikimedia.org/r/721888 [19:16:26] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31132/console" [puppet] - 10https://gerrit.wikimedia.org/r/721888 (owner: 10Ssingh) [19:16:41] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: add motd ASCII art [puppet] - 10https://gerrit.wikimedia.org/r/721888 (owner: 10Ssingh) [19:16:44] (03CR) 10Herron: [V: 03+2 C: 03+2] "that's what I get for deploying towards the end of friday! 
reverting, will revisit next week" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/721836 (owner: 10Herron) [19:17:27] (03PS7) 10Nikki Nikkhoui: Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) [19:19:19] (03CR) 10Nikki Nikkhoui: Helmfile for image suggestion api (0313 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [19:19:50] (03CR) 10jerkins-bot: [V: 04-1] Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [19:40:53] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:41:28] 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (10DMburugu) 05Open→03Resolved @akosiaris I approve the shell access for @mewoph. 
Sorry for the delay
[19:41:32] SRE, SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (DMburugu)
[19:41:41] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:42:07] SRE, SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (DMburugu) Resolved→In progress
[19:48:18] SRE, SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (Urbanecm) a:DMburugu→None Resetting assignee and moving to untriaged to make sure clinic duty sees this.
[19:53:32] (PS1) Effie Mouzeli: tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159)
[19:55:03] (CR) jerkins-bot: [V: -1] tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159) (owner: Effie Mouzeli)
[19:56:09] (PS4) Michael DiPietro: create role to deploy staging instance for quarry [puppet] - https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204)
[19:57:38] (CR) jerkins-bot: [V: -1] create role to deploy staging instance for quarry [puppet] - https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) (owner: Michael DiPietro)
[20:03:37] (PS2) Effie Mouzeli: tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159)
[20:04:28] (PS8) Effie Mouzeli: common_templates: add support for envoy tcp proxy [deployment-charts] - https://gerrit.wikimedia.org/r/721337
[20:15:38] (PS3) Effie Mouzeli: tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159)
[20:17:50] (PS4) Effie Mouzeli: tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159)
[20:21:46] (CR) MarcoAurelio: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/721839 (https://phabricator.wikimedia.org/T289837) (owner: MarcoAurelio)
[21:10:10] (PS8) Nikki Nikkhoui: Helmfile for image suggestion api [deployment-charts] - https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257)
[21:13:19] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:16:08] (PS9) Nikki Nikkhoui: Helmfile for image suggestion api [deployment-charts] - https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257)
[21:19:25] !log legoktm@cumin1001 START - Cookbook sre.dns.netbox
[21:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:35] (CR) Mxn: [C: +1] "Thank you for taking care of this!" [mediawiki-config] - https://gerrit.wikimedia.org/r/721839 (https://phabricator.wikimedia.org/T289837) (owner: MarcoAurelio)
[21:28:01] !log legoktm@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:10] (PS1) MusikAnimal: Enable DisamiguatorNotifications on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/721902 (https://phabricator.wikimedia.org/T291303)
[22:03:27] (PS1) Legoktm: Add LVS for new Shellboxes: media, syntaxhighlight & timeline [puppet] - https://gerrit.wikimedia.org/r/721904 (https://phabricator.wikimedia.org/T289226)
[22:03:30] (PS1) Legoktm: service: Switch new Shellboxes to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/721905 (https://phabricator.wikimedia.org/T289226)
[22:03:34] (PS1) Legoktm: service: Switch new Shellboxes to monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/721906 (https://phabricator.wikimedia.org/T289226)
[22:03:36] (PS1) Legoktm: Add *.svc.{codfw,eqiad}.wmnet entries for new Shellboxes [dns] - https://gerrit.wikimedia.org/r/721908 (https://phabricator.wikimedia.org/T289226)
[22:03:38] (PS1) Legoktm: service: Switch new Shellboxes to production [puppet] - https://gerrit.wikimedia.org/r/721907 (https://phabricator.wikimedia.org/T289226)
[22:03:40] (PS1) Legoktm: Add new Shellboxes to discovery [dns] - https://gerrit.wikimedia.org/r/721909 (https://phabricator.wikimedia.org/T289226)
[23:14:59] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:40:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:42:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:54:45] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:55:51] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down