[00:01:01] <logmsgbot>	 !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-media' for release 'main' .
[00:01:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:38] <logmsgbot>	 !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-media' for release 'main' .
[00:04:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:55] <wikibugs>	 (PS1) Ryan Kemper: elasticsearch: it's ExecStartPre, not ExecPreStart [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198)
[00:07:21] <wikibugs>	 (PS2) Ryan Kemper: elasticsearch: it's ExecStartPre, not ExecPreStart [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198)
[00:08:30] <wikibugs>	 (CR) jerkins-bot: [V: -1] elasticsearch: it's ExecStartPre, not ExecPreStart [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) (owner: Ryan Kemper)
[00:08:43] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) (owner: Ryan Kemper)
[00:09:39] <wikibugs>	 (PS3) Ryan Kemper: elasticsearch: it's ExecStartPre, not ExecPreStart [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198)
[00:14:02] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:29] <wikibugs>	 (PS10) Ryan Kemper: wdqs: Prepare streaming updater settings [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse)
[00:16:33] <wikibugs>	 (PS11) Ryan Kemper: wdqs: Prepare streaming updater settings [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse)
[00:17:31] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) (owner: Ryan Kemper)
[00:19:38] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse)
[00:35:34] <wikibugs>	 (CR) Ryan Kemper: "This should be a no-op, but PCC is failing on wdqs1003:" [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse)
[00:38:38] <wikibugs>	 (PS12) Ryan Kemper: wdqs: Prepare streaming updater settings [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse)
[00:42:39] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:05] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse)
[00:52:07] <wikibugs>	 (CR) Ryan Kemper: "Figured it out, just a small syntax error that had been introduced into hieradata/role/eqiad/wdqs/internal.yaml (now fixed)" [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse)
[00:52:12] <wikibugs>	 (CR) Ryan Kemper: [C: +2] wdqs: Prepare streaming updater settings [puppet] - https://gerrit.wikimedia.org/r/721280 (https://phabricator.wikimedia.org/T288231) (owner: DCausse)
[01:08:17] <wikibugs>	 (PS1) Ryan Kemper: query_service: fix newly broken gc-log-cleanup [puppet] - https://gerrit.wikimedia.org/r/721646
[01:11:22] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/721646 (owner: Ryan Kemper)
[01:20:19] <icinga-wm>	 PROBLEM - Host cloudvirt1030.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:34:57] <wikibugs>	 (PS1) Ryan Kemper: elasticsearch: cleanup absented cron resources [puppet] - https://gerrit.wikimedia.org/r/721647 (https://phabricator.wikimedia.org/T273673)
[01:35:42] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/721647 (https://phabricator.wikimedia.org/T273673) (owner: Ryan Kemper)
[01:40:50] <wikibugs>	 (CR) Ryan Kemper: "@Ebernhardson - tagging you for just the 'default' vs default stuff since it was introduced in this commit of yours https://github.com/wik" [puppet] - https://gerrit.wikimedia.org/r/721647 (https://phabricator.wikimedia.org/T273673) (owner: Ryan Kemper)
[01:42:25] <wikibugs>	 (CR) Ryan Kemper: wdqs: remove codfw hourly restarts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/720102 (https://phabricator.wikimedia.org/T290330) (owner: Ryan Kemper)
[01:42:38] <wikibugs>	 (CR) Ryan Kemper: [C: +2] wdqs: remove codfw hourly restarts [puppet] - https://gerrit.wikimedia.org/r/720102 (https://phabricator.wikimedia.org/T290330) (owner: Ryan Kemper)
[01:43:08] <wikibugs>	 (CR) RLazarus: [C: +1] "Belatedly - looks great, thank you!" [puppet] - https://gerrit.wikimedia.org/r/720817 (https://phabricator.wikimedia.org/T285298) (owner: Ahmon Dancy)
[01:48:00] <ryankemper>	 !log T290330 [Remove WDQS codfw ~hourly restarts] `sudo cumin 'C:query_service::crontasks' 'sudo disable-puppet "Stop doing wdqs codfw ~hourly restarts - T290330"'`
[01:48:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:06] <stashbot>	 T290330: Wikidata Query Service unstable in codfw - https://phabricator.wikimedia.org/T290330
[01:49:44] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2004 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:55:10] <ryankemper>	 !log T290330 [Remove WDQS codfw ~hourly restarts] Testing on arbitrary codfw host: `ryankemper@wdqs2001:~$ sudo run-puppet-agent`
[01:55:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:55:16] <stashbot>	 T290330: Wikidata Query Service unstable in codfw - https://phabricator.wikimedia.org/T290330
[01:56:09] <ryankemper>	 The `wdqs2004` alert is because I made the mistake of merging before running the "disable puppet" command, and `wdqs2004` happened to run puppet in that time period
[01:57:05] <wikibugs>	 (CR) SDineshKumar: [C: +1] elasticsearch: cleanup absented cron resources (2 comments) [puppet] - https://gerrit.wikimedia.org/r/721647 (https://phabricator.wikimedia.org/T273673) (owner: Ryan Kemper)
[01:59:08] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:04:58] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-restart-hourly-w-random-delay.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:18:12] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:21:50] <wikibugs>	 (CR) Cwhite: [C: +1] logstash: make jmx_ params optional [puppet] - https://gerrit.wikimedia.org/r/721370 (owner: Herron)
[02:22:00] <ryankemper>	 !log T290330 [Remove WDQS codfw ~hourly restarts] `wdqs2001` and `wdqs2004` look fine after running `sudo systemctl reset-failed wdqs-restart-hourly-w-random-delay.timer` to clean up dangling timer
[02:22:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:07] <stashbot>	 T290330: Wikidata Query Service unstable in codfw - https://phabricator.wikimedia.org/T290330
[02:25:14] <wikibugs>	 (CR) Cwhite: [C: +1] logstash: add udp output module [puppet] - https://gerrit.wikimedia.org/r/721356 (owner: Herron)
[02:25:40] <wikibugs>	 (CR) Cwhite: [C: +1] logstash::input::gelf: add host param [puppet] - https://gerrit.wikimedia.org/r/721346 (owner: Herron)
[02:28:39] <ryankemper>	 !log T290330 [Remove WDQS codfw ~hourly restarts] Successfully rolled out to rest of fleet `sudo cumin 'C:query_service::crontasks' 'sudo run-puppet-agent --force && sudo systemctl reset-failed wdqs-restart-hourly-w-random-delay.timer'`
[02:28:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:48] <stashbot>	 T290330: Wikidata Query Service unstable in codfw - https://phabricator.wikimedia.org/T290330
[02:29:15] <wikibugs>	 (CR) Cwhite: profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP (2 comments) [puppet] - https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) (owner: Herron)
[02:35:35] <wikibugs>	 (Abandoned) Cwhite: profile, elasticsearch: add option to configure systemd Before= value [puppet] - https://gerrit.wikimedia.org/r/666231 (https://phabricator.wikimedia.org/T275405) (owner: Cwhite)
[03:08:22] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1037 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1872 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[03:10:12] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1037 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[04:42:38] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4636168200 and 48875 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:43:58] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1698401776 and 48955 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:44:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 665290288 and 276 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:44:52] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1388724288 and 49009 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:45:26] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1172109200 and 493 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:45:52] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 516600 and 87 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:46:10] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 216 and 105 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:46:26] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 280 and 122 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:46:46] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 392 and 141 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:47:20] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 192 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:10:36] <wikibugs>	 (CR) Marostegui: "This change was done yesterday (pending codfw)" [software/conftool] - https://gerrit.wikimedia.org/r/708632 (https://phabricator.wikimedia.org/T167973) (owner: RhinosF1)
[05:12:42] <wikibugs>	 (PS1) Marostegui: install_server: Reimage db2103 to Buster [puppet] - https://gerrit.wikimedia.org/r/721652 (https://phabricator.wikimedia.org/T290865)
[05:13:36] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Reimage db2103 to Buster [puppet] - https://gerrit.wikimedia.org/r/721652 (https://phabricator.wikimedia.org/T290865) (owner: Marostegui)
[05:43:46] <wikibugs>	 (PS4) Effie Mouzeli: common_templates: add support for envoy tcp proxy [deployment-charts] - https://gerrit.wikimedia.org/r/721337
[06:01:48] <icinga-wm>	 PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:04:10] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 78029276960 and 1432 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:04:42] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 81727635528 and 1465 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:05:38] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 82965504040 and 1522 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:06:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 85065977400 and 1558 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:06:50] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 87124828744 and 1593 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:33:58] <wikibugs>	 ops-eqiad, cloud-services-team (Kanban): cloudvirt1030's mgmt seems unreachable from icinga - https://phabricator.wikimedia.org/T291237 (elukey)
[06:34:18] <icinga-wm>	 ACKNOWLEDGEMENT - Host cloudvirt1030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Elukey T291237
[06:53:11] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove obsolete java::security [puppet] - https://gerrit.wikimedia.org/r/719261 (https://phabricator.wikimedia.org/T282454) (owner: Muehlenhoff)
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210917T0700)
[07:03:28] <wikibugs>	 (CR) DCausse: [C: +1] "sorry for this typo & thanks for fixing it!" [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) (owner: Ryan Kemper)
[07:05:49] <wikibugs>	 (PS1) Elukey: profile::analytics::httpd: disable CGI [puppet] - https://gerrit.wikimedia.org/r/721755 (https://phabricator.wikimedia.org/T285355)
[07:06:47] <jinxer-wm>	 (Traffic bill over quota) firing: (2) Traffic bill over quota   - https://alerts.wikimedia.org
[07:07:36] <wikibugs>	 (CR) Muehlenhoff: [C: +2] mediawiki: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/721549 (owner: Muehlenhoff)
[07:11:41] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/721618 (owner: Legoktm)
[07:18:24] <wikibugs>	 (CR) Ema: [C: +2] VarnishTrafficDrop: fix site label in summary [alerts] - https://gerrit.wikimedia.org/r/721507 (https://phabricator.wikimedia.org/T291149) (owner: BBlack)
[07:24:39] <wikibugs>	 (CR) Muehlenhoff: [C: +1] elasticsearch: it's ExecStartPre, not ExecPreStart [puppet] - https://gerrit.wikimedia.org/r/721644 (https://phabricator.wikimedia.org/T276198) (owner: Ryan Kemper)
[07:26:47] <jinxer-wm>	 (Traffic bill over quota) resolved: (2) Traffic bill over quota   - https://alerts.wikimedia.org
[07:27:46] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:27:46] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:27:46] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:27:48] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:27:52] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:27:58] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:28:02] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:28:06] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:28:10] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:28:10] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:28:18] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:28:22] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:28:23] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:28:23] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:28:23] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:28:24] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:28:29] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:28:29] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:28:29] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:28:29] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:28:30] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:28:40] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[07:28:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:28:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:28:42] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:28:46] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:28:48] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:28:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.48.228:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein:
[07:28:50] <icinga-wm>	  on connection while downloading http://10.64.48.228:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:28:52] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:28:53] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:28:53] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:28:53] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:28:53] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:28:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:28:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:28:58] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:29:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:29:06] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:29:06] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:29:07] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:29:07] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:29:07] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:29:12] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 3078 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:29:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:29:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:29:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:29:14] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:29:15] <icinga-wm>	 PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200): /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[07:29:16] <icinga-wm>	 PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[07:29:16] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:29:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:29:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:29:24] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:29:24] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:29:24] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:29:28] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[07:29:32] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:29:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:29:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:29:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:29:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:29:44] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase1022.eqiad.wmnet, restbase1025.eqiad.wmnet, restbase1021.eqiad.wmnet, restbase1027.eqiad.wmnet, restbase1018.eqiad.wmnet, restbase1023.eqiad.wmnet, restbase1024.eqiad.wmnet, restbase1016.eqiad.wmnet, restbase1026.eqiad.wmnet, restbase1029.eqiad.wmnet, restbase1030.eqiad.wmnet, restbase1017.eqiad.wmnet, restbase102
[07:29:44] <icinga-wm>	 wmnet are marked down but pooled: parsoid-php_443: Servers wtp1029.eqiad.wmnet, wtp1048.eqiad.wmnet, wtp1039.eqiad.wmnet, wtp1035.eqiad.wmnet, wtp1045.eqiad.wmnet, wtp1031.eqiad.wmnet, wtp1038.eqiad.wmnet, wtp1041.eqiad.wmnet, wtp1032.eqiad.wmnet, wtp1026.eqiad.wmnet, wtp1030.eqiad.wmnet, wtp1028.eqiad.wmnet, wtp1034.eqiad.wmnet, wtp1033.eqiad.wmnet, wtp1025.eqiad.wmnet, wtp1044.eqiad.wmnet, wtp1047.eqiad.wmnet are marked down but pooled 
[07:29:44] <icinga-wm>	 wikitech.wikimedia.org/wiki/PyBal
[07:29:50] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - restbase-https_7443: Servers restbase1025.eqiad.wmnet, restbase1021.eqiad.wmnet, restbase1027.eqiad.wmnet, restbase1018.eqiad.wmnet, restbase1024.eqiad.wmnet, restbase1016.eqiad.wmnet, restbase1026.eqiad.wmnet, restbase1029.eqiad.wmnet, restbase1019.eqiad.wmnet, restbase1023.eqiad.wmnet, restbase1017.eqiad.wmnet, restbase1028.eqiad.wmnet, restbase102
[07:29:50] <icinga-wm>	 wmnet are marked down but pooled: parsoid-php_443: Servers wtp1048.eqiad.wmnet, wtp1044.eqiad.wmnet, wtp1028.eqiad.wmnet, wtp1033.eqiad.wmnet, wtp1025.eqiad.wmnet, wtp1027.eqiad.wmnet, wtp1039.eqiad.wmnet, wtp1040.eqiad.wmnet, wtp1036.eqiad.wmnet, wtp1034.eqiad.wmnet, wtp1032.eqiad.wmnet, wtp1045.eqiad.wmnet, wtp1029.eqiad.wmnet, wtp1037.eqiad.wmnet, wtp1031.eqiad.wmnet, wtp1038.eqiad.wmnet, wtp1046.eqiad.wmnet, wtp1035.eqiad.wmnet, wtp10
[07:29:50] <icinga-wm>	 .wmnet, wtp1041.eqiad.wmnet, wtp1047.eqiad.wmnet, wtp1026.eqiad.wmnet, wtp1030.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:29:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:29:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:29:52] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:30:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.16.173:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein:
[07:30:00] <icinga-wm>	  on connection while downloading http://10.64.16.173:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:30:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Could not fetch url http://10.64.0.208:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: 
[07:30:00] <icinga-wm>	 on connection while downloading http://10.64.0.208:7231/en.wikipedia.org/v1/transform/wikitext/to/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:30:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:30:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:30:06] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[07:30:08] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[07:30:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:30:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:30:13] <elukey>	 whattt
[07:30:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:30:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:30:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:30:16] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:30:16] <icinga-wm>	 PROBLEM - PHP7 rendering on wtp1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:30:40] <wikibugs>	 (03PS3) 10Jelto: services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305)
[07:30:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[07:30:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:31:01] <elukey>	 _joe_ jayme akosiaris --^
[07:31:02] <jayme>	 what is going on here? :o
[07:31:04] <icinga-wm>	 PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[07:31:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:31:14] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1038 is OK: HTTP OK: HTTP/1.1 302 Found - 637 bytes in 9.066 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:31:16] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[07:31:20] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1032 is OK: HTTP OK: HTTP/1.1 302 Found - 651 bytes in 2.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:31:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:31:23] <elukey>	 whatt
[07:31:24] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1039 is OK: HTTP OK: HTTP/1.1 302 Found - 637 bytes in 5.962 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:31:29] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1028 is OK: HTTP OK: HTTP/1.1 302 Found - 637 bytes in 7.597 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:31:30] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1032 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:31:36] <elukey>	 jayme: seemed all wtp-nodes related
[07:31:37] <jayme>	 you jinxed it elukey
[07:31:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:31:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_restbase_cluster_eqiad,swagger_check_restbase_codfw,swagger_check_restbase_eqiad,swagger_check_restbase_esams} site={codfw,eqiad,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:31:48] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[07:31:50] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[07:31:52] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 637 bytes in 9.909 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:31:52] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1034 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:31:54] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:31:54] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1039 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:31:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:31:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:31:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:31:55] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1041 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 0.763 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:31:56] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1035 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 0.853 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:31:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:31:56] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 651 bytes in 1.965 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:31:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:31:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:31:58] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1033 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:31:58] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1047 is OK: HTTP OK: HTTP/1.1 302 Found - 637 bytes in 2.273 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:32:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] services: deploy services with helm3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[07:32:02] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1029 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:32:02] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1046 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:32:02] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1046 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:32:02] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1048 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:32:03] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:32:14] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1029 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:32:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:32:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:32:18] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[07:32:19] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1031 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:32:22] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1044 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:32:24] <_joe_>	 ok I was about to say please someone else deal with it
[07:32:26] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1030 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:32:26] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1034 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:32:26] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1025 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:32:26] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1038 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:32:26] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1028 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:32:27] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:32:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:32:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:32:32] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1027 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:32:38] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:32:39] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1031 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:32:39] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1035 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:32:39] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1027 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:32:39] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1045 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:32:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:32:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:32:46] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1040 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:32:48] <icinga-wm>	 RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[07:32:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:32:49] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1025 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:32:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:32:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:32:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:32:49] <icinga-wm>	 RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[07:32:50] <icinga-wm>	 RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[07:32:54] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 23 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:32:56] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1036 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:32:56] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1033 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:32:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:32:58] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:33:04] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1045 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:33:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:33:12] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1044 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:33:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:33:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:33:16] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1041 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:33:22] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1030 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:33:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:33:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:33:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:33:28] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:33:30] <icinga-wm>	 RECOVERY - PHP7 rendering on wtp1047 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:33:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[07:33:34] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1040 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:33:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:33:36] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:33:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:34:22] <wikibugs>	 (03CR) 10Jelto: services: deploy services with helm3 (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[07:35:55] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[07:47:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/721755 (https://phabricator.wikimedia.org/T285355) (owner: 10Elukey)
[07:49:56] <wikibugs>	 (03PS3) 10Jcrespo: mediabackups: Add mysql grants for mediabackups [puppet] - 10https://gerrit.wikimedia.org/r/712993 (https://phabricator.wikimedia.org/T276442)
[07:53:05] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mediabackups: Add mysql grants for mediabackups [puppet] - 10https://gerrit.wikimedia.org/r/712993 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo)
[07:56:02] <wikibugs>	 (03PS1) 10Marostegui: report_users.sh: Change session binlog to ROW [software] - 10https://gerrit.wikimedia.org/r/721760
[07:56:40] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] report_users.sh: Change session binlog to ROW [software] - 10https://gerrit.wikimedia.org/r/721760 (owner: 10Marostegui)
[07:57:12] <wikibugs>	 (03Merged) 10jenkins-bot: report_users.sh: Change session binlog to ROW [software] - 10https://gerrit.wikimedia.org/r/721760 (owner: 10Marostegui)
[08:00:08] <jayme>	 !log restarting php-fpm on wtp1037 and wtp1030
[08:00:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:24] <icinga-wm>	 RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:11:13] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::analytics::httpd: disable CGI [puppet] - 10https://gerrit.wikimedia.org/r/721755 (https://phabricator.wikimedia.org/T285355) (owner: 10Elukey)
[08:11:58] <elukey>	 jynus: o/ ok to puppet-merge?
[08:12:06] <jynus>	 oops, sorry
[08:12:12] <jynus>	 did it get merged?
[08:12:16] <wikibugs>	 (03PS1) 10Marostegui: report_users: Add m5 proxy IP [software] - 10https://gerrit.wikimedia.org/r/721763
[08:12:23] <jynus>	 but yes, it is a noop
[08:12:46] <elukey>	 ack proceeding :)
[08:12:55] <jynus>	 I thought I only +2d it and I was waiting for CI
[08:13:02] <jynus>	 sorry
[08:13:10] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] report_users: Add m5 proxy IP [software] - 10https://gerrit.wikimedia.org/r/721763 (owner: 10Marostegui)
[08:13:16] <elukey>	 jynus: no problem at all!
[08:14:02] <wikibugs>	 (03Merged) 10jenkins-bot: report_users: Add m5 proxy IP [software] - 10https://gerrit.wikimedia.org/r/721763 (owner: 10Marostegui)
[08:50:42] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10User-jijiki: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (10elukey) To keep archives happy - we currently don't deploy `systemd-coredump` on our hosts (because of the reasons highlighted above), so the dumps are not...
[08:52:10] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10User-jijiki: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (10elukey)
[08:56:06] <wikibugs>	 (03PS5) 10Effie Mouzeli: common_templates: add support for envoy tcp proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721337
[08:57:05] <wikibugs>	 (03PS6) 10Effie Mouzeli: common_templates: add support for envoy tcp proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721337
[09:19:00] <logmsgbot>	 !log milimetric@deploy1002 Started deploy [analytics/refinery@37e904a]: Only syncing sanitize allowlist
[09:19:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:48] <icinga-wm>	 PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[09:29:52] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[09:29:52] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[09:30:16] <mvolz>	 Hi there, I've noticed a big/weird increase in traffic to citoid starting this Tuesday. https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=5m&from=now-7d&to=now
[09:30:43] <mvolz>	 Notably we're getting a lot of requests for the "zotero" format which to my knowledge is not used by us at all. 
[09:31:21] <mvolz>	 So I suspect this big increase is probably foreign traffic. Might be worth checking out ahead of time though it hasn't broken anything yet (as far as I know!)
[09:31:38] <mvolz>	 In case it is or will be a problem
[09:36:43] <logmsgbot>	 !log milimetric@deploy1002 Finished deploy [analytics/refinery@37e904a]: Only syncing sanitize allowlist (duration: 17m 43s)
[09:36:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:20] <icinga-wm>	 RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[09:37:23] <logmsgbot>	 !log milimetric@deploy1002 Started deploy [analytics/refinery@37e904a] (thin): Only syncing sanitize allowlist, deploying THIN for consistency
[09:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:26] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[09:37:30] <logmsgbot>	 !log milimetric@deploy1002 Finished deploy [analytics/refinery@37e904a] (thin): Only syncing sanitize allowlist, deploying THIN for consistency (duration: 00m 07s)
[09:37:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:16] <wikibugs>	 (03PS2) 10Tobias Andersson: miscweb: Add CSP headers for query builder [puppet] - 10https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: 10Ladsgroup)
[09:39:16] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[09:39:20] <wikibugs>	 SRE, SRE Observability, Traffic: VarnishTrafficDrop IRC alert does not include DC name anymore - https://phabricator.wikimedia.org/T291149 (ema) Open→Resolved a:ema >>! In T291149#7361022, @gerritbot wrote: > %%%[operations/alerts@master] VarnishTrafficDrop: fix site label in summary%%% >...
[09:39:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] miscweb: Add CSP headers for query builder [puppet] - https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: Ladsgroup)
[09:40:49] <wikibugs>	 SRE, SRE-Access-Requests: Updating mbinder's keys for phabricator-bulk-manager - https://phabricator.wikimedia.org/T291141 (cmooney) Thanks @MBinder_WMF, I've reached out on Slack as well to verify this via another channel (can't be too careful).  Reply back there whenever you've a moment.  Disappointed...
[09:45:20] <wikibugs>	 (PS3) Tobias Andersson: miscweb: Add CSP headers for query builder [puppet] - https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: Ladsgroup)
[09:45:48] <wikibugs>	 (PS1) Elukey: mtail: add counter for kernel traps [puppet] - https://gerrit.wikimedia.org/r/721773 (https://phabricator.wikimedia.org/T246470)
[09:47:27] <wikibugs>	 (CR) Tobias Andersson: [C: +1] miscweb: Add CSP headers for query builder [puppet] - https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: Ladsgroup)
[09:51:32] <majavah>	 mvolz: the services dc switchover was on Tuesday, if you look at https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=5m&from=now-7d&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=citoid you can see how the request volume switched over
[09:57:44] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "I’m getting CSP errors when trying this:" [puppet] - https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: Ladsgroup)
[09:59:39] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Hardware): cloudvirt1030's mgmt seems unreachable from icinga - https://phabricator.wikimedia.org/T291237 (aborrero)
[10:00:17] <wikibugs>	 SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto) My current plan for cluster-wise migration and deploying services with helm3 is:   * make sure cluster is depooled  * delete helm releases for all services  * remove tiller compon...
[10:00:42] <wikibugs>	 SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[10:01:43] <wikibugs>	 SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[10:06:58] <wikibugs>	 SRE, SRE Observability, Patch-For-Review: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (ema) >>! In T290870#7351871, @fgiunchedi wrote: > I think either approach will work as a bandaid, intuitively a rsyslog-native solution seems better to me (assumi...
[10:07:12] <wikibugs>	 SRE, SRE Observability, Patch-For-Review: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (ema) p:Triage→Medium
[10:07:56] <wikibugs>	 (Abandoned) Ema: rsyslog: config sanity check as systemd override [puppet] - https://gerrit.wikimedia.org/r/720913 (https://phabricator.wikimedia.org/T290870) (owner: Ema)
[10:18:37] <wikibugs>	 (PS7) Muehlenhoff: os-updates-report: Adapt to new OS tracking (WIP) [puppet] - https://gerrit.wikimedia.org/r/707371
[10:27:28] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "The patch seems correct, but given how convoluted envoy configs are:" [deployment-charts] - https://gerrit.wikimedia.org/r/721337 (owner: Effie Mouzeli)
[10:27:59] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/707371 (owner: Muehlenhoff)
[10:32:58] <icinga-wm>	 RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[10:36:06] <wikibugs>	 (CR) Tobias Andersson: [C: -1] miscweb: Add CSP headers for query builder (1 comment) [puppet] - https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: Ladsgroup)
[10:39:33] <wikibugs>	 (PS4) Tobias Andersson: miscweb: Add CSP headers for query builder [puppet] - https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: Ladsgroup)
[10:40:52] <wikibugs>	 (PS8) Muehlenhoff: os-updates-report: Adapt to new OS tracking [puppet] - https://gerrit.wikimedia.org/r/707371
[10:44:59] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence-Backup, media-backups, and 2 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (jcrespo) We reached at some points (with the dc depooled, during night) peaks of 150 files/s, but it got as low as 6 files/s fo...
[10:48:02] <wikibugs>	 (PS10) Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771)
[10:51:41] <wikibugs>	 (CR) Jgiannelos: [C: +2] Configure event stream for map tile state change [mediawiki-config] - https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: Jgiannelos)
[10:54:30] <wikibugs>	 (Merged) jenkins-bot: Configure event stream for map tile state change [mediawiki-config] - https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: Jgiannelos)
[11:00:05] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[11:08:12] <wikibugs>	 (CR) Muehlenhoff: [C: +2] os-updates-report: Adapt to new OS tracking [puppet] - https://gerrit.wikimedia.org/r/707371 (owner: Muehlenhoff)
[11:08:57] <icinga-wm>	 ACKNOWLEDGEMENT - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 524537401864 and 19629 seconds Hnowlan Replication broken by planet resync - resync required. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[11:08:57] <icinga-wm>	 ACKNOWLEDGEMENT - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 524394009144 and 19614 seconds Hnowlan Replication broken by planet resync - resync required. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[11:08:57] <icinga-wm>	 ACKNOWLEDGEMENT - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 526308446784 and 19684 seconds Hnowlan Replication broken by planet resync - resync required. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[11:08:57] <icinga-wm>	 ACKNOWLEDGEMENT - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 524394009144 and 19640 seconds Hnowlan Replication broken by planet resync - resync required. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[11:08:57] <icinga-wm>	 ACKNOWLEDGEMENT - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 524494148128 and 19694 seconds Hnowlan Replication broken by planet resync - resync required. https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[11:09:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:09:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:01] <wikibugs>	 (PS1) Muehlenhoff: os tracking: Commit missíng data file [puppet] - https://gerrit.wikimedia.org/r/721788
[11:13:15] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[11:14:00] <wikibugs>	 (CR) Muehlenhoff: [C: +2] os tracking: Commit missíng data file [puppet] - https://gerrit.wikimedia.org/r/721788 (owner: Muehlenhoff)
[11:14:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:14:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:07] <icinga-wm>	 RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin1001 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[11:28:13] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[11:28:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:59] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): miscweb: Add CSP headers for query builder (1 comment) [puppet] - https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: Ladsgroup)
[11:34:27] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] miscweb: Add CSP headers for query builder [puppet] - https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: Ladsgroup)
[12:05:48] <wikibugs>	 (PS1) Muehlenhoff: os-reports: Small followup fixes [puppet] - https://gerrit.wikimedia.org/r/721804
[12:06:58] <wikibugs>	 SRE, serviceops: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (MoritzMuehlenhoff) Ack, I'll upload to apt.wikimedia.org on Monday.
[12:12:52] <wikibugs>	 (CR) Muehlenhoff: [C: +2] os-reports: Small followup fixes [puppet] - https://gerrit.wikimedia.org/r/721804 (owner: Muehlenhoff)
[12:13:55] <icinga-wm>	 RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2001 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[12:31:19] <wikibugs>	 SRE, SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137 (Nikola_Smolenski) If it is of any help, when I try to open the file with Adobe's acroread, it reports "stat buffer overflow" error, while evince opens it ni...
[12:32:22] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: bootstrap manila component [puppet] - https://gerrit.wikimedia.org/r/721805 (https://phabricator.wikimedia.org/T291257)
[12:44:00] <wikibugs>	 (PS4) Jelto: services: deploy services with helm3 [deployment-charts] - https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305)
[12:45:46] <wikibugs>	 (CR) jerkins-bot: [V: -1] services: deploy services with helm3 [deployment-charts] - https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[12:46:01] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[12:47:35] <wikibugs>	 (CR) Tobias Andersson: [C: +1] miscweb: Add CSP headers for query builder [puppet] - https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: Ladsgroup)
[12:48:22] <wikibugs>	 (PS1) Muehlenhoff: Install swaks on mail servers [puppet] - https://gerrit.wikimedia.org/r/721809
[12:48:36] <wikibugs>	 (PS3) Muehlenhoff: Switch remaining MX records to mx2001 [dns] - https://gerrit.wikimedia.org/r/721555 (https://phabricator.wikimedia.org/T286911)
[12:50:10] <wikibugs>	 (CR) Ladsgroup: [C: +1] "I will hopefully get it deployed soon" [puppet] - https://gerrit.wikimedia.org/r/708463 (https://phabricator.wikimedia.org/T285761) (owner: Ladsgroup)
[12:52:20] <wikibugs>	 (CR) Hashar: [C: -1] "That would install the package on the Jenkins agent, however we no more run commands directly from the agent. Instead the execution enviro" [puppet] - https://gerrit.wikimedia.org/r/720402 (owner: Ebernhardson)
[12:57:27] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:06:22] <moritzm>	 !log installing 4.9.272 kernels on stretch hosts (no reboots yet)
[13:06:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:22] <wikibugs>	 (PS1) Ladsgroup: snapshot: Change URL of xmldatadumps-l from mailman2 to mailman3 [puppet] - https://gerrit.wikimedia.org/r/721811 (https://phabricator.wikimedia.org/T282303)
[13:48:11] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org - https://phabricator.wikimedia.org/T275234 (Perryprog) This issue has been reported on Znuny on Ticket#2021091710000804, so it's safe to assume the problem is still pres...
[13:59:31] <wikibugs>	 (CR) Dzahn: [C: +2] "This also does things like git pulling deployment-charts etc..hmm.. not just a few files.. but testing it on the inactive server." [puppet] - https://gerrit.wikimedia.org/r/721615 (https://phabricator.wikimedia.org/T288629) (owner: Ahmon Dancy)
[14:03:37] <wikibugs>	 (CR) Dzahn: "deployed, did not see issues with it" [puppet] - https://gerrit.wikimedia.org/r/721615 (https://phabricator.wikimedia.org/T288629) (owner: Ahmon Dancy)
[14:12:54] <wikibugs>	 (PS1) Dzahn: thumbor: add a system::role describing it [puppet] - https://gerrit.wikimedia.org/r/721818
[14:13:46] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/31129/thumbor1003.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[14:16:53] <wikibugs>	 (PS3) Hnowlan: api-gateway: allow /staging/ testing namespace only in staging [deployment-charts] - https://gerrit.wikimedia.org/r/715467 (https://phabricator.wikimedia.org/T289583)
[14:21:10] <wikibugs>	 SRE, Performance-Team, Thumbor, serviceops, User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (Krinkle) Moving back for re-triage as it's been dormant 6 months in a column for things "this quarter".
[14:21:17] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1003 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor_process_age_statsd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:21:41] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor_process_age_statsd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:19] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor_process_age_statsd.service daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/719543 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:19] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on thumbor1003 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor_process_age_statsd.service daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/719543 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:19] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on thumbor2002 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor_process_age_statsd.service daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/719543 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:24:32] <wikibugs>	 (PS1) Dzahn: thumbor: fix location of generate-thumbor-age-metrics-nc.sh [puppet] - https://gerrit.wikimedia.org/r/721823
[14:25:37] <wikibugs>	 (CR) Dzahn: [C: +2] thumbor: fix location of generate-thumbor-age-metrics-nc.sh [puppet] - https://gerrit.wikimedia.org/r/721823 (owner: Dzahn)
[14:26:09] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1004 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor_process_age_statsd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:26:25] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1001 is CRITICAL: CRITICAL - degraded: The following units failed: thumbor_process_age_statsd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:27:23] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:28:03] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:28:21] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:28:53] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:29:20] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
[14:29:21] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 2856 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:29:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:47] <wikibugs>	 (CR) Elukey: [C: -1] "Had a very nice conversation with Joe about pros and cons of this approach, and a good suggestion that came up was to avoid duplicating hi" [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[14:42:32] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org - https://phabricator.wikimedia.org/T275234 (cmooney) Recent email exchange on this as they contacted us again:  > From: Cathal Mooney [mailto:cmooney@wikimedia.org] > Se...
[14:43:38] <wikibugs>	 (CR) Dzahn: "needed https://gerrit.wikimedia.org/r/c/operations/puppet/+/721823 but is ok now" [puppet] - https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[14:44:07] <wikibugs>	 (PS2) Dzahn: thumbor: remove absented cron code for generate-thumbor-age-metrics [puppet] - https://gerrit.wikimedia.org/r/721589 (https://phabricator.wikimedia.org/T273673)
[14:47:59] <wikibugs>	 (CR) Dzahn: [C: +2] thumbor: remove absented cron code for generate-thumbor-age-metrics [puppet] - https://gerrit.wikimedia.org/r/721589 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[14:49:57] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[14:50:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:21] <wikibugs>	 (PS6) Dzahn: swift: convert dispersion stats cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673)
[14:50:57] <wikibugs>	 (CR) Ahmon Dancy: "Thanks Daniel Z!" [puppet] - https://gerrit.wikimedia.org/r/721615 (https://phabricator.wikimedia.org/T288629) (owner: Ahmon Dancy)
[14:51:09] <wikibugs>	 (CR) Dzahn: "friendly reminder this still needs deployment" [puppet] - https://gerrit.wikimedia.org/r/715597 (https://phabricator.wikimedia.org/T288806) (owner: Zabe)
[14:51:27] <wikibugs>	 (CR) Hnowlan: apt::package_from_component: add update condition for multiple packages (1 comment) [puppet] - https://gerrit.wikimedia.org/r/721275 (owner: Hnowlan)
[14:56:54] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: openstack: bootstrap manila component [puppet] - https://gerrit.wikimedia.org/r/721805 (https://phabricator.wikimedia.org/T291257)
[14:56:56] <wikibugs>	 (PS1) Btullis: Add temporary rsync modules to two Cassandra nodes [puppet] - https://gerrit.wikimedia.org/r/721849 (https://phabricator.wikimedia.org/T249755)
[14:57:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: bootstrap manila component [puppet] - https://gerrit.wikimedia.org/r/721805 (https://phabricator.wikimedia.org/T291257) (owner: Arturo Borrero Gonzalez)
[14:58:40] <wikibugs>	 (PS1) Elukey: Refactor kubernetes tokens and secrets [labs/private] - https://gerrit.wikimedia.org/r/721850
[15:01:07] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31130/console" [puppet] - https://gerrit.wikimedia.org/r/721849 (https://phabricator.wikimedia.org/T249755) (owner: Btullis)
[15:06:02] <wikibugs>	 SRE, SRE-Access-Requests: Updating mbinder's keys for phabricator-bulk-manager - https://phabricator.wikimedia.org/T291141 (cmooney) p:Triage→Medium a:cmooney
[15:09:40] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] scaffold: add more options for PHP [deployment-charts] - https://gerrit.wikimedia.org/r/719973 (owner: Effie Mouzeli)
[15:10:48] <wikibugs>	 (CR) Giuseppe Lavagetto: mediawiki::packages: remove libvips (1 comment) [puppet] - https://gerrit.wikimedia.org/r/720974 (https://phabricator.wikimedia.org/T290759) (owner: Giuseppe Lavagetto)
[15:11:47] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: openstack: bootstrap manila component [puppet] - https://gerrit.wikimedia.org/r/721805 (https://phabricator.wikimedia.org/T291257)
[15:12:20] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] "the diff in the CI output corresponds to what we expect" [deployment-charts] - https://gerrit.wikimedia.org/r/721328 (owner: Giuseppe Lavagetto)
[15:12:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: bootstrap manila component [puppet] - https://gerrit.wikimedia.org/r/721805 (https://phabricator.wikimedia.org/T291257) (owner: Arturo Borrero Gonzalez)
[15:12:34] <wikibugs>	 (PS1) Cathal Mooney: Replacing SSH pub key for mbinder as he rebuilt his laptop. [puppet] - https://gerrit.wikimedia.org/r/721853 (https://phabricator.wikimedia.org/T291141)
[15:12:36] <wikibugs>	 (PS1) Dzahn: rancid: convert crons to systemd timers [puppet] - https://gerrit.wikimedia.org/r/721854 (https://phabricator.wikimedia.org/T273673)
[15:13:29] <wikibugs>	 (PS2) Elukey: Refactor kubernetes tokens and secrets [labs/private] - https://gerrit.wikimedia.org/r/721850
[15:15:17] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: drop stale README file [puppet] - https://gerrit.wikimedia.org/r/721855
[15:16:12] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] openstack: drop stale README file [puppet] - https://gerrit.wikimedia.org/r/721855 (owner: Arturo Borrero Gonzalez)
[15:16:25] <wikibugs>	 (Merged) jenkins-bot: mediawiki: fix longstanding bug with mcrouter cross-dc encryption [deployment-charts] - https://gerrit.wikimedia.org/r/721328 (owner: Giuseppe Lavagetto)
[15:22:12] <wikibugs>	 (PS2) Giuseppe Lavagetto: mediawiki: allow injecting the wmerrors script [deployment-charts] - https://gerrit.wikimedia.org/r/721341 (https://phabricator.wikimedia.org/T288851)
[15:24:42] <wikibugs>	 (PS1) ZPapierski: Add kafka clusters' brokers to spicerack config [puppet] - https://gerrit.wikimedia.org/r/721857 (https://phabricator.wikimedia.org/T276469)
[15:24:46] <wikibugs>	 (PS3) Elukey: Refactor kubernetes tokens and secrets [labs/private] - https://gerrit.wikimedia.org/r/721850
[15:32:00] <wikibugs>	 (PS22) Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791)
[15:32:04] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "Output from CI seems correct." [deployment-charts] - https://gerrit.wikimedia.org/r/721341 (https://phabricator.wikimedia.org/T288851) (owner: Giuseppe Lavagetto)
[15:37:36] <wikibugs>	 (PS23) Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791)
[15:37:45] <wikibugs>	 Puppet, Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (cmooney) This was completed yesterday evening for all affected hosts, and all are now reporting an LLDP neighbour as exp...
[15:38:08] <wikibugs>	 (CR) Elukey: [C: -1] "Still not ready" [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[15:40:46] <wikibugs>	 (PS24) Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791)
[15:43:14] <wikibugs>	 (PS25) Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791)
[15:44:26] <wikibugs>	 (PS26) Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791)
[15:45:18] <wikibugs>	 (PS7) Effie Mouzeli: common_templates: add support for envoy tcp proxy [deployment-charts] - https://gerrit.wikimedia.org/r/721337
[15:46:19] <wikibugs>	 (CR) Effie Mouzeli: common_templates: add support for envoy tcp proxy (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/721337 (owner: Effie Mouzeli)
[15:47:40] <wikibugs>	 (CR) Elukey: "Joe: not sure if this is what you have in mind, it is still not 100% configurable via hiera (especially deployment_server.pp) but it is su" [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[15:48:32] <wikibugs>	 (PS3) Effie Mouzeli: tegola-vector-tiles: use v0.3 templates [deployment-charts] - https://gerrit.wikimedia.org/r/720019
[15:51:54] <wikibugs>	 (PS4) Effie Mouzeli: tegola-vector-tiles: use v0.4 templates [deployment-charts] - https://gerrit.wikimedia.org/r/720019
[15:53:45] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] thumbor: add a system::role describing it [puppet] - https://gerrit.wikimedia.org/r/721818 (owner: Dzahn)
[15:54:19] <wikibugs>	 (PS1) Ahmon Dancy: releases: Install private data for mwdebug service [puppet] - https://gerrit.wikimedia.org/r/721860 (https://phabricator.wikimedia.org/T288629)
[15:55:03] <wikibugs>	 (CR) Dzahn: [C: +2] thumbor: add a system::role describing it [puppet] - https://gerrit.wikimedia.org/r/721818 (owner: Dzahn)
[15:56:12] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] releases: Install private data for mwdebug service [puppet] - https://gerrit.wikimedia.org/r/721860 (https://phabricator.wikimedia.org/T288629) (owner: Ahmon Dancy)
[16:02:20] <wikibugs>	 (PS5) Jelto: services: deploy services with helm3 [deployment-charts] - https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305)
[16:04:13] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[16:04:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:12] <wikibugs>	 (CR) jerkins-bot: [V: -1] services: deploy services with helm3 [deployment-charts] - https://gerrit.wikimedia.org/r/721301 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[16:10:33] <wikibugs>	 (PS1) Effie Mouzeli: fixtures: add fixtures for tcp_services_proxy [deployment-charts] - https://gerrit.wikimedia.org/r/721864
[16:11:57] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:12:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:32] <wikibugs>	 (CR) jerkins-bot: [V: -1] fixtures: add fixtures for tcp_services_proxy [deployment-charts] - https://gerrit.wikimedia.org/r/721864 (owner: Effie Mouzeli)
[16:14:17] <wikibugs>	 (PS2) Effie Mouzeli: fixtures: add fixtures for tcp_services_proxy [deployment-charts] - https://gerrit.wikimedia.org/r/721864
[16:25:16] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[16:25:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:44] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:27:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:42] <wikibugs>	 SRE, NavigationTiming, Performance-Team, Patch-For-Review: Switch to encrypted kafka for coal/navtiming/statsv - https://phabricator.wikimedia.org/T290131 (dpifke) I've enabled the TLS listener in deployment-prep and confirmed the Navtiming patch works.  Next up: Coal.
[16:42:19] <wikibugs>	 SRE, ops-eqiad, DC-Ops: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet - https://phabricator.wikimedia.org/T290318 (RobH) ` robh@labstore1005:~$ sudo journalctl -S "2021-09-15" | grep "Controller encountered a fatal error and was reset" | cut -d: -f 1 | sort | uniq...
[16:45:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1022.eqiad.wmnet ` The log can be found in...
[16:48:15] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:48:17] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
[16:48:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1021.eqiad.wmnet ` The log can be found in...
[16:59:19] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Hardware): cloudvirt1030's mgmt seems unreachable from icinga - https://phabricator.wikimedia.org/T291237 (Cmjohnson) Open→Resolved a:Cmjohnson Fixed
[17:00:11] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1022.eqiad.wmnet with reason: REIMAGE
[17:00:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:06] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[17:02:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:24] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcephosd1022.eqiad.wmnet with reason: REIMAGE
[17:02:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:58] <icinga-wm>	 RECOVERY - Host cloudvirt1030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms
[17:11:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1022.eqiad.wmnet'] `  and were **ALL** successful.
[17:17:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson)
[17:18:52] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) Moved both of the servers to cloudsw1.  cloudcephosd1022 was installed with zero issues and is now set to staged. Cloudcephosd1021 is still having issu...
[17:33:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1021.eqiad.wmnet'] `  Of which those **FAILED**: ` ['cloudcephosd1021.eqiad.wmnet'] `
[17:53:35] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "This is like existing data...so it's fine. Though "profile::kubernetes::deployment_server::services" is not actually a class that exists.." [puppet] - 10https://gerrit.wikimedia.org/r/721860 (https://phabricator.wikimedia.org/T288629) (owner: 10Ahmon Dancy)
[18:27:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/721853 (https://phabricator.wikimedia.org/T291141) (owner: 10Cathal Mooney)
[18:39:07] <wikibugs>	 (03Abandoned) 10Ebernhardson: Include shellcheck on ci slave instances [puppet] - 10https://gerrit.wikimedia.org/r/720402 (owner: 10Ebernhardson)
[18:42:47] <wikibugs>	 (03CR) 10Herron: [C: 03+1] Switch remaining MX records to mx2001 [dns] - 10https://gerrit.wikimedia.org/r/721555 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff)
[18:43:09] <wikibugs>	 (03CR) 10Herron: [C: 03+1] Install swaks on mail servers [puppet] - 10https://gerrit.wikimedia.org/r/721809 (owner: 10Muehlenhoff)
[18:46:39] <wikibugs>	 (03PS3) 10Herron: slo_dashboard: switch etcd request slo query to recording rule metrics [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/716535 (https://phabricator.wikimedia.org/T289615)
[18:48:31] <wikibugs>	 (03CR) 10Herron: slo_dashboard: switch etcd request slo query to recording rule metrics (032 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/716535 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron)
[18:48:35] <wikibugs>	 (03PS4) 10Herron: slo_dashboard: switch etcd request slo query to recording rule metrics [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/716535 (https://phabricator.wikimedia.org/T289615)
[18:54:06] <wikibugs>	 (03CR) 10Herron: [V: 03+2 C: 03+2] slo_dashboard: switch etcd request slo query to recording rule metrics [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/716535 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron)
[18:58:49] <wikibugs>	 (03PS1) 10Herron: fix missing ' [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/721882
[18:59:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:00:21] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
[19:00:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:22] <wikibugs>	 (03CR) 10Herron: [V: 03+2 C: 03+2] fix missing ' [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/721882 (owner: 10Herron)
[19:12:03] <wikibugs>	 (03PS1) 10Herron: Revert "slo_dashboard: switch etcd request slo query to recording rule metrics" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/721836
[19:13:09] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] Revert "slo_dashboard: switch etcd request slo query to recording rule metrics" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/721836 (owner: 10Herron)
[19:14:08] <wikibugs>	 (03PS2) 10Herron: Revert "slo_dashboard: switch etcd request slo query to recording rule metrics" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/721836
[19:14:35] <wikibugs>	 (03PS1) 10Ssingh: durum: add motd ASCII art [puppet] - 10https://gerrit.wikimedia.org/r/721888
[19:16:26] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31132/console" [puppet] - 10https://gerrit.wikimedia.org/r/721888 (owner: 10Ssingh)
[19:16:41] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: add motd ASCII art [puppet] - 10https://gerrit.wikimedia.org/r/721888 (owner: 10Ssingh)
[19:16:44] <wikibugs>	 (03CR) 10Herron: [V: 03+2 C: 03+2] "that's what I get for deploying towards the end of friday! reverting, will revisit next week" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/721836 (owner: 10Herron)
[19:17:27] <wikibugs>	 (03PS7) 10Nikki Nikkhoui: Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257)
[19:19:19] <wikibugs>	 (03CR) 10Nikki Nikkhoui: Helmfile for image suggestion api (0313 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui)
[19:19:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui)
[19:40:53] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:41:28] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (10DMburugu) 05Open→03Resolved @akosiaris I approve the shell access for @mewoph. Sorry for the delay
[19:41:32] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (10DMburugu)
[19:41:41] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:42:07] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (10DMburugu) 05Resolved→03In progress
[19:48:18] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (10Urbanecm) a:05DMburugu→03None Resetting assignee and moving to untriaged to make sure clinic duty sees this.
[19:53:32] <wikibugs>	 (03PS1) 10Effie Mouzeli: tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159)
[19:55:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159) (owner: 10Effie Mouzeli)
[19:56:09] <wikibugs>	 (03PS4) 10Michael DiPietro: create role to deploy staging instance for quarry [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204)
[19:57:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] create role to deploy staging instance for quarry [puppet] - 10https://gerrit.wikimedia.org/r/721585 (https://phabricator.wikimedia.org/T291204) (owner: 10Michael DiPietro)
[20:03:37] <wikibugs>	 (03PS2) 10Effie Mouzeli: tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159)
[20:04:28] <wikibugs>	 (03PS8) 10Effie Mouzeli: common_templates: add support for envoy tcp proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/721337
[20:15:38] <wikibugs>	 (03PS3) 10Effie Mouzeli: tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159)
[20:17:50] <wikibugs>	 (03PS4) 10Effie Mouzeli: tegola-version-tiles: enable tcp load balancer for postgres [deployment-charts] - 10https://gerrit.wikimedia.org/r/721894 (https://phabricator.wikimedia.org/T283159)
[20:21:46] <wikibugs>	 (03CR) 10MarcoAurelio: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721839 (https://phabricator.wikimedia.org/T289837) (owner: 10MarcoAurelio)
[21:10:10] <wikibugs>	 (03PS8) 10Nikki Nikkhoui: Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257)
[21:13:19] <icinga-wm>	 PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:16:08] <wikibugs>	 (03PS9) 10Nikki Nikkhoui: Helmfile for image suggestion api [deployment-charts] - 10https://gerrit.wikimedia.org/r/697733 (https://phabricator.wikimedia.org/T281257)
[21:19:25] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.dns.netbox
[21:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:35] <wikibugs>	 (03CR) 10Mxn: [C: 03+1] "Thank you for taking care of this!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721839 (https://phabricator.wikimedia.org/T289837) (owner: 10MarcoAurelio)
[21:28:01] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:28:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:10] <wikibugs>	 (03PS1) 10MusikAnimal: Enable DisamiguatorNotifications on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721902 (https://phabricator.wikimedia.org/T291303)
[22:03:27] <wikibugs>	 (03PS1) 10Legoktm: Add LVS for new Shellboxes: media, syntaxhighlight & timeline [puppet] - 10https://gerrit.wikimedia.org/r/721904 (https://phabricator.wikimedia.org/T289226)
[22:03:30] <wikibugs>	 (03PS1) 10Legoktm: service: Switch new Shellboxes to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/721905 (https://phabricator.wikimedia.org/T289226)
[22:03:34] <wikibugs>	 (03PS1) 10Legoktm: service: Switch new Shellboxes to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/721906 (https://phabricator.wikimedia.org/T289226)
[22:03:36] <wikibugs>	 (03PS1) 10Legoktm: Add *.svc.{codfw,eqiad}.wmnet entries for new Shellboxes [dns] - 10https://gerrit.wikimedia.org/r/721908 (https://phabricator.wikimedia.org/T289226)
[22:03:38] <wikibugs>	 (03PS1) 10Legoktm: service: Switch new Shellboxes to production [puppet] - 10https://gerrit.wikimedia.org/r/721907 (https://phabricator.wikimedia.org/T289226)
[22:03:40] <wikibugs>	 (03PS1) 10Legoktm: Add new Shellboxes to discovery [dns] - 10https://gerrit.wikimedia.org/r/721909 (https://phabricator.wikimedia.org/T289226)
[23:14:59] <icinga-wm>	 RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:40:19] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:42:15] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:54:45] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:55:51] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down