[00:02:21] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:27] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:06:19] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:12:03] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:15:51] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:21:31] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the mos
[00:21:31] <icinga-wm>	 rticles for January 1, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:25:05] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:25:21] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:33:07] <icinga-wm>	 RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:34:55] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting:
[00:34:55] <icinga-wm>	 {domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:36:47] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:42:31] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned th
[00:42:31] <icinga-wm>	 cted status 503 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:46:19] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:51:59] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featu
[00:51:59] <icinga-wm>	 e data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:56:19] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:58:13] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:01:11] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:02:57] <wikibugs>	 Puppet, Continuous-Integration-Infrastructure, Infrastructure-Foundations: Puppet failure on integration-puppetmaster-02.integration.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T290422 (Andrew) Strange -- after I made my changes last week I doublechecked that puppet was working properl...
[01:16:45] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:24:21] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:25:53] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:30:07] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:33:57] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:39:41] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:41:33] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:47:15] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featu
[01:47:15] <icinga-wm>	 e data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:49:07] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:54:51] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:56:45] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:57:35] <wikibugs>	 Puppet, Continuous-Integration-Infrastructure, Infrastructure-Foundations: Puppet failure on integration-puppetmaster-02.integration.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T290422 (Andrew) Open→Resolved a:Andrew I still don't know what this was but I regenerated all th...
[02:02:01] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:21] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.37.0-wmf.22 [core] (wmf/1.37.0-wmf.22) - https://gerrit.wikimedia.org/r/719158
[02:07:25] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.37.0-wmf.22 [core] (wmf/1.37.0-wmf.22) - https://gerrit.wikimedia.org/r/719158 (owner: TrainBranchBot)
[02:08:15] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:10:09] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:21:33] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 ret
[02:21:33] <icinga-wm>	 e unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read article
[02:21:33] <icinga-wm>	 nuary 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:24:03] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.37.0-wmf.22 [core] (wmf/1.37.0-wmf.22) - https://gerrit.wikimedia.org/r/719158 (owner: TrainBranchBot)
[02:26:43] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:36:51] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:50:11] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:52:03] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[02:59:41] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200): /{doma
[02:59:41] <icinga-wm>	 age/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[03:02:49] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:13:05] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[03:18:47] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 ret
[03:18:47] <icinga-wm>	 e unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[03:25:37] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:53] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[03:43:35] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[03:56:53] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:01:45] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:02:37] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 ret
[04:02:37] <icinga-wm>	 e unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:08:23] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:14:09] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:26:33] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:07] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:36:39] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:36:45] <icinga-wm>	 PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:40:23] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:45:57] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 
[04:45:57] <icinga-wm>	 ng: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:49:39] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:55:11] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned th
[04:55:11] <icinga-wm>	 cted status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:57:03] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:01:41] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:37] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpecte
[05:02:37] <icinga-wm>	  504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:06:25] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:08:32] <wikibugs>	 (PS1) Marostegui: Revert "db2090: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/719167
[05:14:55] <marostegui>	 !log Optimize kawiki.flaggedtemplates in eqiad T290057
[05:14:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:00] <stashbot>	 T290057: Optimize flaggedtemplates tables in production - https://phabricator.wikimedia.org/T290057
[05:15:20] <marostegui>	 !log Optimize vecwiki.flaggedtemplates in eqiad T290057
[05:15:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:26] <marostegui>	 !log Optimize eowiki.flaggedtemplates in eqiad T290057
[05:15:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:59] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the mos
[05:15:59] <icinga-wm>	 rticles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News conten
[05:15:59] <icinga-wm>	 ed the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:26:09] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:47:55] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:59:05] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 r
[05:59:05] <icinga-wm>	 the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:01:35] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:04:09] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:10:15] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:10:37] <wikibugs>	 SRE, Commons, Traffic-Icebox, Wikidata, and 4 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517 (eranroz)
[06:11:26] <wikibugs>	 SRE, Commons, Traffic-Icebox, Wikidata, and 4 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517 (eranroz) This also applies to wikidata.
[06:15:59] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting:
[06:15:59] <icinga-wm>	 {domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:23:35] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:26:03] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:29:05] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2090: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/719167 (owner: Marostegui)
[06:32:59] <wikibugs>	 SRE, SRE-Access-Requests: Replace christinedk old ssh public key with a new one - https://phabricator.wikimedia.org/T290279 (ChristineDeKock) Thanks! It now works.
[06:34:43] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:38:20] <wikibugs>	 SRE, Wikifeeds, serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (elukey) From 2021-09-04 restbase has been reporting a lot of connection errors (to what it seems Wikifeeds judging from the URI): https://logstash.wikimedia.org/goto...
[06:38:29] <icinga-wm>	 RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:47:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2090 (re)pooling @ 5%: Slowly repool T288803', diff saved to https://phabricator.wikimedia.org/P17228 and previous config saved to /var/cache/conftool/dbconfig/20210907-064711-root.json
[06:47:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:18] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[07:02:02] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:02:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2090 (re)pooling @ 10%: Slowly repool T288803', diff saved to https://phabricator.wikimedia.org/P17229 and previous config saved to /var/cache/conftool/dbconfig/20210907-070215-root.json
[07:02:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:21] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[07:04:53] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] clinic-duty: add equinix maint support [software] - https://gerrit.wikimedia.org/r/717100 (owner: Filippo Giunchedi)
[07:07:03] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Fixing db2118's pooling config T288244', diff saved to https://phabricator.wikimedia.org/P17230 and previous config saved to /var/cache/conftool/dbconfig/20210907-070702-kormat.json
[07:07:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:08] <stashbot>	 T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244
[07:07:24] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2118 (re)pooling @ 25%: reimage to buster (now with fixed pool config) T288244', diff saved to https://phabricator.wikimedia.org/P17231 and previous config saved to /var/cache/conftool/dbconfig/20210907-070724-kormat.json
[07:07:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:51] <wikibugs>	 SRE, Wikifeeds, serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (MSantos) The source of the failure could be this one in Wikifeeds https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-2021.09.07?id=jmADv3sB9aenX452C...
[07:13:08] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
[07:13:08] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[07:13:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:42] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:17:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2090 (re)pooling @ 25%: Slowly repool T288803', diff saved to https://phabricator.wikimedia.org/P17232 and previous config saved to /var/cache/conftool/dbconfig/20210907-071719-root.json
[07:17:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:25] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[07:20:31] <wikibugs>	 (PS1) Marostegui: Revert "mariadb: Set core sections to unidir replication." [puppet] - https://gerrit.wikimedia.org/r/719168
[07:21:17] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the switch back" [puppet] - https://gerrit.wikimedia.org/r/719168 (owner: Marostegui)
[07:22:28] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2118 (re)pooling @ 50%: reimage to buster (now with fixed pool config) T288244', diff saved to https://phabricator.wikimedia.org/P17233 and previous config saved to /var/cache/conftool/dbconfig/20210907-072227-kormat.json
[07:22:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:34] <stashbot>	 T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244
[07:25:00] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:25:23] <wikibugs>	 SRE, Wikifeeds, serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (elukey) Thanks @MSantos   Update from IRC: me and @JMeybohm noticed that in the k8s wikifeeds graphs, the rise of the errors (Sept 4th ~02:30 UTC) corresponded to a b...
[07:27:19] <wikibugs>	 SRE, SRE-Access-Requests: Replace christinedk old ssh public key with a new one - https://phabricator.wikimedia.org/T290279 (fgiunchedi) Open→Resolved a:fgiunchedi
[07:32:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2090 (re)pooling @ 50%: Slowly repool T288803', diff saved to https://phabricator.wikimedia.org/P17234 and previous config saved to /var/cache/conftool/dbconfig/20210907-073222-root.json
[07:32:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:29] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[07:34:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Start to pool db2090 into API T288803', diff saved to https://phabricator.wikimedia.org/P17235 and previous config saved to /var/cache/conftool/dbconfig/20210907-073436-marostegui.json
[07:34:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:24] <godog>	 !log +100G for prometheus/k8s codfw
[07:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:31] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2118 (re)pooling @ 75%: reimage to buster (now with fixed pool config) T288244', diff saved to https://phabricator.wikimedia.org/P17236 and previous config saved to /var/cache/conftool/dbconfig/20210907-073731-kormat.json
[07:37:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:36] <stashbot>	 T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244
[07:46:56] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] clinic-duty: Minor DOM handling clean up [software] - https://gerrit.wikimedia.org/r/717653 (owner: Krinkle)
[07:47:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2090 (re)pooling @ 75%: Slowly repool T288803', diff saved to https://phabricator.wikimedia.org/P17237 and previous config saved to /var/cache/conftool/dbconfig/20210907-074726-root.json
[07:47:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:31] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[07:49:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'More weight for db2090 into API T288803', diff saved to https://phabricator.wikimedia.org/P17238 and previous config saved to /var/cache/conftool/dbconfig/20210907-074901-marostegui.json
[07:49:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:35] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2118 (re)pooling @ 100%: reimage to buster (now with fixed pool config) T288244', diff saved to https://phabricator.wikimedia.org/P17239 and previous config saved to /var/cache/conftool/dbconfig/20210907-075235-kormat.json
[07:52:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:40] <stashbot>	 T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244
[07:53:04] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2 C: +2] "Thank you for the reviews!" [debs/python-eventlet] (debian/bullseye) - https://gerrit.wikimedia.org/r/715199 (https://phabricator.wikimedia.org/T283714) (owner: Filippo Giunchedi)
[08:02:00] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:02:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2090 (re)pooling @ 100%: Slowly repool T288803', diff saved to https://phabricator.wikimedia.org/P17240 and previous config saved to /var/cache/conftool/dbconfig/20210907-080230-root.json
[08:02:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:35] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[08:07:11] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review, User-fgiunchedi: Python 3's eventlet.green getaddrinfo timeout in Bullseye - https://phabricator.wikimedia.org/T283714 (fgiunchedi) Open→Resolved a:fgiunchedi Package uploaded and upgraded on thanos-fe hosts, resolving
[08:09:47] <wikibugs>	 (CR) Klausman: Add revscoring-editquality as first ml-service to helmfile.d (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[08:16:08] <wikibugs>	 (PS1) MVernon: pc1008: remove puppet entries for pc1008 [puppet] - https://gerrit.wikimedia.org/r/719223 (https://phabricator.wikimedia.org/T289119)
[08:19:37] <wikibugs>	 (CR) Jcrespo: [C: +1] pc1008: remove puppet entries for pc1008 [puppet] - https://gerrit.wikimedia.org/r/719223 (https://phabricator.wikimedia.org/T289119) (owner: MVernon)
[08:22:52] <wikibugs>	 (CR) Awight: Set template namespace for code mirror line numbering (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: WMDE-Fisch)
[08:22:59] <wikibugs>	 (PS3) Awight: Set template namespace for code mirror line numbering [mediawiki-config] - https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: WMDE-Fisch)
[08:23:05] <wikibugs>	 (PS2) Filippo Giunchedi: sslcert: additional search paths for certificates [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261)
[08:24:58] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:24:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:25:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:07] <wikibugs>	 (CR) Filippo Giunchedi: "Thank you for the reviews! I've removed the POC status since code seems fine as-is, I've tested this in Pontoon and it works as expected. " [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: Filippo Giunchedi)
[08:25:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:25:29] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:25:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:20] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:29] <wikibugs>	 (CR) Marostegui: [C: +1] pc1008: remove puppet entries for pc1008 [puppet] - https://gerrit.wikimedia.org/r/719223 (https://phabricator.wikimedia.org/T289119) (owner: MVernon)
[08:29:31] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31015/console" [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: Filippo Giunchedi)
[08:29:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'More weight for db2090 into API T288803', diff saved to https://phabricator.wikimedia.org/P17241 and previous config saved to /var/cache/conftool/dbconfig/20210907-082952-marostegui.json
[08:29:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:57] <stashbot>	 T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[08:31:34] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc1008.eqiad.wmnet
[08:31:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:38] <wikibugs>	 (CR) MVernon: [C: +2] pc1008: remove puppet entries for pc1008 [puppet] - https://gerrit.wikimedia.org/r/719223 (https://phabricator.wikimedia.org/T289119) (owner: MVernon)
[08:36:56] <wikibugs>	 (PS2) Filippo Giunchedi: Add patches to handle mmkubernetes and omfwd stats [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/715457 (https://phabricator.wikimedia.org/T210137)
[08:37:12] <wikibugs>	 (CR) Filippo Giunchedi: Add patches to handle mmkubernetes and omfwd stats (1 comment) [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/715457 (https://phabricator.wikimedia.org/T210137) (owner: Filippo Giunchedi)
[08:37:20] <wikibugs>	 (PS1) Elukey: conftool-data: add worker nodes to ml_serve [puppet] - https://gerrit.wikimedia.org/r/719225 (https://phabricator.wikimedia.org/T289835)
[08:37:22] <wikibugs>	 (PS1) Elukey: conftool-data: add new inference discovery service [puppet] - https://gerrit.wikimedia.org/r/719226 (https://phabricator.wikimedia.org/T289835)
[08:37:49] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Add patches to handle mmkubernetes and omfwd stats [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/715457 (https://phabricator.wikimedia.org/T210137) (owner: Filippo Giunchedi)
[08:42:12] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc1008.eqiad.wmnet
[08:42:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:47] <Emperor>	 !log removing pc1008 from tendril and zarcillo T289119
[08:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:52] <stashbot>	 T289119: decommission pc1008.eqiad.wmnet - https://phabricator.wikimedia.org/T289119
[08:46:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission pc1007.eqiad.wmnet. - https://phabricator.wikimedia.org/T289118 (MatthewVernon) This host is ready for DC-Ops to decommission
[08:51:13] <Emperor>	 !log removing pc1008 from orchestrator T289119
[08:51:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:17] <stashbot>	 T289119: decommission pc1008.eqiad.wmnet - https://phabricator.wikimedia.org/T289119
[08:53:01] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.dns.netbox
[08:53:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:37] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission pc1008.eqiad.wmnet - https://phabricator.wikimedia.org/T289119 (MatthewVernon) a:MatthewVernon→wiki_willy
[08:54:26] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission pc1008.eqiad.wmnet - https://phabricator.wikimedia.org/T289119 (MatthewVernon) This host is ready for DC-Ops to decommission
[08:57:57] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:58:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:34] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:02:55] <wikibugs>	 (PS1) Elukey: Add inference eqiad service record [dns] - https://gerrit.wikimedia.org/r/719227 (https://phabricator.wikimedia.org/T289835)
[09:04:03] <wikibugs>	 (CR) Elukey: "Record already added in netbox (and cookbook executed)" [dns] - https://gerrit.wikimedia.org/r/719227 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[09:05:00] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs20
[09:05:00] <icinga-wm>	 .wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:08:11] <wikibugs>	 (CR) Volans: "All good on Netbox (assigned+reserved), thanks!" [dns] - https://gerrit.wikimedia.org/r/719227 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[09:08:36] <wikibugs>	 (PS1) MVernon: pc1009: remove puppet entries for pc1009 [puppet] - https://gerrit.wikimedia.org/r/719228 (https://phabricator.wikimedia.org/T289120)
[09:09:16] <wikibugs>	 (CR) Vgutierrez: sslcert: additional search paths for certificates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: Filippo Giunchedi)
[09:12:06] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={jmx_wdqs_blazegraph,mysql-parsercache} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:13:25] <wikibugs>	 (CR) Marostegui: [C: +1] pc1009: remove puppet entries for pc1009 [puppet] - https://gerrit.wikimedia.org/r/719228 (https://phabricator.wikimedia.org/T289120) (owner: MVernon)
[09:15:35] <wikibugs>	 (CR) MVernon: [C: +2] pc1009: remove puppet entries for pc1009 [puppet] - https://gerrit.wikimedia.org/r/719228 (https://phabricator.wikimedia.org/T289120) (owner: MVernon)
[09:16:06] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc1009.eqiad.wmnet
[09:16:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:06] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs20
[09:19:06] <icinga-wm>	 .wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:21:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:23:38] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:25:06] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:25:16] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc1009.eqiad.wmnet
[09:25:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:10] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 3.550 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:26:13] <Emperor>	 !log removing pc1009 from tendril and zarcillo T289120
[09:26:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:17] <stashbot>	 T289120: decommission pc1009.eqiad.wmnet - https://phabricator.wikimedia.org/T289120
[09:27:16] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:31:29] <wikibugs>	 (PS1) Filippo Giunchedi: Try restarting rsyslog on package installation [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/719231 (https://phabricator.wikimedia.org/T210137)
[09:37:06] <wikibugs>	 (PS1) Btullis: Add a promehtheus scrape target for the aqs_new role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755)
[09:40:42] <icinga-wm>	 PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:40:53] <wikibugs>	 (CR) Btullis: "Adding @Filippo for review, as this is a prometheus scrape target change." [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis)
[09:42:53] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31016/console" [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis)
[09:43:21] <wikibugs>	 (PS1) Volans: dhcp: small refactor [software/spicerack] - https://gerrit.wikimedia.org/r/719234
[09:43:28] <wikibugs>	 (PS1) Elukey: istio: change ingress gateway nodeport to 4688 [deployment-charts] - https://gerrit.wikimedia.org/r/719235 (https://phabricator.wikimedia.org/T289835)
[09:44:43] <wikibugs>	 (CR) Btullis: Add a promehtheus scrape target for the aqs_new role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis)
[09:45:51] <wikibugs>	 (PS1) MMandere: varnish: Remove Vagrant test scripts [puppet] - https://gerrit.wikimedia.org/r/719236 (https://phabricator.wikimedia.org/T286639)
[09:46:10] <wikibugs>	 (PS5) Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771)
[09:46:15] <Emperor>	 !log removing pc1009 from orchestrator T289120
[09:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:20] <stashbot>	 T289120: decommission pc1009.eqiad.wmnet - https://phabricator.wikimedia.org/T289120
[09:46:20] <wikibugs>	 (PS2) Btullis: Add a promehtheus scrape target for the aqs_new role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755)
[09:46:25] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] sslcert: additional search paths for certificates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: Filippo Giunchedi)
[09:46:44] <wikibugs>	 (PS3) Btullis: Add a promehtheus scrape target for the aqs_next role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755)
[09:48:21] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM, see inline for nits" [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis)
[09:48:25] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Add a promehtheus scrape target for the aqs_next role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis)
[09:48:42] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31017/console" [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis)
[09:48:50] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware, Patch-For-Review: decommission pc1009.eqiad.wmnet - https://phabricator.wikimedia.org/T289120 (MatthewVernon) a:wiki_willy
[09:49:00] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware, Patch-For-Review: decommission pc1009.eqiad.wmnet - https://phabricator.wikimedia.org/T289120 (MatthewVernon) This host is ready for DC-Ops to decommission
[09:49:51] <wikibugs>	 (PS4) Btullis: Add a promehtheus scrape target for the aqs_next role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755)
[09:50:32] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/719119 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[09:50:36] <wikibugs>	 (CR) Btullis: "Thanks for spotting that." [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis)
[09:50:54] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis)
[09:51:57] <wikibugs>	 (PS5) Btullis: Add a prometheus scrape target for the aqs_next role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755)
[09:51:59] <wikibugs>	 (PS1) MVernon: pc1010: remove puppet entries for pc1010 [puppet] - https://gerrit.wikimedia.org/r/719237 (https://phabricator.wikimedia.org/T289122)
[09:52:07] <wikibugs>	 (CR) Btullis: Add a prometheus scrape target for the aqs_next role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis)
[09:52:10] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Add a prometheus scrape target for the aqs_next role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis)
[09:52:39] <wikibugs>	 (CR) Jbond: [C: +1] ipmi: add status and reboot capabilities [software/spicerack] - https://gerrit.wikimedia.org/r/717251 (owner: Volans)
[09:54:18] <wikibugs>	 (CR) Btullis: [C: +2] Add a prometheus scrape target for the aqs_next role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis)
[09:58:09] <wikibugs>	 (CR) Kormat: [C: +1] pc1010: remove puppet entries for pc1010 [puppet] - https://gerrit.wikimedia.org/r/719237 (https://phabricator.wikimedia.org/T289122) (owner: MVernon)
[10:01:48] <wikibugs>	 (CR) MVernon: [C: +2] pc1010: remove puppet entries for pc1010 [puppet] - https://gerrit.wikimedia.org/r/719237 (https://phabricator.wikimedia.org/T289122) (owner: MVernon)
[10:02:02] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:44] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc1010.eqiad.wmnet
[10:02:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:23] <wikibugs>	 (CR) MMandere: [C: +2] puppetmaster: Add drmrs DC Site [puppet] - https://gerrit.wikimedia.org/r/719119 (https://phabricator.wikimedia.org/T282787) (owner: MMandere)
[10:10:19] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/719131 (owner: Volans)
[10:10:51] <wikibugs>	 (PS1) Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835)
[10:10:55] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/719135 (owner: Volans)
[10:11:50] <wikibugs>	 (CR) Elukey: [C: +2] istio: change ingress gateway nodeport to 4688 [deployment-charts] - https://gerrit.wikimedia.org/r/719235 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[10:11:57] <wikibugs>	 (CR) jerkins-bot: [V: -1] role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[10:13:23] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] Set template namespace for code mirror line numbering [mediawiki-config] - https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: WMDE-Fisch)
[10:13:26] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/719234 (owner: Volans)
[10:15:26] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc1010.eqiad.wmnet
[10:15:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:00] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/719140 (owner: Volans)
[10:16:03] <wikibugs>	 (PS1) MVernon: pc2008: remove puppet entries for pc2008 [puppet] - https://gerrit.wikimedia.org/r/719241 (https://phabricator.wikimedia.org/T289115)
[10:17:06] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/719141 (owner: Volans)
[10:22:38] <Emperor>	 !log removing pc1010 from tendril and zarcillo T289122
[10:22:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:43] <stashbot>	 T289122: decommission pc1010.eqiad.wmnet - https://phabricator.wikimedia.org/T289122
[10:23:58] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:27:00] <Emperor>	 !log removing pc1010 from orchestrator T289122
[10:27:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:52] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:28:06] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:28:18] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission pc1010.eqiad.wmnet - https://phabricator.wikimedia.org/T289122 (MatthewVernon) a:MatthewVernon→wiki_willy
[10:28:20] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:28:30] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:29:11] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware: decommission pc1010.eqiad.wmnet - https://phabricator.wikimedia.org/T289122 (MatthewVernon) This host is ready for DC-Ops to decommission
[10:29:18] <icinga-wm>	 ACKNOWLEDGEMENT - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Hnowlan Tables being reinitialised https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:29:18] <icinga-wm>	 ACKNOWLEDGEMENT - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Hnowlan Tables being reinitialised https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:29:21] <icinga-wm>	 ACKNOWLEDGEMENT - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page vie
[10:29:22] <icinga-wm>	 ned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the
[10:29:22] <icinga-wm>	 ted status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Hnowlan Tables being reinitialised https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:29:24] <icinga-wm>	 ACKNOWLEDGEMENT - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page vie
[10:29:25] <icinga-wm>	 ned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the
[10:29:25] <icinga-wm>	 ted status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Hnowlan Tables being reinitialised https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:29:38] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: commissioning aqs_new hosts
[10:29:40] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: commissioning aqs_new hosts
[10:29:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:47] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on 6 hosts with reason: commissioning aqs_new hosts
[10:29:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:52] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on 6 hosts with reason: commissioning aqs_new hosts
[10:29:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:12] <hnowlan>	 apologies for the noise :) 
[10:31:15] <wikibugs>	 (PS1) MVernon: pc2009: remove puppet entries for pc2009 [puppet] - https://gerrit.wikimedia.org/r/719243 (https://phabricator.wikimedia.org/T289116)
[10:32:24] <wikibugs>	 (CR) Volans: [C: +2] prospector: disable E203 for pep-8 over black [software/spicerack] - https://gerrit.wikimedia.org/r/719140 (owner: Volans)
[10:32:37] <wikibugs>	 (CR) Volans: [C: +2] style: if no local modifications check last commit [software/spicerack] - https://gerrit.wikimedia.org/r/719141 (owner: Volans)
[10:32:39] <wikibugs>	 (PS3) Jbond: puppet_agent_stats: add catalog version to prom metricts [puppet] - https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585)
[10:32:47] <wikibugs>	 (CR) Volans: [C: +2] ipmi: add status and reboot capabilities [software/spicerack] - https://gerrit.wikimedia.org/r/717251 (owner: Volans)
[10:32:58] <wikibugs>	 (PS2) MVernon: pc2008: remove puppet entries for pc2008 [puppet] - https://gerrit.wikimedia.org/r/719241 (https://phabricator.wikimedia.org/T289115)
[10:33:18] <wikibugs>	 (CR) Kormat: [C: +1] pc2008: remove puppet entries for pc2008 [puppet] - https://gerrit.wikimedia.org/r/719241 (https://phabricator.wikimedia.org/T289115) (owner: MVernon)
[10:34:09] <wikibugs>	 (CR) Kormat: [C: +1] pc2009: remove puppet entries for pc2009 [puppet] - https://gerrit.wikimedia.org/r/719243 (https://phabricator.wikimedia.org/T289116) (owner: MVernon)
[10:34:31] <wikibugs>	 (CR) MVernon: [C: +2] pc2008: remove puppet entries for pc2008 [puppet] - https://gerrit.wikimedia.org/r/719241 (https://phabricator.wikimedia.org/T289115) (owner: MVernon)
[10:35:38] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc2008.codfw.wmnet
[10:35:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:30] <wikibugs>	 (Merged) jenkins-bot: prospector: disable E203 for pep-8 over black [software/spicerack] - https://gerrit.wikimedia.org/r/719140 (owner: Volans)
[10:37:48] <wikibugs>	 (Merged) jenkins-bot: style: if no local modifications check last commit [software/spicerack] - https://gerrit.wikimedia.org/r/719141 (owner: Volans)
[10:38:52] <wikibugs>	 (Merged) jenkins-bot: ipmi: add status and reboot capabilities [software/spicerack] - https://gerrit.wikimedia.org/r/717251 (owner: Volans)
[10:40:11] <wikibugs>	 (PS2) Volans: netbox: add getter for the asset tag mgmt FQDN [software/spicerack] - https://gerrit.wikimedia.org/r/719131
[10:41:02] <icinga-wm>	 RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:41:51] <wikibugs>	 (PS1) MVernon: pc2010: remove puppet entries for pc2010 [puppet] - https://gerrit.wikimedia.org/r/719244 (https://phabricator.wikimedia.org/T289117)
[10:46:38] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc2008.codfw.wmnet
[10:46:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:39] <wikibugs>	 (CR) Volans: [C: +2] netbox: add getter for the asset tag mgmt FQDN [software/spicerack] - https://gerrit.wikimedia.org/r/719131 (owner: Volans)
[10:48:04] <wikibugs>	 (PS2) Volans: dhcp: small refactor [software/spicerack] - https://gerrit.wikimedia.org/r/719234
[10:49:31] <Emperor>	 !log removing pc2008 from tendril and zarcillo T289115
[10:49:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:36] <stashbot>	 T289115: decommission pc2008.codfw.wmnet - https://phabricator.wikimedia.org/T289115
[10:50:27] <wikibugs>	 (PS4) Volans: puppet_agent_stats: add catalog version to prom metrics [puppet] - https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[10:50:59] <wikibugs>	 (CR) Volans: "Sorry I had forgot to hit sent on the datapoints, see inline" [puppet] - https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[10:51:11] <Emperor>	 !log removing pc2008 from orchestrator T289115
[10:51:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:56] <wikibugs>	 (Merged) jenkins-bot: netbox: add getter for the asset tag mgmt FQDN [software/spicerack] - https://gerrit.wikimedia.org/r/719131 (owner: Volans)
[10:55:31] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware: decommission pc2008.codfw.wmnet - https://phabricator.wikimedia.org/T289115 (MatthewVernon) a:MatthewVernon→Papaul
[10:55:35] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware: decommission pc2008.codfw.wmnet - https://phabricator.wikimedia.org/T289115 (MatthewVernon) This host is ready for DC-Ops to decommission
[10:57:17] <wikibugs>	 (CR) Jbond: puppet_agent_stats: add catalog version to prom metrics (6 comments) [puppet] - https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[10:57:52] <wikibugs>	 (CR) Volans: [C: +2] dhcp: small refactor [software/spicerack] - https://gerrit.wikimedia.org/r/719234 (owner: Volans)
[10:58:16] <wikibugs>	 (PS2) MVernon: pc2009: remove puppet entries for pc2009 [puppet] - https://gerrit.wikimedia.org/r/719243 (https://phabricator.wikimedia.org/T289116)
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210907T1100).
[11:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[11:00:16] <urbanecm>	 indeed, nothing to do :/
[11:03:28] <wikibugs>	 (Merged) jenkins-bot: dhcp: small refactor [software/spicerack] - https://gerrit.wikimedia.org/r/719234 (owner: Volans)
[11:09:01] <wikibugs>	 SRE, Patch-For-Review, SRE Observability (FY2021/2022-Q1): Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (jbond) In relation to puppet i think we could look again at creating a puppet [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/143788/ |...
[11:13:25] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:14:02] <wikibugs>	 SRE, MW-on-K8s, serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (JMeybohm)
[11:16:27] <wikibugs>	 SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (NForrester) I can confirm that SSH access is working and the initial kerberos password has been changed.  Thank you kindly for...
[11:16:55] <wikibugs>	 SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (KLevan) With Nahid's help we have set up the Kerberos password and everything is working fine. Thank you all for your work.
[11:18:48] <awight>	 urbanecm: I'll jump in with a minor patch, unless there's other activity?
[11:19:04] <urbanecm>	 awight: none that I'd be aware of -- go ahead.
[11:19:46] <awight>	 :-)
[11:22:56] <wikibugs>	 (PS1) Awight: Change line numbers default to null [extensions/CodeMirror] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/719170 (https://phabricator.wikimedia.org/T290226)
[11:23:10] <wikibugs>	 (CR) Awight: [C: +2] "Deployment." [extensions/CodeMirror] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/719170 (https://phabricator.wikimedia.org/T290226) (owner: Awight)
[11:23:49] <wikibugs>	 (PS4) Awight: Set template namespace for code mirror line numbering [mediawiki-config] - https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: WMDE-Fisch)
[11:23:56] <wikibugs>	 (CR) jerkins-bot: [V: -1] Change line numbers default to null [extensions/CodeMirror] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/719170 (https://phabricator.wikimedia.org/T290226) (owner: Awight)
[11:23:58] <wikibugs>	 (CR) Awight: [C: +2] "Deployment." [mediawiki-config] - https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: WMDE-Fisch)
[11:24:49] <wikibugs>	 (Merged) jenkins-bot: Set template namespace for code mirror line numbering [mediawiki-config] - https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: WMDE-Fisch)
[11:25:09] <wikibugs>	 (CR) Awight: [C: +2] "Deployment." [extensions/CodeMirror] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/719170 (https://phabricator.wikimedia.org/T290226) (owner: Awight)
[11:25:59] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:28:43] <logmsgbot>	 !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:717192|Set template namespace for code mirror line numbering (T290226)]] (duration: 00m 59s)
[11:28:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:49] <stashbot>	 T290226:  Default namespace for line numbering can not be unset - https://phabricator.wikimedia.org/T290226
[11:31:07] <wikibugs>	 (Merged) jenkins-bot: Change line numbers default to null [extensions/CodeMirror] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/719170 (https://phabricator.wikimedia.org/T290226) (owner: Awight)
[11:33:46] <logmsgbot>	 !log awight@deploy1002 Synchronized php-1.37.0-wmf.21/extensions/CodeMirror/extension.json: Backport: [[gerrit:719170|Change line numbers default to null (T290226)]] (duration: 00m 59s)
[11:33:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:51] <stashbot>	 T290226:  Default namespace for line numbering can not be unset - https://phabricator.wikimedia.org/T290226
[11:36:12] <awight>	 EU vegan bacon complete.
[11:36:20] <awight>	 !log EU backport complete
[11:36:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:29] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[11:39:30] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Thanks for the datapoints!" [puppet] - https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[11:40:01] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[11:40:53] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[11:41:01] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[11:45:54] <wikibugs>	 SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (fgiunchedi)
[11:46:29] <wikibugs>	 SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (fgiunchedi)
[11:46:47] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.remove-downtime for 6 hosts
[11:46:49] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 6 hosts
[11:46:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:06] <wikibugs>	 SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (fgiunchedi) Open→Resolved I'm glad things are working @NForrester! Resolving
[11:47:12] <wikibugs>	 SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (fgiunchedi) Open→Resolved Great to hear @KLevan ! Resolving
[12:01:19] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:10:25] <wikibugs>	 (CR) Elukey: [C: +2] conftool-data: add worker nodes to ml_serve [puppet] - https://gerrit.wikimedia.org/r/719225 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[12:14:22] <wikibugs>	 (PS2) Elukey: conftool-data: add new inference discovery service [puppet] - https://gerrit.wikimedia.org/r/719226 (https://phabricator.wikimedia.org/T289835)
[12:14:24] <wikibugs>	 (PS2) Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835)
[12:14:34] <wikibugs>	 (CR) Kormat: [C: +1] pc2010: remove puppet entries for pc2010 [puppet] - https://gerrit.wikimedia.org/r/719244 (https://phabricator.wikimedia.org/T289117) (owner: MVernon)
[12:14:44] <wikibugs>	 (PS6) Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771)
[12:15:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[12:15:15] <wikibugs>	 (CR) Elukey: [C: +2] conftool-data: add new inference discovery service [puppet] - https://gerrit.wikimedia.org/r/719226 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[12:19:42] <wikibugs>	 (PS3) Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835)
[12:19:55] <icinga-wm>	 RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[12:21:38] <wikibugs>	 (CR) Kormat: [C: +1] pc2009: remove puppet entries for pc2009 [puppet] - https://gerrit.wikimedia.org/r/719243 (https://phabricator.wikimedia.org/T289116) (owner: MVernon)
[12:24:41] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:27:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'fix s1 weights T288594', diff saved to https://phabricator.wikimedia.org/P17246 and previous config saved to /var/cache/conftool/dbconfig/20210907-122708-marostegui.json
[12:27:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:15] <stashbot>	 T288594: Pre DC switchover codfw -> eqiad DB work - https://phabricator.wikimedia.org/T288594
[12:27:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'fix s1 weights T288594', diff saved to https://phabricator.wikimedia.org/P17247 and previous config saved to /var/cache/conftool/dbconfig/20210907-122747-marostegui.json
[12:27:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:32] <wikibugs>	 (PS1) MVernon: wmf-config: remove old parsercache hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/719255 (https://phabricator.wikimedia.org/T289115)
[12:29:50] <wikibugs>	 (Abandoned) MVernon: wmf-config: remove old parsercache hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/719255 (https://phabricator.wikimedia.org/T289115) (owner: MVernon)
[12:35:13] <wikibugs>	 (PS1) MVernon: wmf-config: remove old parsercache hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/719257 (https://phabricator.wikimedia.org/T289115)
[12:36:22] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add emacs-nox to standard packages [puppet] - https://gerrit.wikimedia.org/r/377721 (owner: Muehlenhoff)
[12:37:26] <wikibugs>	 (CR) Kormat: [C: +1] wmf-config: remove old parsercache hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/719257 (https://phabricator.wikimedia.org/T289115) (owner: MVernon)
[12:39:00] <wikibugs>	 (CR) MVernon: [C: +2] wmf-config: remove old parsercache hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/719257 (https://phabricator.wikimedia.org/T289115) (owner: MVernon)
[12:39:43] <wikibugs>	 (Merged) jenkins-bot: wmf-config: remove old parsercache hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/719257 (https://phabricator.wikimedia.org/T289115) (owner: MVernon)
[12:43:20] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, one comment inline." [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/719231 (https://phabricator.wikimedia.org/T210137) (owner: Filippo Giunchedi)
[12:43:40] <wikibugs>	 (CR) Elukey: [C: +2] Add inference eqiad service record [dns] - https://gerrit.wikimedia.org/r/719227 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[12:44:55] <wikibugs>	 (CR) MVernon: [C: +2] pc2009: remove puppet entries for pc2009 [puppet] - https://gerrit.wikimedia.org/r/719243 (https://phabricator.wikimedia.org/T289116) (owner: MVernon)
[12:45:28] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc2009.codfw.wmnet
[12:45:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:42] <wikibugs>	 SRE, ops-codfw, DC-Ops, fundraising-tech-ops: frdb2003: configure RAID, install OS, and add to fr-analytics db replication - https://phabricator.wikimedia.org/T290484 (Jgreen)
[12:46:16] <wikibugs>	 (CR) MVernon: [C: +2] pc2010: remove puppet entries for pc2010 [puppet] - https://gerrit.wikimedia.org/r/719244 (https://phabricator.wikimedia.org/T289117) (owner: MVernon)
[12:48:11] <Emperor>	 jouncebot: now
[12:48:11] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 11 minute(s)
[12:51:42] <logmsgbot>	 !log mvernon@deploy1002 Synchronized wmf-config/ProductionServices.php: Remove old decommissioned pc hosts T284825 (duration: 01m 02s)
[12:51:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:46] <stashbot>	 T284825: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825
[12:53:31] <wikibugs>	 (CR) Vgutierrez: sslcert: additional search paths for certificates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: Filippo Giunchedi)
[12:59:24] <wikibugs>	 (PS1) Filippo Giunchedi: clinic-duty: test individual properties [software] - https://gerrit.wikimedia.org/r/719259
[12:59:38] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc2009.codfw.wmnet
[12:59:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:15] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] clinic-duty: test individual properties [software] - https://gerrit.wikimedia.org/r/719259 (owner: Filippo Giunchedi)
[13:02:31] <wikibugs>	 (PS2) MVernon: pc2010: remove puppet entries for pc2010 [puppet] - https://gerrit.wikimedia.org/r/719244 (https://phabricator.wikimedia.org/T289117)
[13:02:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'fix s8 weights T288594', diff saved to https://phabricator.wikimedia.org/P17248 and previous config saved to /var/cache/conftool/dbconfig/20210907-130244-marostegui.json
[13:02:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:49] <stashbot>	 T288594: Pre DC switchover codfw -> eqiad DB work - https://phabricator.wikimedia.org/T288594
[13:05:15] <wikibugs>	 (PS1) Volans: icinga: remove deprecated Icinga class [software/spicerack] - https://gerrit.wikimedia.org/r/719260
[13:05:18] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete java::security [puppet] - https://gerrit.wikimedia.org/r/719261 (https://phabricator.wikimedia.org/T282454)
[13:05:57] <wikibugs>	 SRE, SRE-OnFire, observability, User-jbond: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 (cmooney) Thanks for opening this.  I am not an expert on this at all, but was involved in the deployment so had a little look.  The errors are odd, I've tested here...
[13:07:53] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[13:07:59] <wikibugs>	 SRE, ops-codfw, DBA: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (Marostegui) I have agreed with @Papaul to do this after the switchover.
[13:08:32] <wikibugs>	 (CR) Jbond: [C: +1] "lgmt" [software/spicerack] - https://gerrit.wikimedia.org/r/719260 (owner: Volans)
[13:08:45] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:22] <wikibugs>	 (CR) Jbond: [C: +2] "thanks" [puppet] - https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[13:12:54] <wikibugs>	 SRE, SRE-OnFire, observability, User-jbond: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 (Volans) @cmooney thanks for looking into this! I'm no Java expert but the reference to BigDecimal:Class seems to be a Java one from their backend (see https://docs....
[13:16:16] <wikibugs>	 (CR) Muehlenhoff: "There are three major services using the hardened java.security settings: The IDPs (which I'll test in a bit). Looking at Debmonitor, Hado" [puppet] - https://gerrit.wikimedia.org/r/719064 (owner: Muehlenhoff)
[13:18:24] <wikibugs>	 (CR) Volans: [C: +2] icinga: remove deprecated Icinga class [software/spicerack] - https://gerrit.wikimedia.org/r/719260 (owner: Volans)
[13:21:19] <Emperor>	 !log removing pc2009 from tendril and zarcillo T289116
[13:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:25] <stashbot>	 T289116: decommission pc2009.codfw.wmnet - https://phabricator.wikimedia.org/T289116
[13:21:49] <Emperor>	 !log removing pc2009 from orchestrator T289116
[13:21:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:26] <wikibugs>	 (Merged) jenkins-bot: icinga: remove deprecated Icinga class [software/spicerack] - https://gerrit.wikimedia.org/r/719260 (owner: Volans)
[13:24:32] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware: decommission pc2009.codfw.wmnet - https://phabricator.wikimedia.org/T289116 (MatthewVernon) a:MatthewVernon→Papaul
[13:24:43] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware: decommission pc2009.codfw.wmnet - https://phabricator.wikimedia.org/T289116 (MatthewVernon) This host is ready for DC-Ops to decommission
[13:25:45] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc2010.codfw.wmnet
[13:25:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:03] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:23] <wikibugs>	 (CR) Dzahn: [C: +1] gitlab::backup move backup cronjobs to puppet [puppet] - https://gerrit.wikimedia.org/r/712322 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[13:33:30] <wikibugs>	 (PS1) Effie Mouzeli: mediawiki: add auto_prepend_file [deployment-charts] - https://gerrit.wikimedia.org/r/719264 (https://phabricator.wikimedia.org/T290485)
[13:35:26] <wikibugs>	 (CR) JMeybohm: [C: +1] mediawiki: add auto_prepend_file [deployment-charts] - https://gerrit.wikimedia.org/r/719264 (https://phabricator.wikimedia.org/T290485) (owner: Effie Mouzeli)
[13:37:16] <wikibugs>	 (CR) Dzahn: "I will amend to just create the class but not apply it." [puppet] - https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) (owner: Dzahn)
[13:37:27] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] mediawiki: add auto_prepend_file [deployment-charts] - https://gerrit.wikimedia.org/r/719264 (https://phabricator.wikimedia.org/T290485) (owner: Effie Mouzeli)
[13:40:24] <wikibugs>	 (Merged) jenkins-bot: mediawiki: add auto_prepend_file [deployment-charts] - https://gerrit.wikimedia.org/r/719264 (https://phabricator.wikimedia.org/T290485) (owner: Effie Mouzeli)
[13:40:46] <wikibugs>	 SRE-swift-storage: Swift users and their usage - https://phabricator.wikimedia.org/T264291 (jcrespo) I would like to bring to your attention T138915. This is **not a current user of Swift**, but it seems like something like this, a misc-object storage cluster, would be the ideal location, rather than a relat...
[13:40:58] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc2010.codfw.wmnet
[13:41:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:45] <wikibugs>	 (CR) Kormat: [C: +1] pc2010: remove puppet entries for pc2010 [puppet] - https://gerrit.wikimedia.org/r/719244 (https://phabricator.wikimedia.org/T289117) (owner: MVernon)
[13:41:57] <wikibugs>	 (CR) MVernon: [C: +2] pc2010: remove puppet entries for pc2010 [puppet] - https://gerrit.wikimedia.org/r/719244 (https://phabricator.wikimedia.org/T289117) (owner: MVernon)
[13:43:13] <wikibugs>	 (PS1) JMeybohm: custom_deploy: Add istio manifest for main clusters [deployment-charts] - https://gerrit.wikimedia.org/r/719265
[13:43:16] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[13:43:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:46] <wikibugs>	 (PS2) JMeybohm: custom_deploy: Add istio manifest for main clusters [deployment-charts] - https://gerrit.wikimedia.org/r/719265 (https://phabricator.wikimedia.org/T287007)
[13:46:00] <wikibugs>	 SRE, ops-codfw, serviceops: mw2264 went down - https://phabricator.wikimedia.org/T290242 (Papaul) Open→Resolved @Dzahn I checked the server today i have no errors showing on A1 closing this task . IF we have the error again please reopen the task.  Thanks
[13:46:12] <wikibugs>	 (PS3) JMeybohm: custom_deploy: Add istio manifest for main clusters [deployment-charts] - https://gerrit.wikimedia.org/r/719265 (https://phabricator.wikimedia.org/T287007)
[13:48:19] <wikibugs>	 SRE, ops-codfw, serviceops: mw2264 went down - https://phabricator.wikimedia.org/T290242 (Dzahn) Thank you @Papaul I will repool the server.
[13:49:16] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2264.codfw.wmnet
[13:49:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:41] <mutante>	 !log mw2264 - scap pulled and repooled after T290242
[13:49:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:46] <stashbot>	 T290242: mw2264 went down - https://phabricator.wikimedia.org/T290242
[13:50:00] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[13:50:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:58] <jayme>	 !log uncordoned kubestage2001
[13:52:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[13:54:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:46] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[13:55:14] <jayme>	 that could be the consequence of my uncordon...maybe
[13:56:18] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2264 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[13:57:04] <wikibugs>	 SRE, ops-codfw, DC-Ops, decommission-hardware: decommission pc2009.codfw.wmnet - https://phabricator.wikimedia.org/T289116 (Papaul)
[13:57:15] <XioNoX>	 !log drain esams-eqiad for circuit maintenance - T288503
[13:57:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:38] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:57:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:52] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[13:58:02] <wikibugs>	 SRE, ops-codfw, DC-Ops, decommission-hardware, Patch-For-Review: decommission pc2008.codfw.wmnet - https://phabricator.wikimedia.org/T289115 (Papaul)
[13:59:51] <Emperor>	 !log removing pc2010 from tendril and zarcillo T289117
[13:59:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:56] <stashbot>	 T289117: decommission pc2010.codfw.wmnet - https://phabricator.wikimedia.org/T289117
[14:00:55] <wikibugs>	 (PS6) Jbond: wmflib: puppet prometheus reporting [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585)
[14:01:16] <Emperor>	 !log removing pc2010 from orchestrator T289117
[14:01:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:52] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:02:49] <wikibugs>	 (CR) Jbond: "Example:" [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[14:03:16] <wikibugs>	 (PS7) Jbond: puppetmaster: puppet prometheus reporting [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585)
[14:03:49] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1021.eqiad.wmnet ` The log can be found in...
[14:03:56] <wikibugs>	 (PS8) Jbond: puppetmaster: puppet prometheus reporting [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585)
[14:04:36] <wikibugs>	 (CR) jerkins-bot: [V: -1] puppetmaster: puppet prometheus reporting [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[14:04:38] <icinga-wm>	 PROBLEM - Check systemd state on logstash2027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:04:38] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1136 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:04:48] <icinga-wm>	 PROBLEM - Check systemd state on db2114 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:04:52] <icinga-wm>	 PROBLEM - Check systemd state on mw1402 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:04:54] <icinga-wm>	 PROBLEM - Check systemd state on mw2290 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:04:55] <icinga-wm>	 PROBLEM - Check systemd state on mw1379 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:04:58] <icinga-wm>	 PROBLEM - Check systemd state on ganeti1009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:00] <icinga-wm>	 PROBLEM - Check systemd state on kafka-test1009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:06] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1110 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:10] <icinga-wm>	 PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:16] <icinga-wm>	 PROBLEM - Check systemd state on mw2274 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:18] <mutante>	 jbond: seems like related :)
[14:05:18] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1096 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:20] <icinga-wm>	 PROBLEM - Check systemd state on schema1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:20] <icinga-wm>	 PROBLEM - Check systemd state on cp2027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:28] <icinga-wm>	 PROBLEM - Check systemd state on sessionstore1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:30] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:42] <icinga-wm>	 PROBLEM - Check systemd state on db1113 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:42] <icinga-wm>	 PROBLEM - Check systemd state on mw2253 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:46] <icinga-wm>	 PROBLEM - Check systemd state on cp2030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:46] <icinga-wm>	 PROBLEM - Check systemd state on db1155 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:48] <icinga-wm>	 PROBLEM - Check systemd state on cp2032 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:48] <icinga-wm>	 PROBLEM - Check systemd state on backup1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:52] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:52] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:52] <icinga-wm>	 PROBLEM - Check systemd state on urldownloader1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:52] <icinga-wm>	 PROBLEM - Check systemd state on db1147 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:54] <icinga-wm>	 PROBLEM - Check systemd state on an-conf1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:04] <icinga-wm>	 PROBLEM - Check systemd state on mx1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:06] <icinga-wm>	 PROBLEM - Check systemd state on ganeti1011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:08] <icinga-wm>	 PROBLEM - Check systemd state on db1153 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:10] <icinga-wm>	 PROBLEM - Check systemd state on ganeti2014 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:10] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1139 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:12] <icinga-wm>	 PROBLEM - Check systemd state on cloudelastic1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:12] <icinga-wm>	 PROBLEM - Check systemd state on restbase1019 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:13] <mutante>	 like in the old days when we had non-summarized puppet reports. let me stop the bot
[14:06:14] <icinga-wm>	 PROBLEM - Check systemd state on mw1394 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:14] <icinga-wm>	 PROBLEM - Check systemd state on mw1377 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:16] <icinga-wm>	 PROBLEM - Check systemd state on sessionstore1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:20] <icinga-wm>	 PROBLEM - Check systemd state on kubestagetcd1004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:25] <icinga-wm>	 PROBLEM - Check systemd state on restbase1020 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:26] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1017 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:26] <icinga-wm>	 PROBLEM - Check systemd state on mw1374 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:28] <icinga-wm>	 PROBLEM - Check systemd state on mw2330 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:30] <icinga-wm>	 PROBLEM - Check systemd state on cp5015 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:32] <icinga-wm>	 PROBLEM - Check systemd state on es2021 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:32] <icinga-wm>	 PROBLEM - Check systemd state on es2025 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:32] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1029 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:34] <icinga-wm>	 PROBLEM - Check systemd state on mw2327 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:36] <icinga-wm>	 PROBLEM - Check systemd state on mw2252 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:36] <icinga-wm>	 PROBLEM - Check systemd state on db1096 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:36] <icinga-wm>	 PROBLEM - Check systemd state on rdb2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:52] <icinga-wm>	 PROBLEM - Check systemd state on wtp1044 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:56] <icinga-wm>	 PROBLEM - Check systemd state on cp1075 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:58] <icinga-wm>	 PROBLEM - Check systemd state on cp1077 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:00] <icinga-wm>	 PROBLEM - Check systemd state on db2089 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:00] <icinga-wm>	 PROBLEM - Check systemd state on db1108 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:00] <icinga-wm>	 PROBLEM - Check systemd state on db1127 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:02] <icinga-wm>	 PROBLEM - Check systemd state on cp3058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:04] <icinga-wm>	 PROBLEM - Check systemd state on kafka-main2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. An error occured trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:06] <icinga-wm>	 PROBLEM - Check systemd state on mw1353 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:10] <icinga-wm>	 PROBLEM - Check systemd state on mc2023 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:10] <icinga-wm>	 PROBLEM - Check systemd state on archiva1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:21] <mutante>	 !log temp killed icinga-wm because of flooding 
[14:07:24] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware: decommission pc2010.codfw.wmnet - https://phabricator.wikimedia.org/T289117 (MatthewVernon) a:MatthewVernon→Papaul
[14:07:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1022.eqiad.wmnet ` The log can be found in...
[14:07:30] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware: decommission pc2010.codfw.wmnet - https://phabricator.wikimedia.org/T289117 (MatthewVernon) This host is ready for DC-Ops to decommission
[14:07:36] <icinga-wm>	 PROBLEM - Check systemd state on logstash2034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:36] <icinga-wm>	 PROBLEM - Check systemd state on mw1310 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:36] <icinga-wm>	 PROBLEM - Check systemd state on dbproxy1014 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:38] <icinga-wm>	 PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:40] <icinga-wm>	 PROBLEM - Check systemd state on mw1412 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:40] <icinga-wm>	 PROBLEM - Check systemd state on mw1450 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:42] <icinga-wm>	 PROBLEM - Check systemd state on kafka-main1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:42] <icinga-wm>	 PROBLEM - Check systemd state on an-druid1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:44] <icinga-wm>	 PROBLEM - Check systemd state on ganeti2025 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:44] <icinga-wm>	 PROBLEM - Check systemd state on mw2362 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:44] <icinga-wm>	 PROBLEM - Check systemd state on ores2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:44] <icinga-wm>	 PROBLEM - Check systemd state on mw1453 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:45] <icinga-wm>	 PROBLEM - Check systemd state on mc1045 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:46] <icinga-wm>	 PROBLEM - Check systemd state on logstash2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:48] <icinga-wm>	 PROBLEM - Check systemd state on registry1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:48] <icinga-wm>	 PROBLEM - Check systemd state on pybal-test2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:48] <icinga-wm>	 PROBLEM - Check systemd state on mw1351 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:48] <icinga-wm>	 PROBLEM - Check systemd state on registry2004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:54] <icinga-wm>	 PROBLEM - Check systemd state on search-loader1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:08:08] <mutante>	 !log alert1001 - temp disabled puppet, stopped icinga-wm
[14:08:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1023.eqiad.wmnet ` The log can be found in...
[14:08:57] <mutante>	 jbond: ^ silenced it, can restart when needed
[14:09:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1024.eqiad.wmnet ` The log can be found in...
[14:09:22] <jynus>	 thanks, mutante I cannot be so fast
[14:09:44] <jbond>	 mutante: ack thanks
[14:15:06] <wikibugs>	 (PS1) Marostegui: check_flags_per_dc.sh: One liner to check a few things [software] - https://gerrit.wikimedia.org/r/719270 (https://phabricator.wikimedia.org/T288594)
[14:15:36] <wikibugs>	 (PS1) Jbond: prometheus: fix regex when parsing git hash [puppet] - https://gerrit.wikimedia.org/r/719271
[14:15:52] <wikibugs>	 (PS2) Marostegui: check_flags_per_dc.sh: One liner to check a few things [software] - https://gerrit.wikimedia.org/r/719270 (https://phabricator.wikimedia.org/T288594)
[14:16:18] <wikibugs>	 SRE, SRE-OnFire, observability, User-jbond: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 (Volans) >>! In T290425#7336136, @Volans wrote: > @cmooney thanks for looking into this! I'm no Java expert but the reference to BigDecimal:Class seems to be a Java...
[14:17:33] <marostegui>	 !log No more db maintenance on eqiad T288594
[14:17:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:38] <stashbot>	 T288594: Pre DC switchover codfw -> eqiad DB work - https://phabricator.wikimedia.org/T288594
[14:17:59] <XioNoX>	 Lumen circuit between eqiad and esams hot cut in progress
[14:18:39] <wikibugs>	 SRE, SRE-OnFire, observability, User-jbond: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 (cmooney) Hey @volans nice catch!  Let me see how it goes with rounded values.
[14:19:25] <XioNoX>	 time=81.643ms
[14:19:27] <wikibugs>	 (CR) Marostegui: [C: +2] check_flags_per_dc.sh: One liner to check a few things [software] - https://gerrit.wikimedia.org/r/719270 (https://phabricator.wikimedia.org/T288594) (owner: Marostegui)
[14:19:29] <wikibugs>	 (CR) Volans: [C: -1] "missing parentheses for method call" [puppet] - https://gerrit.wikimedia.org/r/719271 (owner: Jbond)
[14:19:39] <XioNoX>	 better than the 110ms from before
[14:20:29] <wikibugs>	 (PS2) Jbond: prometheus: fix regex when parsing git hash [puppet] - https://gerrit.wikimedia.org/r/719271
[14:21:57] <wikibugs>	 (PS3) Jbond: prometheus: fix regex when parsing git hash [puppet] - https://gerrit.wikimedia.org/r/719271
[14:22:35] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1023.eqiad.wmnet with reason: REIMAGE
[14:22:37] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcephosd1023.eqiad.wmnet with reason: REIMAGE
[14:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:51] <wikibugs>	 (CR) Jbond: prometheus: fix regex when parsing git hash (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719271 (owner: Jbond)
[14:23:06] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1024.eqiad.wmnet with reason: REIMAGE
[14:23:08] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcephosd1024.eqiad.wmnet with reason: REIMAGE
[14:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:42] <wikibugs>	 (CR) Jbond: [C: +2] prometheus: fix regex when parsing git hash [puppet] - https://gerrit.wikimedia.org/r/719271 (owner: Jbond)
[14:23:54] <XioNoX>	 !log re-pool esams-eqiad - T288503
[14:23:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:28] <wikibugs>	 (PS4) JMeybohm: custom_deploy: Add istio manifest for main clusters [deployment-charts] - https://gerrit.wikimedia.org/r/719265 (https://phabricator.wikimedia.org/T287007)
[14:24:30] <wikibugs>	 (PS1) JMeybohm: Rakefile: Add task validate_istio_config [deployment-charts] - https://gerrit.wikimedia.org/r/719272
[14:25:30] <wikibugs>	 (PS1) Jbond: prometheus: fix strip [puppet] - https://gerrit.wikimedia.org/r/719273
[14:26:48] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/719273 (owner: Jbond)
[14:27:03] <wikibugs>	 (CR) Jbond: [C: +2] prometheus: fix strip [puppet] - https://gerrit.wikimedia.org/r/719273 (owner: Jbond)
[14:28:16] <wikibugs>	 (PS3) Dzahn: create a generic class to clean the puppet client bucket [puppet] - https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885)
[14:28:40] <wikibugs>	 (PS4) Dzahn: create a generic class to clean the puppet client bucket [puppet] - https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885)
[14:29:57] <wikibugs>	 (CR) Dzahn: "I could either merge it like this or abandon .. hmm..." [puppet] - https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) (owner: Dzahn)
[14:32:36] <wikibugs>	 (CR) Dzahn: [C: +2] zuul: migrate cron of zuul_repack to systemd timer [puppet] - https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[14:32:40] <wikibugs>	 SRE, SRE-OnFire, observability, User-jbond: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 (cmooney) Spot on @volans works fine now: ` cmooney@wikilap:~/statograph_test$ statograph -v -c config.yaml upload_metrics INFO:statograph.uploader:Querying data for...
[14:33:11] <mutante>	 !log CI - migrating zuul-merger cronjob to systemd timer (contint*)
[14:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:23] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1024.eqiad.wmnet'] `  and were **ALL** successful.
[14:35:07] <wikibugs>	 (PS1) Jbond: prometheous: use correct variable config_yaml vs config_file [puppet] - https://gerrit.wikimedia.org/r/719274
[14:35:24] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] prometheous: use correct variable config_yaml vs config_file [puppet] - https://gerrit.wikimedia.org/r/719274 (owner: Jbond)
[14:36:10] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1023.eqiad.wmnet'] `  and were **ALL** successful.
[14:37:23] <wikibugs>	 (PS4) Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835)
[14:38:09] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Try restarting rsyslog on package installation [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/719231 (https://phabricator.wikimedia.org/T210137) (owner: Filippo Giunchedi)
[14:38:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1021.eqiad.wmnet'] `  Of which those **FAILED**: ` ['cloudcephosd1021.eqiad.wmnet'] `
[14:38:30] <wikibugs>	 (CR) Dzahn: "deployed! confirmed on contint1001 and contint2001:" [puppet] - https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[14:38:38] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1021.eqiad.wmnet ` The log can be found in...
[14:39:08] <wikibugs>	 (PS5) Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835)
[14:40:14] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, puppet-compiler, and 2 others: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (Dzahn) migrated zuul_repack (zuul::merger) on contint* servers
[14:40:31] <wikibugs>	 (PS1) Ladsgroup: zuul: Drop absented cron [puppet] - https://gerrit.wikimedia.org/r/719275
[14:41:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] zuul: Drop absented cron [puppet] - https://gerrit.wikimedia.org/r/719275 (owner: Ladsgroup)
[14:41:15] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, puppet-compiler, and 2 others: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (Ladsgroup) Thanks! I added the patch to drop it.
[14:41:17] <wikibugs>	 (CR) Zabe: [C: +1] "Thanks" [puppet] - https://gerrit.wikimedia.org/r/719275 (owner: Ladsgroup)
[14:41:52] <wikibugs>	 (PS2) Ladsgroup: zuul: Drop absented cron [puppet] - https://gerrit.wikimedia.org/r/719275 (https://phabricator.wikimedia.org/T273673)
[14:42:41] <wikibugs>	 (CR) Dzahn: [C: +2] "checked if I can see any sign of ci::master being deployed on more than contint, like cloud, but see nothing" [puppet] - https://gerrit.wikimedia.org/r/719275 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[14:46:45] <wikibugs>	 (CR) Jbond: create a generic class to clean the puppet client bucket (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) (owner: Dzahn)
[14:48:10] <wikibugs>	 (CR) Dzahn: create a generic class to clean the puppet client bucket (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) (owner: Dzahn)
[14:48:12] <wikibugs>	 (CR) Dzahn: [C: +2] create a generic class to clean the puppet client bucket [puppet] - https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) (owner: Dzahn)
[14:52:04] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] sslcert: additional search paths for certificates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: Filippo Giunchedi)
[14:52:13] <wikibugs>	 (CR) Elukey: [C: +1] "Really nice!" [deployment-charts] - https://gerrit.wikimedia.org/r/719272 (owner: JMeybohm)
[14:56:53] <wikibugs>	 (CR) Elukey: [C: +1] "Overall it LGTM. One thing worth to add as comment may be the possibility to scale the number of pod replicas, but for the basic test/star" [deployment-charts] - https://gerrit.wikimedia.org/r/719265 (https://phabricator.wikimedia.org/T287007) (owner: JMeybohm)
[14:59:46] <wikibugs>	 (PS1) Alexandros Kosiaris: mwdebug: Add IPv6 addresses of etcd servers to egress [deployment-charts] - https://gerrit.wikimedia.org/r/719278
[15:00:10] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] mwdebug: Add IPv6 addresses of etcd servers to egress [deployment-charts] - https://gerrit.wikimedia.org/r/719278 (owner: Alexandros Kosiaris)
[15:02:56] <wikibugs>	 (Merged) jenkins-bot: mwdebug: Add IPv6 addresses of etcd servers to egress [deployment-charts] - https://gerrit.wikimedia.org/r/719278 (owner: Alexandros Kosiaris)
[15:04:09] <jbond>	 !log upload python-prometheus-client_0.6.0 to stretch-wikimedia
[15:04:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:14] <wikibugs>	 (CR) Ahmon Dancy: [C: +2] mediawiki-dev: Adjustments to allow for clean "rake helm_diff" run [deployment-charts] - https://gerrit.wikimedia.org/r/717621 (owner: Ahmon Dancy)
[15:07:19] <wikibugs>	 SRE, MW-on-K8s, Performance-Team, WikimediaDebug, serviceops: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (Krinkle)
[15:07:29] <wikibugs>	 SRE, MW-on-K8s, SRE Observability, serviceops: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (Krinkle)
[15:07:43] <wikibugs>	 (CR) Ahmon Dancy: [C: +2] check_binary: Improve error message [deployment-charts] - https://gerrit.wikimedia.org/r/717605 (owner: Ahmon Dancy)
[15:07:59] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:08:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:12] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1022.eqiad.wmnet'] `  Of which those **FAILED**: ` ['cloudcephosd1022.eqiad.wmnet'] `
[15:09:05] <wikibugs>	 (Merged) jenkins-bot: mediawiki-dev: Adjustments to allow for clean "rake helm_diff" run [deployment-charts] - https://gerrit.wikimedia.org/r/717621 (owner: Ahmon Dancy)
[15:10:25] <wikibugs>	 SRE, MW-on-K8s, Performance-Team, WikimediaDebug, serviceops: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (Krinkle) @dpifke Effie did some benchmarking today for which XHGui was needed. tideways is installed and enabled...
[15:10:32] <wikibugs>	 (Merged) jenkins-bot: check_binary: Improve error message [deployment-charts] - https://gerrit.wikimedia.org/r/717605 (owner: Ahmon Dancy)
[15:16:29] <wikibugs>	 (CR) Btullis: [C: +1] Update puppetised java.security file for Java 11.0.12 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719064 (owner: Muehlenhoff)
[15:18:54] <akosiaris>	 !log run_benchmarky.py against mwdebug.svc.codfw.wmnet for performance tests
[15:18:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:47] <wikibugs>	 (CR) Muehlenhoff: Update puppetised java.security file for Java 11.0.12 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/719064 (owner: Muehlenhoff)
[15:21:19] <dancy>	 Hi operations folks.  I think I may have gotten the 'jenkins' k8s user auto-banned on the staging cluster.   All k8s requests that I'm sending are being rejected with "Forbidden".  Can someone have a look?
[15:22:57] <wikibugs>	 (CR) Ema: [C: +1] varnish: Remove Vagrant test scripts [puppet] - https://gerrit.wikimedia.org/r/719236 (https://phabricator.wikimedia.org/T286639) (owner: MMandere)
[15:23:32] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:23:36] <wikibugs>	 SRE, Observability-Alerting, Traffic: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (lmata)
[15:24:47] <wikibugs>	 (PS1) Cathal Mooney: Added __post_init__ function to Datapoint class to round values to 9 decimal places.  This is required to avoid apparent limit on what the statuspage.io API will accept. [software/statograph] - https://gerrit.wikimedia.org/r/719281 (https://phabricator.wikimedia.org/T290425)
[15:24:50] <wikibugs>	 (PS1) Vgutierrez: haproxy: Allow using a custom systemd::service template [puppet] - https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005)
[15:25:49] <wikibugs>	 SRE, Icinga, Observability-Alerting, observability: Extend dpkg Icinga check to also check for inconsistent apt state - https://phabricator.wikimedia.org/T190693 (lmata)
[15:26:01] <wikibugs>	 (CR) jerkins-bot: [V: -1] Added __post_init__ function to Datapoint class to round values to 9 decimal places.  This is required to avoid apparent limit on what the statuspage.io API will accept. [software/statograph] - https://gerrit.wikimedia.org/r/719281 (https://phabricator.wikimedia.org/T290425) (owner: Cathal Mooney)
[15:28:32] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31018/console" [puppet] - https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[15:30:09] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +1] "pcc shows the expected DIFF at puppet level (added parameters to the haproxy class) and a NOOP at haproxy level" [puppet] - https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[15:31:14] <wikibugs>	 (PS1) Dzahn: swift: convert dispersion stats cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673)
[15:32:02] <wikibugs>	 (CR) jerkins-bot: [V: -1] swift: convert dispersion stats cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[15:33:47] <wikibugs>	 (PS2) Dzahn: swift: convert dispersion stats cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673)
[15:34:33] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: route alertmanager logs to alerts index [puppet] - https://gerrit.wikimedia.org/r/717442 (https://phabricator.wikimedia.org/T289356) (owner: Cwhite)
[15:35:03] <wikibugs>	 (PS3) Dzahn: swift: convert dispersion stats cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673)
[15:36:25] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] haproxy: Allow using a custom systemd::service template [puppet] - https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[15:39:30] <wikibugs>	 (PS2) Cathal Mooney: Added __post_init__ function to Datapoint class to round values to 9 decimal places.  This is required to avoid apparent limit on what the statuspage.io API will accept. [software/statograph] - https://gerrit.wikimedia.org/r/719281 (https://phabricator.wikimedia.org/T290425)
[15:40:03] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] haproxy: Allow using a custom systemd::service template [puppet] - https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[15:40:08] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Upgrade lists.wikimedia.org to next Mailman/hyperkitty/postorius versions - https://phabricator.wikimedia.org/T286217 (Legoktm) Stalled→Open p:Lowest→Medium postorius 1.3.5 was released, in addition to the unsubscribe security fix we already have: https://doc...
[15:40:11] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Upstream: Mailman doesn't replace email in notice when changing subscription email - https://phabricator.wikimedia.org/T286149 (Legoktm)
[15:40:17] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Poor link parsing in HyperKitty (Mailman 3) web archive - https://phabricator.wikimedia.org/T283909 (Legoktm)
[15:40:23] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Upstream: In Mailman3, users cannot change their display name from the web - https://phabricator.wikimedia.org/T283128 (Legoktm)
[15:40:58] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1371.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:44:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1021.eqiad.wmnet'] `  Of which those **FAILED**: ` ['cloudcephosd1021.eqiad.wmnet'] `
[15:46:32] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:49:07] <wikibugs>	 Puppet, SRE, Cloud-Services, Infrastructure-Foundations, and 2 others: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (jbond) >>! In T165885#7314759, @elukey wrote: > @jbond sure! Question - is there a problem with the /var/log/camus directories...
[15:49:17] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, nit on the commit message" [software/statograph] - https://gerrit.wikimedia.org/r/719281 (https://phabricator.wikimedia.org/T290425) (owner: Cathal Mooney)
[15:56:22] <wikibugs>	 SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission pc1007.eqiad.wmnet. - https://phabricator.wikimedia.org/T289118 (wiki_willy) a:wiki_willy→Cmjohnson
[15:56:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, decommission-hardware, Patch-For-Review: decommission pc1008.eqiad.wmnet - https://phabricator.wikimedia.org/T289119 (wiki_willy) a:wiki_willy→Cmjohnson
[15:57:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission pc1009.eqiad.wmnet - https://phabricator.wikimedia.org/T289120 (wiki_willy) a:wiki_willy→Cmjohnson
[15:57:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, decommission-hardware, Patch-For-Review: decommission pc1010.eqiad.wmnet - https://phabricator.wikimedia.org/T289122 (wiki_willy) a:wiki_willy→Cmjohnson
[16:00:05] <jouncebot>	 jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210907T1600).
[16:00:05] <jouncebot>	 tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:19] <tgr>	 o/
[16:00:38] <jbond>	 tgr: looking now
[16:01:40] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:01:54] <wikibugs>	 (CR) Jbond: [C: +2] "LGTM merging" [puppet] - https://gerrit.wikimedia.org/r/716755 (https://phabricator.wikimedia.org/T283868) (owner: Gergő Tisza)
[16:03:29] <wikibugs>	 (CR) Bstorm: [C: +1] "Doesn't seem controversial" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/713812 (owner: David Caro)
[16:05:03] <jbond>	 tgr: merged and deployed to mwmaint1002 https://phabricator.wikimedia.org/P17249
[16:05:12] <tgr>	 thanks jbond!
[16:05:17] <jbond>	 np, let me know if there is anything else you need
[16:09:49] <wikibugs>	 (PS4) Muehlenhoff: os-updates-report: Adapt to new OS tracking (WIP) [puppet] - https://gerrit.wikimedia.org/r/707371
[16:09:55] <wikibugs>	 SRE, MW-on-K8s, Performance-Team, WikimediaDebug, serviceops: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (jijiki)
[16:10:14] <wikibugs>	 (PS1) Dzahn: cloud/devtools: set docker::registry to localhost [puppet] - https://gerrit.wikimedia.org/r/719292
[16:12:17] <wikibugs>	 (CR) jerkins-bot: [V: -1] os-updates-report: Adapt to new OS tracking (WIP) [puppet] - https://gerrit.wikimedia.org/r/707371 (owner: Muehlenhoff)
[16:13:08] <wikibugs>	 (PS3) Cathal Mooney: Round float values to a fixed precision [software/statograph] - https://gerrit.wikimedia.org/r/719281 (https://phabricator.wikimedia.org/T290425)
[16:13:53] <wikibugs>	 (CR) Cathal Mooney: [C: +2] "Merging." [software/statograph] - https://gerrit.wikimedia.org/r/719281 (https://phabricator.wikimedia.org/T290425) (owner: Cathal Mooney)
[16:14:34] <wikibugs>	 SRE, MW-on-K8s, serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki)
[16:15:38] <wikibugs>	 (Merged) jenkins-bot: Round float values to a fixed precision [software/statograph] - https://gerrit.wikimedia.org/r/719281 (https://phabricator.wikimedia.org/T290425) (owner: Cathal Mooney)
[16:18:06] <wikibugs>	 (PS1) Jbond: P:puppet: Add alerting for large files in client bucket [puppet] - https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885)
[16:18:18] <wikibugs>	 (CR) Muehlenhoff: haproxy: Allow using a custom systemd::service template (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[16:19:06] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:puppet: Add alerting for large files in client bucket [puppet] - https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: Jbond)
[16:19:26] <wikibugs>	 (PS2) JMeybohm: Rakefile: Add task validate_istio_config [deployment-charts] - https://gerrit.wikimedia.org/r/719272
[16:19:28] <wikibugs>	 (PS5) JMeybohm: custom_deploy: Add istio manifest for main clusters [deployment-charts] - https://gerrit.wikimedia.org/r/719265 (https://phabricator.wikimedia.org/T287007)
[16:19:30] <wikibugs>	 (PS1) JMeybohm: admin_ng: Support managing of system namespaces with helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/719295
[16:19:32] <wikibugs>	 (PS1) JMeybohm: admin_ng/main: Create istio-system namespace [deployment-charts] - https://gerrit.wikimedia.org/r/719296
[16:20:37] <wikibugs>	 (PS2) Jbond: P:puppet: Add alerting for large files in client bucket [puppet] - https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885)
[16:21:45] <wikibugs>	 (PS2) JMeybohm: admin_ng/main: Create istio-system namespace [deployment-charts] - https://gerrit.wikimedia.org/r/719296
[16:23:04] <icinga-wm>	 PROBLEM - Check systemd state on cp4030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:15] <wikibugs>	 (PS1) Jgiannelos: push-notifications: Bump image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/719297
[16:26:28] <icinga-wm>	 PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:33] <wikibugs>	 (CR) Dzahn: [C: +2] cloud/devtools: set docker::registry to localhost [puppet] - https://gerrit.wikimedia.org/r/719292 (owner: Dzahn)
[16:30:23] <logmsgbot>	 !log dancy@deploy1002 Synchronized README: testing (duration: 00m 59s)
[16:30:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:49] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue
[16:30:50] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue
[16:30:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:00] <wikibugs>	 (PS3) Jbond: P:puppet: Add alerting for large files in client bucket [puppet] - https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885)
[16:32:54] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31022/console" [puppet] - https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: Jbond)
[16:33:52] <wikibugs>	 (PS2) JMeybohm: admin_ng: Support managing of system namespaces with helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/719295
[16:33:54] <wikibugs>	 (PS3) JMeybohm: admin_ng/main: Create istio-system namespace [deployment-charts] - https://gerrit.wikimedia.org/r/719296
[16:36:08] <wikibugs>	 (CR) Jgiannelos: [C: +2] push-notifications: Bump image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/719297 (owner: Jgiannelos)
[16:39:23] <wikibugs>	 (Merged) jenkins-bot: push-notifications: Bump image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/719297 (owner: Jgiannelos)
[16:39:42] <wikibugs>	 (PS1) Bstorm: quarry dbbackup: fix the script typo [puppet] - https://gerrit.wikimedia.org/r/719301 (https://phabricator.wikimedia.org/T289568)
[16:39:44] <moritzm>	 !log installing jetty9 security updates on buster
[16:39:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:35] <wikibugs>	 (PS3) JMeybohm: admin_ng: Support managing of system namespaces with helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/719295
[16:41:37] <wikibugs>	 (PS4) JMeybohm: admin_ng/main: Create istio-system namespace [deployment-charts] - https://gerrit.wikimedia.org/r/719296
[16:41:41] <wikibugs>	 (CR) Bstorm: [C: +2] quarry dbbackup: fix the script typo [puppet] - https://gerrit.wikimedia.org/r/719301 (https://phabricator.wikimedia.org/T289568) (owner: Bstorm)
[16:42:56] <wikibugs>	 (PS1) Jbond: P:base: drop broad dependency [puppet] - https://gerrit.wikimedia.org/r/719302 (https://phabricator.wikimedia.org/T244477)
[16:45:07] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31023/console" [puppet] - https://gerrit.wikimedia.org/r/719302 (https://phabricator.wikimedia.org/T244477) (owner: Jbond)
[16:46:15] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] P:base: drop broad dependency [puppet] - https://gerrit.wikimedia.org/r/719302 (https://phabricator.wikimedia.org/T244477) (owner: Jbond)
[16:48:18] <wikibugs>	 (PS9) Jbond: puppetmaster: puppet prometheus reporting [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585)
[16:49:02] <wikibugs>	 (CR) jerkins-bot: [V: -1] puppetmaster: puppet prometheus reporting [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[16:50:12] <wikibugs>	 (CR) Gergő Tisza: Growth: Remove config that moved on-wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295) (owner: Urbanecm)
[16:51:07] <wikibugs>	 (PS10) Jbond: puppetmaster: puppet prometheus reporting [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585)
[16:51:44] <icinga-wm>	 RECOVERY - Check systemd state on cp4030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:00:05] <jouncebot>	 chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210907T1700).
[17:01:28] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
[17:01:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:46] <icinga-wm>	 RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:09:44] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
[17:09:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:53] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
[17:18:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:23] <wikibugs>	 (PS1) Ahmon Dancy: ::profile::mediawiki::common.pp: Allow mwdeploy to run sudo /usr/local/sbin/restart-php7.2-fpm --force [puppet] - https://gerrit.wikimedia.org/r/719307 (https://phabricator.wikimedia.org/T290038)
[17:29:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] ::profile::mediawiki::common.pp: Allow mwdeploy to run sudo /usr/local/sbin/restart-php7.2-fpm --force [puppet] - https://gerrit.wikimedia.org/r/719307 (https://phabricator.wikimedia.org/T290038) (owner: Ahmon Dancy)
[17:31:42] <wikibugs>	 (PS2) Ahmon Dancy: Allow mwdeploy to run sudo /usr/local/sbin/restart-php7.2-fpm --force [puppet] - https://gerrit.wikimedia.org/r/719307 (https://phabricator.wikimedia.org/T290038)
[17:35:51] <wikibugs>	 (PS3) Ahmon Dancy: Allow mwdeploy to run sudo /usr/local/sbin/restart-php7.2-fpm --force [puppet] - https://gerrit.wikimedia.org/r/719307 (https://phabricator.wikimedia.org/T290038)
[17:43:04] <wikibugs>	 (CR) Jdlrobson: "To clarify: Should this be merged today or next Monday to make sure Italian is a group 1  wiki for next week's deploy?" [mediawiki-config] - https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: Jdlrobson)
[17:43:46] <wikibugs>	 (PS5) RLazarus: icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803)
[17:47:07] <wikibugs>	 (CR) RhinosF1: Italian Wikipedia is now a group 1 wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: Jdlrobson)
[17:48:44] <RhinosF1>	 Jdlrobson: hi
[17:49:00] <RhinosF1>	 I can explain what I meant better if you have Qs
[17:49:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[17:51:44] <wikibugs>	 (PS6) RLazarus: icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803)
[17:56:58] <wikibugs>	 (CR) jerkins-bot: [V: -1] icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[17:58:35] <wikibugs>	 (PS7) RLazarus: icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803)
[18:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210907T1800).
[18:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[18:02:12] <wikibugs>	 SRE, MW-on-K8s, serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki) Our initial benchmarks that @akosiaris showed that k8s was slower than baremetal, while at higher concurrencies the difference between the two was smaller. We have observed our b...
[18:04:53] <wikibugs>	 (CR) RLazarus: icinga: Add downtime_services and remove_service_downtimes (6 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[18:08:02] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 279 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:09:56] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 49 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:17:29] <wikibugs>	 (PS3) RLazarus: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries [cookbooks] - https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803)
[18:28:37] <wikibugs>	 (CR) RLazarus: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[18:49:05] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (Cmjohnson)
[18:57:52] <wikibugs>	 SRE, MW-on-K8s, Performance-Team, WikimediaDebug, serviceops: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (jijiki)
[18:58:02] <wikibugs>	 SRE, MW-on-K8s, serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki)
[18:58:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (Cmjohnson) cloudcephosd1023 and 1024 installed and are set to staged.  1021 and 1022 both in C8 get stuck during the partitioning phase of the install.  I need to...
[19:03:24] <wikibugs>	 (PS1) Bstorm: cloud nfs: Update the drbd config to allow buster+ [puppet] - https://gerrit.wikimedia.org/r/719326 (https://phabricator.wikimedia.org/T283385)
[19:04:08] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 51.02 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[19:06:04] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[19:07:40] <wikibugs>	 (CR) Bstorm: [C: +2] "I'm just going to merge this to unblock experiments, let me know if you have any nits to suggest, and I'll add that in another patch." [puppet] - https://gerrit.wikimedia.org/r/719326 (https://phabricator.wikimedia.org/T283385) (owner: Bstorm)
[19:18:54] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1128.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:19:08] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1138.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:27:30] <wikibugs>	 SRE, Wikimedia-Mailing-lists, User-Ladsgroup: Create coolest-tool-academy mailing list for Coolest Tool Award - https://phabricator.wikimedia.org/T290511 (Ladsgroup) Open→Resolved a:Ladsgroup Done
[19:27:32] <wikibugs>	 (CR) Eevans: [C: -1] "See comments inline." [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/719051 (https://phabricator.wikimedia.org/T178169) (owner: Hnowlan)
[19:31:22] <wikibugs>	 (CR) Eevans: [C: -1] "I wonder if T178169 should even be considered valid.  The utilities in `cassandra-tools-wmf` were meant to support multi-instance (which h" [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/719051 (https://phabricator.wikimedia.org/T178169) (owner: Hnowlan)
[20:00:33] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, thanks!" [software/spicerack] - https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[20:01:17] <wikibugs>	 (CR) RLazarus: [C: +2] "Thanks for the review!" [software/spicerack] - https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[20:02:57] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[20:10:04] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[20:10:05] <wikibugs>	 (Merged) jenkins-bot: icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[20:15:50] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[20:27:38] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[20:27:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:57] <wikibugs>	 (CR) Ladsgroup: [C: -1] "Generally looks fine, just this note." [puppet] - https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[20:29:19] <wikibugs>	 (PS1) RLazarus: icinga: Add @services_downtimed decorator [software/spicerack] - https://gerrit.wikimedia.org/r/719356
[20:31:01] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:31:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:49] <wikibugs>	 (PS2) Ladsgroup: Set $wgWBRepoSettings['tmpNormalizeDataValues'] on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/715018 (https://phabricator.wikimedia.org/T251480) (owner: Lucas Werkmeister (WMDE))
[20:35:08] <Amir1>	 jouncebot: now
[20:35:08] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 24 minute(s)
[20:35:17] <Amir1>	 cool deploying this patch above
[20:36:06] <wikibugs>	 (CR) Ladsgroup: [C: +2] Set $wgWBRepoSettings['tmpNormalizeDataValues'] on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/715018 (https://phabricator.wikimedia.org/T251480) (owner: Lucas Werkmeister (WMDE))
[20:37:12] <wikibugs>	 (Merged) jenkins-bot: Set $wgWBRepoSettings['tmpNormalizeDataValues'] on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/715018 (https://phabricator.wikimedia.org/T251480) (owner: Lucas Werkmeister (WMDE))
[20:40:09] <Amir1>	 Tested on mwdebug2002, works fine, moving forward
[20:41:15] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:715018|Set $wgWBRepoSettings['tmpNormalizeDataValues'] on all wikis (T251480)]] (duration: 00m 59s)
[20:41:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:21] <stashbot>	 T251480: Normalize pagenames/filenames on save in Wikibase - https://phabricator.wikimedia.org/T251480
[20:52:42] <wikibugs>	 (PS1) Brennen Bearnes: dev-images: migrate repository to gitlab remote [puppet] - https://gerrit.wikimedia.org/r/719363 (https://phabricator.wikimedia.org/T290259)
[20:57:26] <wikibugs>	 (CR) Ebernhardson: "PCC looks as expected, only concrete change on the hosts is adding the rewrite clauses to the conf files. https://puppet-compiler.wmflabs." [puppet] - https://gerrit.wikimedia.org/r/714624 (https://phabricator.wikimedia.org/T280247) (owner: Ebernhardson)
[21:08:41] <wikibugs>	 (PS1) Nikki Nikkhoui: Remove image suggestion api from lookup table [puppet] - https://gerrit.wikimedia.org/r/719366 (https://phabricator.wikimedia.org/T288132)
[21:09:02] <wikibugs>	 (CR) Cwhite: [C: +1] Suspend mmkubernetes on connection errors [debs/rsyslog] (debian/stretch-wikimedia) - https://gerrit.wikimedia.org/r/715227 (https://phabricator.wikimedia.org/T289766) (owner: JMeybohm)
[21:10:58] <wikibugs>	 (PS1) Jbond: puppetmaster: drop log messages from logstash reporter [puppet] - https://gerrit.wikimedia.org/r/719368
[21:26:12] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db1150 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:33:38] <wikibugs>	 (PS1) Jbond: P:puppetmaster::common: Add back logstash support [puppet] - https://gerrit.wikimedia.org/r/719372
[21:39:59] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, thanks for the follow up, 2 very minor nits on commit/docs, no need to review again" [software/spicerack] - https://gerrit.wikimedia.org/r/719356 (owner: RLazarus)
[21:57:16] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db2139 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:01:56] <wikibugs>	 (PS4) Legoktm: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries [cookbooks] - https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[22:03:18] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2032 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:09:42] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on ms-be1051 - https://phabricator.wikimedia.org/T290442 (wiki_willy) a:Cmjohnson In warranty thru 2022-08-07
[22:10:42] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on ms-be1062 - https://phabricator.wikimedia.org/T290416 (wiki_willy) a:Cmjohnson In warranty thru 2023-10-27
[22:21:52] <wikibugs>	 (CR) Jforrester: [C: +1] dev-images: migrate repository to gitlab remote [puppet] - https://gerrit.wikimedia.org/r/719363 (https://phabricator.wikimedia.org/T290259) (owner: Brennen Bearnes)
[22:24:53] <wikibugs>	 (PS1) Andrew Bogott: vendordata: explicitly remove ephemeral0 from cloud-init mounting [puppet] - https://gerrit.wikimedia.org/r/719376 (https://phabricator.wikimedia.org/T290372)
[22:26:18] <wikibugs>	 (CR) Andrew Bogott: [C: +2] vendordata: explicitly remove ephemeral0 from cloud-init mounting [puppet] - https://gerrit.wikimedia.org/r/719376 (https://phabricator.wikimedia.org/T290372) (owner: Andrew Bogott)
[22:28:20] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:39:26] <wikibugs>	 (CR) Cwhite: [C: +1] rsyslog: stop saving trafficserver logs to disk [puppet] - https://gerrit.wikimedia.org/r/719052 (https://phabricator.wikimedia.org/T290305) (owner: Ema)
[22:48:52] <wikibugs>	 (PS1) Ladsgroup: alertmanager: Send email on resolve for wikidata team [puppet] - https://gerrit.wikimedia.org/r/719380
[22:51:58] <icinga-wm>	 PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:53:11] <wikibugs>	 (CR) Cwhite: puppetmaster: puppet prometheus reporting (2 comments) [puppet] - https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: Jbond)
[22:54:52] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/719107 (https://phabricator.wikimedia.org/T281359) (owner: Filippo Giunchedi)
[22:55:33] <wikibugs>	 (CR) Cwhite: [C: +1] o11y: add udp receive errors for statsd [alerts] - https://gerrit.wikimedia.org/r/719123 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[22:55:52] <wikibugs>	 (CR) Cwhite: [C: +1] statsd: remove statsd_udp_inbound_errors [puppet] - https://gerrit.wikimedia.org/r/719124 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[22:56:52] <wikibugs>	 (CR) Cwhite: [C: +1] prometheus: add ThanosSidecarUploadFailure to prometheus/ops [puppet] - https://gerrit.wikimedia.org/r/719126 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi)
[22:57:22] <wikibugs>	 (CR) Cwhite: [C: +2] Remove image suggestion api from lookup table [puppet] - https://gerrit.wikimedia.org/r/719366 (https://phabricator.wikimedia.org/T288132) (owner: Nikki Nikkhoui)
[22:59:34] <wikibugs>	 (PS1) Ladsgroup: Enable UrlShortener everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/719381 (https://phabricator.wikimedia.org/T267925)
[23:00:03] <Amir1>	 jouncebot: now
[23:00:03] <jouncebot>	 For the next 0 hour(s) and 59 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210907T2300)
[23:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210907T2300).
[23:00:05] <jouncebot>	 dpifke: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:31] <dpifke>	 Here.  Patch is self service, will go ahead with it if no one else has anything that needs to go out.
[23:00:48] <Amir1>	 dpifke: please go ahead and let me know once you're done
[23:01:09] <Amir1>	 I'll be deploying url shortener stuff
[23:01:15] <wikibugs>	 (CR) Dave Pifke: [C: +2] profiler: use seperate pipeline inside k8s pods [mediawiki-config] - https://gerrit.wikimedia.org/r/716041 (https://phabricator.wikimedia.org/T288165) (owner: Dave Pifke)
[23:01:22] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (Jclark-ctr) ms-be1067 D4 U33 CABLEID#11042 PORT36
[23:01:33] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (Jclark-ctr)
[23:02:08] <wikibugs>	 (Merged) jenkins-bot: profiler: use seperate pipeline inside k8s pods [mediawiki-config] - https://gerrit.wikimedia.org/r/716041 (https://phabricator.wikimedia.org/T288165) (owner: Dave Pifke)
[23:07:36] <logmsgbot>	 !log dpifke@deploy1002 Synchronized wmf-config/profiler.php: Config: [[gerrit:716041|profiler: use seperate pipeline inside k8s pods (T288165)]] (duration: 00m 58s)
[23:07:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:41] <stashbot>	 T288165: Create separate ArcLamp pipeline for k8s-mwdebug - https://phabricator.wikimedia.org/T288165
[23:08:41] <dpifke>	 Amri1: Done.
[23:08:48] <dpifke>	 Amir1: Done
[23:08:57] <Amir1>	 awesome!
[23:09:02] <Amir1>	 Thanks
[23:09:26] <wikibugs>	 (CR) Ladsgroup: [C: +2] Enable UrlShortener everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/719381 (https://phabricator.wikimedia.org/T267925) (owner: Ladsgroup)
[23:10:22] <wikibugs>	 (Merged) jenkins-bot: Enable UrlShortener everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/719381 (https://phabricator.wikimedia.org/T267925) (owner: Ladsgroup)
[23:12:44] <Amir1>	 looks good on mwdebug2002, moving forward
[23:13:54] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:719381|Enable UrlShortener everywhere (T267925)]] (duration: 00m 58s)
[23:13:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:58] <stashbot>	 T267925: Allow displaying URL shortener link in sidebar for foreign wiki - https://phabricator.wikimedia.org/T267925
[23:15:49] <legoktm>	 gogogogo
[23:16:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (Jclark-ctr)
[23:16:40] <Amir1>	 legoktm: wohooo \o/
[23:19:53] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (Jclark-ctr) Finished Provision a server's network attributes script on netbox configured bios handing over to rob for hopefully finishing
[23:20:33] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (Jclark-ctr) a:Jclark-ctr→RobH
[23:20:51] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[23:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:49] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:25:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:34] <wikibugs>	 (PS1) RobH: ms-be1067 updates [puppet] - https://gerrit.wikimedia.org/r/719384 (https://phabricator.wikimedia.org/T285808)
[23:38:17] <wikibugs>	 (CR) RobH: [C: +2] ms-be1067 updates [puppet] - https://gerrit.wikimedia.org/r/719384 (https://phabricator.wikimedia.org/T285808) (owner: RobH)
[23:47:11] <wikibugs>	 (PS3) Legoktm: mailman: Drop listinfo files [puppet] - https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[23:50:57] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (RobH)
[23:52:56] <wikibugs>	 (CR) Legoktm: [C: +2] mailman: Drop listinfo files [puppet] - https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[23:54:30] <legoktm>	 Amir1: hm, puppet failed
[23:54:58] <legoktm>	 https://phabricator.wikimedia.org/P17251
[23:55:01] <Amir1>	 legoktm: hmm, how so?
[23:55:27] <Amir1>	 Hmm. Let me check 
[23:56:39] <wikibugs>	 (PS1) Legoktm: mailman: Try to fix 4869d91b0beb92 [puppet] - https://gerrit.wikimedia.org/r/719386
[23:56:45] <legoktm>	 Amir1: ^ how's that look?
[23:57:11] <wikibugs>	 (CR) Ladsgroup: [C: +1] mailman: Try to fix 4869d91b0beb92 [puppet] - https://gerrit.wikimedia.org/r/719386 (owner: Legoktm)
[23:57:18] * legoktm waits for pcc
[23:57:30] <Amir1>	 legoktm: yeah much better, if it's one host, we can do this instead
[23:57:40] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31025/console" [puppet] - https://gerrit.wikimedia.org/r/719386 (owner: Legoktm)
[23:58:00] <wikibugs>	 (CR) Legoktm: [V: +1 C: +2] mailman: Try to fix 4869d91b0beb92 [puppet] - https://gerrit.wikimedia.org/r/719386 (owner: Legoktm)
[23:58:18] <Amir1>	 in an hour, I'll make a patch to remove the stuff
[23:58:53] <legoktm>	 Notice: /Stage[main]/Mailman::Webui/File[/etc/mailman]: Not removing directory; use 'force' to override
[23:58:53] <legoktm>	 Notice: /Stage[main]/Mailman::Webui/File[/etc/mailman]/ensure: removed
[23:59:02] <legoktm>	 let me just do it manually
[23:59:17] <legoktm>	 I already checked the directory just to make sure it had nothing useful, it did not
[23:59:29] <Amir1>	 *pretends to be shocked*