[00:02:21] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:27] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [00:06:19] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [00:12:03] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [00:15:51] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [00:21:31] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the mos [00:21:31] rticles for January 1, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Wikifeeds [00:25:05] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:25:21] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [00:33:07] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:34:55] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: [00:34:55] {domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [00:36:47] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [00:42:31] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned th [00:42:31] cted status 503 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random 
article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [00:46:19] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [00:51:59] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featu [00:51:59] e data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [00:56:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:58:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:01:11] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:02:57] Puppet, Continuous-Integration-Infrastructure, Infrastructure-Foundations: Puppet failure on integration-puppetmaster-02.integration.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T290422 (Andrew) Strange -- after I made my changes last week I doublechecked that puppet was working properl... [01:16:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:24:21] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [01:25:53] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:30:07] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [01:33:57] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [01:39:41] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [01:41:33] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy
https://wikitech.wikimedia.org/wiki/Wikifeeds [01:47:15] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featu [01:47:15] e data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [01:49:07] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [01:54:51] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [01:56:45] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [01:57:35] Puppet, Continuous-Integration-Infrastructure, Infrastructure-Foundations: Puppet failure on integration-puppetmaster-02.integration.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T290422 (Andrew) Open→Resolved a:Andrew I still don't know what this was but I regenerated all th...
[02:02:01] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:21] (PS1) TrainBranchBot: Branch commit for wmf/1.37.0-wmf.22 [core] (wmf/1.37.0-wmf.22) - https://gerrit.wikimedia.org/r/719158 [02:07:25] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.37.0-wmf.22 [core] (wmf/1.37.0-wmf.22) - https://gerrit.wikimedia.org/r/719158 (owner: TrainBranchBot) [02:08:15] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [02:10:09] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [02:21:33] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 ret [02:21:33] e unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read article [02:21:33] nuary 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [02:24:03]
(Merged) jenkins-bot: Branch commit for wmf/1.37.0-wmf.22 [core] (wmf/1.37.0-wmf.22) - https://gerrit.wikimedia.org/r/719158 (owner: TrainBranchBot) [02:26:43] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:36:51] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [02:50:11] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [02:52:03] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [02:59:41] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200): /{doma [02:59:41] age/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [03:02:49] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:13:05] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[03:18:47] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 ret [03:18:47] e unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [03:25:37] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:37:53] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [03:43:35] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [03:56:53] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [04:01:45] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:02:37] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 ret [04:02:37] e unexpected 
status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [04:08:23] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [04:14:09] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [04:26:33] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:31:07] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [04:36:39] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [04:36:45] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:40:23] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [04:45:57] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 [04:45:57] 
ng: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [04:49:39] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [04:55:11] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned th [04:55:11] cted status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [04:57:03] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [05:01:41] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:02:37] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpecte [05:02:37] 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [05:06:25] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [05:08:32] (PS1) Marostegui: Revert "db2090: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/719167 [05:14:55] !log Optimize kawiki.flaggedtemplates in eqiad T290057 [05:14:59] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:00] T290057: Optimize flaggedtemplates tables in production - https://phabricator.wikimedia.org/T290057 [05:15:20] !log Optimize vecwiki.flaggedtemplates in eqiad T290057 [05:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:26] !log Optimize eowiki.flaggedtemplates in eqiad T290057 [05:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:59] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the mos [05:15:59] rticles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News conten [05:15:59] ed the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [05:26:09] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:47:55] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [05:59:05] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 
15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 r [05:59:05] the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [06:01:35] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:04:09] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:10:15] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [06:10:37] SRE, Commons, Traffic-Icebox, Wikidata, and 4 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517 (eranroz) [06:11:26] SRE, Commons, Traffic-Icebox, Wikidata, and 4 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517 (eranroz) This also applies to wikidata.
[06:15:59] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: [06:15:59] {domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [06:23:35] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [06:26:03] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:05] (CR) Marostegui: [C: +2] Revert "db2090: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/719167 (owner: Marostegui) [06:32:59] SRE, SRE-Access-Requests: Replace christinedk old ssh public key with a new one - https://phabricator.wikimedia.org/T290279 (ChristineDeKock) Thanks! It now works.
[06:34:43] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [06:38:20] SRE, Wikifeeds, serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (elukey) From 2021-09-04 restbase has been reporting a lot of connection errors (to what it seems Wikifeeds judging from the URI): https://logstash.wikimedia.org/goto... [06:38:29] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:47:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2090 (re)pooling @ 5%: Slowly repool T288803', diff saved to https://phabricator.wikimedia.org/P17228 and previous config saved to /var/cache/conftool/dbconfig/20210907-064711-root.json [06:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:18] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [07:02:02] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:02:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2090 (re)pooling @ 10%: Slowly repool T288803', diff saved to https://phabricator.wikimedia.org/P17229 and previous config saved to /var/cache/conftool/dbconfig/20210907-070215-root.json [07:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:21] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [07:04:53] (CR) Filippo Giunchedi: [C: +2]
clinic-duty: add equinix maint support [software] - https://gerrit.wikimedia.org/r/717100 (owner: Filippo Giunchedi) [07:07:03] !log kormat@cumin1001 dbctl commit (dc=all): 'Fixing db2118's pooling config T288244', diff saved to https://phabricator.wikimedia.org/P17230 and previous config saved to /var/cache/conftool/dbconfig/20210907-070702-kormat.json [07:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:08] T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244 [07:07:24] !log kormat@cumin1001 dbctl commit (dc=all): 'db2118 (re)pooling @ 25%: reimage to buster (now with fixed pool config) T288244', diff saved to https://phabricator.wikimedia.org/P17231 and previous config saved to /var/cache/conftool/dbconfig/20210907-070724-kormat.json [07:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:51] SRE, Wikifeeds, serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (MSantos) The source of the failure could be this one in Wikifeeds https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-2021.09.07?id=jmADv3sB9aenX452C... [07:13:08] !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [07:13:08] !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[07:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:42] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [07:17:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2090 (re)pooling @ 25%: Slowly repool T288803', diff saved to https://phabricator.wikimedia.org/P17232 and previous config saved to /var/cache/conftool/dbconfig/20210907-071719-root.json [07:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:25] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [07:20:31] (PS1) Marostegui: Revert "mariadb: Set core sections to unidir replication." [puppet] - https://gerrit.wikimedia.org/r/719168 [07:21:17] (CR) Marostegui: [C: -2] "Wait for the switch back" [puppet] - https://gerrit.wikimedia.org/r/719168 (owner: Marostegui) [07:22:28] !log kormat@cumin1001 dbctl commit (dc=all): 'db2118 (re)pooling @ 50%: reimage to buster (now with fixed pool config) T288244', diff saved to https://phabricator.wikimedia.org/P17233 and previous config saved to /var/cache/conftool/dbconfig/20210907-072227-kormat.json [07:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:34] T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244 [07:25:00] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:25:23] SRE, Wikifeeds, serviceops: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (elukey) Thanks @MSantos Update from IRC: me and @JMeybohm noticed that in the k8s wikifeeds graphs, the rise of the
errors (Sept 4th ~02:30 UTC) corresponded to a b... [07:27:19] SRE, SRE-Access-Requests: Replace christinedk old ssh public key with a new one - https://phabricator.wikimedia.org/T290279 (fgiunchedi) Open→Resolved a:fgiunchedi [07:32:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2090 (re)pooling @ 50%: Slowly repool T288803', diff saved to https://phabricator.wikimedia.org/P17234 and previous config saved to /var/cache/conftool/dbconfig/20210907-073222-root.json [07:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:29] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [07:34:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Start to pool db2090 into API T288803', diff saved to https://phabricator.wikimedia.org/P17235 and previous config saved to /var/cache/conftool/dbconfig/20210907-073436-marostegui.json [07:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:24] !log +100G for prometheus/k8s codfw [07:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:31] !log kormat@cumin1001 dbctl commit (dc=all): 'db2118 (re)pooling @ 75%: reimage to buster (now with fixed pool config) T288244', diff saved to https://phabricator.wikimedia.org/P17236 and previous config saved to /var/cache/conftool/dbconfig/20210907-073731-kormat.json [07:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:36] T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244 [07:46:56] (CR) Filippo Giunchedi: [C: +1] clinic-duty: Minor DOM handling clean up [software] - https://gerrit.wikimedia.org/r/717653 (owner: Krinkle) [07:47:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2090 (re)pooling @ 75%: Slowly repool T288803', diff saved to https://phabricator.wikimedia.org/P17237 and previous config saved to
/var/cache/conftool/dbconfig/20210907-074726-root.json [07:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:31] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [07:49:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'More weight for db2090 into API T288803', diff saved to https://phabricator.wikimedia.org/P17238 and previous config saved to /var/cache/conftool/dbconfig/20210907-074901-marostegui.json [07:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:35] !log kormat@cumin1001 dbctl commit (dc=all): 'db2118 (re)pooling @ 100%: reimage to buster (now with fixed pool config) T288244', diff saved to https://phabricator.wikimedia.org/P17239 and previous config saved to /var/cache/conftool/dbconfig/20210907-075235-kormat.json [07:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:40] T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244 [07:53:04] (CR) Filippo Giunchedi: [V: +2 C: +2] "Thank you for the reviews!"
[debs/python-eventlet] (debian/bullseye) - https://gerrit.wikimedia.org/r/715199 (https://phabricator.wikimedia.org/T283714) (owner: Filippo Giunchedi) [08:02:00] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:02:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2090 (re)pooling @ 100%: Slowly repool T288803', diff saved to https://phabricator.wikimedia.org/P17240 and previous config saved to /var/cache/conftool/dbconfig/20210907-080230-root.json [08:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:35] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [08:07:11] SRE, SRE-swift-storage, Patch-For-Review, User-fgiunchedi: Python 3's eventlet.green getaddrinfo timeout in Bullseye - https://phabricator.wikimedia.org/T283714 (fgiunchedi) Open→Resolved a:fgiunchedi Package uploaded and upgraded on thanos-fe hosts, resolving [08:09:47] (CR) Klausman: Add revscoring-editquality as first ml-service to helmfile.d (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: Elukey) [08:16:08] (PS1) MVernon: pc1008: remove puppet entries for pc1008 [puppet] - https://gerrit.wikimedia.org/r/719223 (https://phabricator.wikimedia.org/T289119) [08:19:37] (CR) Jcrespo: [C: +1] pc1008: remove puppet entries for pc1008 [puppet] - https://gerrit.wikimedia.org/r/719223 (https://phabricator.wikimedia.org/T289119) (owner: MVernon) [08:22:52] (CR) Awight: Set template namespace for code mirror line numbering (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: WMDE-Fisch) [08:22:59] (PS3) Awight: Set template namespace for code mirror line numbering [mediawiki-config] - 
https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: WMDE-Fisch) [08:23:05] (PS2) Filippo Giunchedi: sslcert: additional search paths for certificates [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) [08:24:58] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:24:59] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [08:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:07] (CR) Filippo Giunchedi: "Thank you for the reviews! I've removed the POC status since code seems fine as-is, I've tested this in Pontoon and it works as expected. " [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: Filippo Giunchedi) [08:25:28] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:25:29] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[08:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:20] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:29] (CR) Marostegui: [C: +1] pc1008: remove puppet entries for pc1008 [puppet] - https://gerrit.wikimedia.org/r/719223 (https://phabricator.wikimedia.org/T289119) (owner: MVernon) [08:29:31] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31015/console" [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: Filippo Giunchedi) [08:29:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'More weight for db2090 into API T288803', diff saved to https://phabricator.wikimedia.org/P17241 and previous config saved to /var/cache/conftool/dbconfig/20210907-082952-marostegui.json [08:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:57] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [08:31:34] !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc1008.eqiad.wmnet [08:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:38] (CR) MVernon: [C: +2] pc1008: remove puppet entries for pc1008 [puppet] - https://gerrit.wikimedia.org/r/719223 (https://phabricator.wikimedia.org/T289119) (owner: MVernon) [08:36:56] (PS2) Filippo Giunchedi: Add patches to handle mmkubernetes and omfwd stats [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/715457 (https://phabricator.wikimedia.org/T210137) [08:37:12] (CR) Filippo Giunchedi: Add patches to handle mmkubernetes and omfwd stats (1 comment) 
[debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/715457 (https://phabricator.wikimedia.org/T210137) (owner: Filippo Giunchedi) [08:37:20] (PS1) Elukey: conftool-data: add worker nodes to ml_serve [puppet] - https://gerrit.wikimedia.org/r/719225 (https://phabricator.wikimedia.org/T289835) [08:37:22] (PS1) Elukey: conftool-data: add new inference discovery service [puppet] - https://gerrit.wikimedia.org/r/719226 (https://phabricator.wikimedia.org/T289835) [08:37:49] (CR) Filippo Giunchedi: [C: +2] Add patches to handle mmkubernetes and omfwd stats [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/715457 (https://phabricator.wikimedia.org/T210137) (owner: Filippo Giunchedi) [08:42:12] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc1008.eqiad.wmnet [08:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:47] !log removing pc1008 from tendril and zarcillo T289119 [08:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:52] T289119: decommission pc1008.eqiad.wmnet - https://phabricator.wikimedia.org/T289119 [08:46:34] SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission pc1007.eqiad.wmnet. 
- https://phabricator.wikimedia.org/T289118 (MatthewVernon) This host is ready for DC-Ops to decommission [08:51:13] !log removing pc1008 from orchestrator T289119 [08:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:17] T289119: decommission pc1008.eqiad.wmnet - https://phabricator.wikimedia.org/T289119 [08:53:01] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [08:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:37] ops-eqiad, DC-Ops, decommission-hardware: decommission pc1008.eqiad.wmnet - https://phabricator.wikimedia.org/T289119 (MatthewVernon) a:MatthewVernon→wiki_willy [08:54:26] ops-eqiad, DC-Ops, decommission-hardware: decommission pc1008.eqiad.wmnet - https://phabricator.wikimedia.org/T289119 (MatthewVernon) This host is ready for DC-Ops to decommission [08:57:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:34] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:02:55] (PS1) Elukey: Add inference eqiad service record [dns] - https://gerrit.wikimedia.org/r/719227 (https://phabricator.wikimedia.org/T289835) [09:04:03] (CR) Elukey: "Record already added in netbox (and cookbook executed)" [dns] - https://gerrit.wikimedia.org/r/719227 (https://phabricator.wikimedia.org/T289835) (owner: Elukey) [09:05:00] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: 
Servers wdqs2003.codfw.wmnet, wdqs20 [09:05:00] .wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:08:11] (03CR) 10Volans: "All good on Netbox (assigned+reserved), thanks!" [dns] - 10https://gerrit.wikimedia.org/r/719227 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [09:08:36] (03PS1) 10MVernon: pc1009: remove puppet entries for pc1009 [puppet] - 10https://gerrit.wikimedia.org/r/719228 (https://phabricator.wikimedia.org/T289120) [09:09:16] (03CR) 10Vgutierrez: sslcert: additional search paths for certificates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: 10Filippo Giunchedi) [09:12:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={jmx_wdqs_blazegraph,mysql-parsercache} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:13:25] (03CR) 10Marostegui: [C: 03+1] pc1009: remove puppet entries for pc1009 [puppet] - 10https://gerrit.wikimedia.org/r/719228 (https://phabricator.wikimedia.org/T289120) (owner: 10MVernon) [09:15:35] (03CR) 10MVernon: [C: 03+2] pc1009: remove puppet entries for pc1009 [puppet] - 10https://gerrit.wikimedia.org/r/719228 (https://phabricator.wikimedia.org/T289120) (owner: 10MVernon) [09:16:06] !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc1009.eqiad.wmnet [09:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:06] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers 
wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs20 [09:19:06] .wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:21:14] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:23:38] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:25:06] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:25:16] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc1009.eqiad.wmnet [09:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:10] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 3.550 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:26:13] !log removing pc1009 from tendril and zarcillo T289120 [09:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:17] T289120: decommission pc1009.eqiad.wmnet - https://phabricator.wikimedia.org/T289120 [09:27:16] RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:31:29] (03PS1) 10Filippo Giunchedi: Try restarting rsyslog on package installation [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/719231 (https://phabricator.wikimedia.org/T210137) [09:37:06] (03PS1) 10Btullis: Add a promehtheus scrape target for the aqs_new role [puppet] - 10https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) [09:40:42] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:40:53] (03CR) 10Btullis: "Adding @Filippo for review, as this is a prometheus scrape target change." [puppet] - 10https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: 10Btullis) [09:42:53] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31016/console" [puppet] - 10https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: 10Btullis) [09:43:21] (03PS1) 10Volans: dhcp: small refactor [software/spicerack] - 10https://gerrit.wikimedia.org/r/719234 [09:43:28] (03PS1) 10Elukey: istio: change ingress gateway nodeport to 4688 [deployment-charts] - 10https://gerrit.wikimedia.org/r/719235 (https://phabricator.wikimedia.org/T289835) [09:44:43] (03CR) 10Btullis: Add a promehtheus scrape target for the aqs_new role [puppet] - 10https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: 10Btullis) [09:45:51] (03PS1) 10MMandere: varnish: Remove Vagrant test scripts [puppet] - 10https://gerrit.wikimedia.org/r/719236 (https://phabricator.wikimedia.org/T286639) [09:46:10] (03PS5) 10Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) [09:46:15] !log removing pc1009 from orchestrator T289120 [09:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:20] T289120: decommission pc1009.eqiad.wmnet - https://phabricator.wikimedia.org/T289120 [09:46:20] (03PS2) 10Btullis: Add a promehtheus scrape target for the aqs_new role [puppet] - 10https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) [09:46:25] (03CR) 10Filippo Giunchedi: [V: 03+1] sslcert: additional search paths for certificates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716370 
(https://phabricator.wikimedia.org/T290261) (owner: Filippo Giunchedi) [09:46:44] (PS3) Btullis: Add a promehtheus scrape target for the aqs_next role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) [09:48:21] (CR) Filippo Giunchedi: "LGTM, see inline for nits" [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis) [09:48:25] (CR) Filippo Giunchedi: [C: +1] Add a promehtheus scrape target for the aqs_next role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis) [09:48:42] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31017/console" [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis) [09:48:50] ops-eqiad, DC-Ops, decommission-hardware, Patch-For-Review: decommission pc1009.eqiad.wmnet - https://phabricator.wikimedia.org/T289120 (MatthewVernon) a:wiki_willy [09:49:00] ops-eqiad, DC-Ops, decommission-hardware, Patch-For-Review: decommission pc1009.eqiad.wmnet - https://phabricator.wikimedia.org/T289120 (MatthewVernon) This host is ready for DC-Ops to decommission [09:49:51] (PS4) Btullis: Add a promehtheus scrape target for the aqs_next role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) [09:50:32] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/719119 (https://phabricator.wikimedia.org/T282787) (owner: MMandere) [09:50:36] (CR) Btullis: "Thanks for spotting that." 
[puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis) [09:50:54] (CR) Elukey: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis) [09:51:57] (PS5) Btullis: Add a prometheus scrape target for the aqs_next role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) [09:51:59] (PS1) MVernon: pc1010: remove puppet entries for pc1010 [puppet] - https://gerrit.wikimedia.org/r/719237 (https://phabricator.wikimedia.org/T289122) [09:52:07] (CR) Btullis: Add a prometheus scrape target for the aqs_next role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis) [09:52:10] (CR) Filippo Giunchedi: [C: +1] Add a prometheus scrape target for the aqs_next role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis) [09:52:39] (CR) Jbond: [C: +1] ipmi: add status and reboot capabilities [software/spicerack] - https://gerrit.wikimedia.org/r/717251 (owner: Volans) [09:54:18] (CR) Btullis: [C: +2] Add a prometheus scrape target for the aqs_next role [puppet] - https://gerrit.wikimedia.org/r/719233 (https://phabricator.wikimedia.org/T249755) (owner: Btullis) [09:58:09] (CR) Kormat: [C: +1] pc1010: remove puppet entries for pc1010 [puppet] - https://gerrit.wikimedia.org/r/719237 (https://phabricator.wikimedia.org/T289122) (owner: MVernon) [10:01:48] (CR) MVernon: [C: +2] pc1010: remove puppet entries for pc1010 [puppet] - https://gerrit.wikimedia.org/r/719237 (https://phabricator.wikimedia.org/T289122) (owner: MVernon) [10:02:02] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:44] !log 
mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc1010.eqiad.wmnet [10:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:23] (CR) MMandere: [C: +2] puppetmaster: Add drmrs DC Site [puppet] - https://gerrit.wikimedia.org/r/719119 (https://phabricator.wikimedia.org/T282787) (owner: MMandere) [10:10:19] (CR) Jbond: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/719131 (owner: Volans) [10:10:51] (PS1) Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) [10:10:55] (CR) Jbond: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/719135 (owner: Volans) [10:11:50] (CR) Elukey: [C: +2] istio: change ingress gateway nodeport to 4688 [deployment-charts] - https://gerrit.wikimedia.org/r/719235 (https://phabricator.wikimedia.org/T289835) (owner: Elukey) [10:11:57] (CR) jerkins-bot: [V: -1] role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) (owner: Elukey) [10:13:23] (CR) Thiemo Kreuz (WMDE): [C: +1] Set template namespace for code mirror line numbering [mediawiki-config] - https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: WMDE-Fisch) [10:13:26] (CR) Jbond: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/719234 (owner: Volans) [10:15:26] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc1010.eqiad.wmnet [10:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:00] (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/719140 (owner: Volans) [10:16:03] (PS1) MVernon: pc2008: remove puppet entries for pc2008 [puppet] - 
10https://gerrit.wikimedia.org/r/719241 (https://phabricator.wikimedia.org/T289115) [10:17:06] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/719141 (owner: 10Volans) [10:22:38] !log removing pc1010 from tendril and zarcillo T289122 [10:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:43] T289122: decommission pc1010.eqiad.wmnet - https://phabricator.wikimedia.org/T289122 [10:23:58] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:00] !log removing pc1010 from orchestrator T289122 [10:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:52] PROBLEM - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views retur [10:27:52] unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpec [10:27:52] us 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:28:06] PROBLEM - aqs 
endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views retur [10:28:06] unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpec [10:28:06] us 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:28:18] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission pc1010.eqiad.wmnet - https://phabricator.wikimedia.org/T289122 (10MatthewVernon) a:05MatthewVernon→03wiki_willy [10:28:20] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views retur [10:28:20] unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views 
returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpec [10:28:20] us 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:28:30] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views retur [10:28:30] unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpec [10:28:30] us 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:29:11] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission pc1010.eqiad.wmnet - https://phabricator.wikimedia.org/T289122 (10MatthewVernon) This host is ready for DC-Ops to decommission [10:29:18] ACKNOWLEDGEMENT - aqs endpoints health on aqs1011 is CRITICAL: 
/analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page vie [10:29:18] ned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the [10:29:18] ted status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Hnowlan Tables being reinitialised https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:29:18] ACKNOWLEDGEMENT - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page vie [10:29:19] ned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is 
CRITICAL: Test Get pagecounts returned the [10:29:19] ted status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Hnowlan Tables being reinitialised https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:29:21] ACKNOWLEDGEMENT - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page vie [10:29:22] ned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the [10:29:22] ted status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Hnowlan Tables being reinitialised https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:29:24] ACKNOWLEDGEMENT - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page vie 
[10:29:25] ned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 404 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the [10:29:25] ted status 404 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Hnowlan Tables being reinitialised https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:29:38] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: commissioning aqs_new hosts [10:29:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: commissioning aqs_new hosts [10:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:47] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on 6 hosts with reason: commissioning aqs_new hosts [10:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:52] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on 6 hosts with reason: commissioning aqs_new hosts [10:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:12] apologies for the noise :) [10:31:15] (03PS1) 10MVernon: pc2009: remove puppet entries for pc2009 [puppet] - 10https://gerrit.wikimedia.org/r/719243 (https://phabricator.wikimedia.org/T289116) [10:32:24] (03CR) 10Volans: [C: 03+2] prospector: disable E203 for pep-8 
over black [software/spicerack] - https://gerrit.wikimedia.org/r/719140 (owner: Volans) [10:32:37] (CR) Volans: [C: +2] style: if no local modifications check last commit [software/spicerack] - https://gerrit.wikimedia.org/r/719141 (owner: Volans) [10:32:39] (PS3) Jbond: puppet_agent_stats: add catalog version to prom metricts [puppet] - https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) [10:32:47] (CR) Volans: [C: +2] ipmi: add status and reboot capabilities [software/spicerack] - https://gerrit.wikimedia.org/r/717251 (owner: Volans) [10:32:58] (PS2) MVernon: pc2008: remove puppet entries for pc2008 [puppet] - https://gerrit.wikimedia.org/r/719241 (https://phabricator.wikimedia.org/T289115) [10:33:18] (CR) Kormat: [C: +1] pc2008: remove puppet entries for pc2008 [puppet] - https://gerrit.wikimedia.org/r/719241 (https://phabricator.wikimedia.org/T289115) (owner: MVernon) [10:34:09] (CR) Kormat: [C: +1] pc2009: remove puppet entries for pc2009 [puppet] - https://gerrit.wikimedia.org/r/719243 (https://phabricator.wikimedia.org/T289116) (owner: MVernon) [10:34:31] (CR) MVernon: [C: +2] pc2008: remove puppet entries for pc2008 [puppet] - https://gerrit.wikimedia.org/r/719241 (https://phabricator.wikimedia.org/T289115) (owner: MVernon) [10:35:38] !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc2008.codfw.wmnet [10:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:30] (Merged) jenkins-bot: prospector: disable E203 for pep-8 over black [software/spicerack] - https://gerrit.wikimedia.org/r/719140 (owner: Volans) [10:37:48] (Merged) jenkins-bot: style: if no local modifications check last commit [software/spicerack] - https://gerrit.wikimedia.org/r/719141 (owner: Volans) [10:38:52] (Merged) jenkins-bot: ipmi: add status and reboot capabilities [software/spicerack] - 
10https://gerrit.wikimedia.org/r/717251 (owner: 10Volans) [10:40:11] (03PS2) 10Volans: netbox: add getter for the asset tag mgmt FQDN [software/spicerack] - 10https://gerrit.wikimedia.org/r/719131 [10:41:02] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:41:51] (03PS1) 10MVernon: pc2010: remove puppet entries for pc2010 [puppet] - 10https://gerrit.wikimedia.org/r/719244 (https://phabricator.wikimedia.org/T289117) [10:46:38] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc2008.codfw.wmnet [10:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:39] (03CR) 10Volans: [C: 03+2] netbox: add getter for the asset tag mgmt FQDN [software/spicerack] - 10https://gerrit.wikimedia.org/r/719131 (owner: 10Volans) [10:48:04] (03PS2) 10Volans: dhcp: small refactor [software/spicerack] - 10https://gerrit.wikimedia.org/r/719234 [10:49:31] !log removing pc2008 from tendril and zarcillo T289115 [10:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:36] T289115: decommission pc2008.codfw.wmnet - https://phabricator.wikimedia.org/T289115 [10:50:27] (03PS4) 10Volans: puppet_agent_stats: add catalog version to prom metrics [puppet] - 10https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [10:50:59] (03CR) 10Volans: "Sorry I had forgot to hit sent on the datapoints, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [10:51:11] !log removing pc2008 from orchestrator T289115 [10:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:56] (03Merged) 10jenkins-bot: netbox: add getter for the asset tag mgmt FQDN [software/spicerack] - 10https://gerrit.wikimedia.org/r/719131 (owner: 10Volans) [10:55:31] 
10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2008.codfw.wmnet - https://phabricator.wikimedia.org/T289115 (10MatthewVernon) a:05MatthewVernon→03Papaul [10:55:35] 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2008.codfw.wmnet - https://phabricator.wikimedia.org/T289115 (10MatthewVernon) This host is ready for DC-Ops to decommission [10:57:17] (03CR) 10Jbond: puppet_agent_stats: add catalog version to prom metrics (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [10:57:52] (03CR) 10Volans: [C: 03+2] dhcp: small refactor [software/spicerack] - 10https://gerrit.wikimedia.org/r/719234 (owner: 10Volans) [10:58:16] (03PS2) 10MVernon: pc2009: remove puppet entries for pc2009 [puppet] - 10https://gerrit.wikimedia.org/r/719243 (https://phabricator.wikimedia.org/T289116) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210907T1100). [11:00:05] No Gerrit patches in the queue for this window AFAICS. [11:00:16] indeed, nothing to do :/ [11:03:28] (03Merged) 10jenkins-bot: dhcp: small refactor [software/spicerack] - 10https://gerrit.wikimedia.org/r/719234 (owner: 10Volans) [11:09:01] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10jbond) In relation to puppet i think we could look again at creating a puppet [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/143788/ |... 
[11:13:25] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:02] 10SRE, 10MW-on-K8s, 10serviceops: Evaluate istio as an ingress for production usage - https://phabricator.wikimedia.org/T287007 (10JMeybohm) [11:16:27] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10NForrester) I can confirm that SSH access is working and the initial kerberos password has been changed. Thank you kindly for... [11:16:55] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (10KLevan) With Nahid's help we have set up the Kerberos password and everything is working fine. Thank you all for your work. [11:18:48] urbanecm: I'll jump in with a minor patch, unless there's other activity? [11:19:04] awight: none that I'd be aware of -- go ahead. [11:19:46] :-) [11:22:56] (03PS1) 10Awight: Change line numbers default to null [extensions/CodeMirror] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719170 (https://phabricator.wikimedia.org/T290226) [11:23:10] (03CR) 10Awight: [C: 03+2] "Deployment." 
[extensions/CodeMirror] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719170 (https://phabricator.wikimedia.org/T290226) (owner: 10Awight) [11:23:49] (03PS4) 10Awight: Set template namespace for code mirror line numbering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: 10WMDE-Fisch) [11:23:56] (03CR) 10jerkins-bot: [V: 04-1] Change line numbers default to null [extensions/CodeMirror] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719170 (https://phabricator.wikimedia.org/T290226) (owner: 10Awight) [11:23:58] (03CR) 10Awight: [C: 03+2] "Deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: 10WMDE-Fisch) [11:24:49] (03Merged) 10jenkins-bot: Set template namespace for code mirror line numbering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717192 (https://phabricator.wikimedia.org/T290226) (owner: 10WMDE-Fisch) [11:25:09] (03CR) 10Awight: [C: 03+2] "Deployment." 
[extensions/CodeMirror] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719170 (https://phabricator.wikimedia.org/T290226) (owner: 10Awight) [11:25:59] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:43] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:717192|Set template namespace for code mirror line numbering (T290226)]] (duration: 00m 59s) [11:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:49] T290226: Default namespace for line numbering can not be unset - https://phabricator.wikimedia.org/T290226 [11:31:07] (03Merged) 10jenkins-bot: Change line numbers default to null [extensions/CodeMirror] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/719170 (https://phabricator.wikimedia.org/T290226) (owner: 10Awight) [11:33:46] !log awight@deploy1002 Synchronized php-1.37.0-wmf.21/extensions/CodeMirror/extension.json: Backport: [[gerrit:719170|Change line numbers default to null (T290226)]] (duration: 00m 59s) [11:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:51] T290226: Default namespace for line numbering can not be unset - https://phabricator.wikimedia.org/T290226 [11:36:12] EU vegan bacon complete. [11:36:20] !log EU backport complete [11:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:29] RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:39:30] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thanks for the datapoints!" 
[puppet] - 10https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [11:40:01] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:40:53] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:41:01] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [11:45:54] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10fgiunchedi) [11:46:29] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (10fgiunchedi) [11:46:47] !log btullis@cumin1001 START - Cookbook sre.hosts.remove-downtime for 6 hosts [11:46:49] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 6 hosts [11:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:06] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (10fgiunchedi) 05Open→03Resolved I'm glad things are working @NForrester! Resolving [11:47:12] 10SRE, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (10fgiunchedi) 05Open→03Resolved Great to hear @KLevan ! 
Resolving [12:01:19] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:25] (03CR) 10Elukey: [C: 03+2] conftool-data: add worker nodes to ml_serve [puppet] - 10https://gerrit.wikimedia.org/r/719225 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [12:14:22] (03PS2) 10Elukey: conftool-data: add new inference discovery service [puppet] - 10https://gerrit.wikimedia.org/r/719226 (https://phabricator.wikimedia.org/T289835) [12:14:24] (03PS2) 10Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - 10https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) [12:14:34] (03CR) 10Kormat: [C: 03+1] pc2010: remove puppet entries for pc2010 [puppet] - 10https://gerrit.wikimedia.org/r/719244 (https://phabricator.wikimedia.org/T289117) (owner: 10MVernon) [12:14:44] (03PS6) 10Jgiannelos: Configure event stream for map tile state change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) [12:15:13] (03CR) 10jerkins-bot: [V: 04-1] role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - 10https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [12:15:15] (03CR) 10Elukey: [C: 03+2] conftool-data: add new inference discovery service [puppet] - 10https://gerrit.wikimedia.org/r/719226 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [12:19:42] (03PS3) 10Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - 10https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) [12:19:55] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 
[12:21:38] (03CR) 10Kormat: [C: 03+1] pc2009: remove puppet entries for pc2009 [puppet] - 10https://gerrit.wikimedia.org/r/719243 (https://phabricator.wikimedia.org/T289116) (owner: 10MVernon) [12:24:41] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'fix s1 weights T288594', diff saved to https://phabricator.wikimedia.org/P17246 and previous config saved to /var/cache/conftool/dbconfig/20210907-122708-marostegui.json [12:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:15] T288594: Pre DC switchover codfw -> eqiad DB work - https://phabricator.wikimedia.org/T288594 [12:27:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'fix s1 weights T288594', diff saved to https://phabricator.wikimedia.org/P17247 and previous config saved to /var/cache/conftool/dbconfig/20210907-122747-marostegui.json [12:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:32] (03PS1) 10MVernon: wmf-config: remove old parsercache hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719255 (https://phabricator.wikimedia.org/T289115) [12:29:50] (03Abandoned) 10MVernon: wmf-config: remove old parsercache hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719255 (https://phabricator.wikimedia.org/T289115) (owner: 10MVernon) [12:35:13] (03PS1) 10MVernon: wmf-config: remove old parsercache hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719257 (https://phabricator.wikimedia.org/T289115) [12:36:22] (03CR) 10Muehlenhoff: [C: 03+2] Add emacs-nox to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/377721 (owner: 10Muehlenhoff) [12:37:26] (03CR) 10Kormat: [C: 03+1] wmf-config: remove old parsercache hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719257 
(https://phabricator.wikimedia.org/T289115) (owner: 10MVernon) [12:39:00] (03CR) 10MVernon: [C: 03+2] wmf-config: remove old parsercache hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719257 (https://phabricator.wikimedia.org/T289115) (owner: 10MVernon) [12:39:43] (03Merged) 10jenkins-bot: wmf-config: remove old parsercache hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719257 (https://phabricator.wikimedia.org/T289115) (owner: 10MVernon) [12:43:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline." [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/719231 (https://phabricator.wikimedia.org/T210137) (owner: 10Filippo Giunchedi) [12:43:40] (03CR) 10Elukey: [C: 03+2] Add inference eqiad service record [dns] - 10https://gerrit.wikimedia.org/r/719227 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey) [12:44:55] (03CR) 10MVernon: [C: 03+2] pc2009: remove puppet entries for pc2009 [puppet] - 10https://gerrit.wikimedia.org/r/719243 (https://phabricator.wikimedia.org/T289116) (owner: 10MVernon) [12:45:28] !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc2009.codfw.wmnet [12:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:42] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: frdb2003: configure RAID, install OS, and add to fr-analytics db replication - https://phabricator.wikimedia.org/T290484 (10Jgreen) [12:46:16] (03CR) 10MVernon: [C: 03+2] pc2010: remove puppet entries for pc2010 [puppet] - 10https://gerrit.wikimedia.org/r/719244 (https://phabricator.wikimedia.org/T289117) (owner: 10MVernon) [12:48:11] jouncebot: now [12:48:11] No deployments scheduled for the next 3 hour(s) and 11 minute(s) [12:51:42] !log mvernon@deploy1002 Synchronized wmf-config/ProductionServices.php: Remove old decommissioned pc hosts T284825 (duration: 01m 02s) [12:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:46] 
T284825: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 [12:53:31] (03CR) 10Vgutierrez: sslcert: additional search paths for certificates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: 10Filippo Giunchedi) [12:59:24] (03PS1) 10Filippo Giunchedi: clinic-duty: test individual properties [software] - 10https://gerrit.wikimedia.org/r/719259 [12:59:38] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc2009.codfw.wmnet [12:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:15] (03CR) 10Filippo Giunchedi: [C: 03+2] clinic-duty: test individual properties [software] - 10https://gerrit.wikimedia.org/r/719259 (owner: 10Filippo Giunchedi) [13:02:31] (03PS2) 10MVernon: pc2010: remove puppet entries for pc2010 [puppet] - 10https://gerrit.wikimedia.org/r/719244 (https://phabricator.wikimedia.org/T289117) [13:02:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'fix s8 weights T288594', diff saved to https://phabricator.wikimedia.org/P17248 and previous config saved to /var/cache/conftool/dbconfig/20210907-130244-marostegui.json [13:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:49] T288594: Pre DC switchover codfw -> eqiad DB work - https://phabricator.wikimedia.org/T288594 [13:05:15] (03PS1) 10Volans: icinga: remove deprecated Icinga class [software/spicerack] - 10https://gerrit.wikimedia.org/r/719260 [13:05:18] (03PS1) 10Muehlenhoff: Remove obsolete java::security [puppet] - 10https://gerrit.wikimedia.org/r/719261 (https://phabricator.wikimedia.org/T282454) [13:05:57] 10SRE, 10SRE-OnFire, 10observability, 10User-jbond: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 (10cmooney) Thanks for opening this. I am not an expert on this at all, but was involved in the deployment so had a little look. 
The errors are odd, I've tested here... [13:07:53] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:07:59] 10SRE, 10ops-codfw, 10DBA: codfw: es2021: Correctable memory error rate exceeded for DIMM_A1 - https://phabricator.wikimedia.org/T290327 (10Marostegui) I have agreed with @Papaul to do this after the switchover. [13:08:32] (03CR) 10Jbond: [C: 03+1] "lgmt" [software/spicerack] - 10https://gerrit.wikimedia.org/r/719260 (owner: 10Volans) [13:08:45] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:22] (03CR) 10Jbond: [C: 03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/719056 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [13:12:54] 10SRE, 10SRE-OnFire, 10observability, 10User-jbond: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 (10Volans) @cmooney thanks for looking into this! I'm no Java expert but the reference to BigDecimal:Class seems to be a Java one from their backend (see https://docs.... [13:16:16] (03CR) 10Muehlenhoff: "There are three major services using the hardened java.security settings: The IDPs (which I'll test in a bit). 
Looking at Debmonitor, Hado" [puppet] - 10https://gerrit.wikimedia.org/r/719064 (owner: 10Muehlenhoff) [13:18:24] (03CR) 10Volans: [C: 03+2] icinga: remove deprecated Icinga class [software/spicerack] - 10https://gerrit.wikimedia.org/r/719260 (owner: 10Volans) [13:21:19] !log removing pc2009 from tendril and zarcillo T289116 [13:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:25] T289116: decommission pc2009.codfw.wmnet - https://phabricator.wikimedia.org/T289116 [13:21:49] !log removing pc2009 from orchestrator T289116 [13:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:26] (03Merged) 10jenkins-bot: icinga: remove deprecated Icinga class [software/spicerack] - 10https://gerrit.wikimedia.org/r/719260 (owner: 10Volans) [13:24:32] 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2009.codfw.wmnet - https://phabricator.wikimedia.org/T289116 (10MatthewVernon) a:05MatthewVernon→03Papaul [13:24:43] 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2009.codfw.wmnet - https://phabricator.wikimedia.org/T289116 (10MatthewVernon) This host is ready for DC-Ops to decommission [13:25:45] !log mvernon@cumin1001 START - Cookbook sre.hosts.decommission for hosts pc2010.codfw.wmnet [13:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:03] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:23] (03CR) 10Dzahn: [C: 03+1] gitlab::backup move backup cronjobs to puppet [puppet] - 10https://gerrit.wikimedia.org/r/712322 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [13:33:30] (03PS1) 10Effie Mouzeli: mediawiki: add auto_prepend_file [deployment-charts] - 10https://gerrit.wikimedia.org/r/719264 (https://phabricator.wikimedia.org/T290485) [13:35:26] (03CR) 10JMeybohm: [C: 03+1] 
mediawiki: add auto_prepend_file [deployment-charts] - 10https://gerrit.wikimedia.org/r/719264 (https://phabricator.wikimedia.org/T290485) (owner: 10Effie Mouzeli) [13:37:16] (03CR) 10Dzahn: "I will amend to just create the class but not apply it." [puppet] - 10https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [13:37:27] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: add auto_prepend_file [deployment-charts] - 10https://gerrit.wikimedia.org/r/719264 (https://phabricator.wikimedia.org/T290485) (owner: 10Effie Mouzeli) [13:40:24] (03Merged) 10jenkins-bot: mediawiki: add auto_prepend_file [deployment-charts] - 10https://gerrit.wikimedia.org/r/719264 (https://phabricator.wikimedia.org/T290485) (owner: 10Effie Mouzeli) [13:40:46] 10SRE-swift-storage: Swift users and their usage - https://phabricator.wikimedia.org/T264291 (10jcrespo) I would like to bring to your attention T138915. This is **not a current user of Swift**, but it seems like something like this, a misc-object storage cluster, would be the ideal location, rather than a relat... [13:40:58] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts pc2010.codfw.wmnet [13:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:45] (03CR) 10Kormat: [C: 03+1] pc2010: remove puppet entries for pc2010 [puppet] - 10https://gerrit.wikimedia.org/r/719244 (https://phabricator.wikimedia.org/T289117) (owner: 10MVernon) [13:41:57] (03CR) 10MVernon: [C: 03+2] pc2010: remove puppet entries for pc2010 [puppet] - 10https://gerrit.wikimedia.org/r/719244 (https://phabricator.wikimedia.org/T289117) (owner: 10MVernon) [13:43:13] (03PS1) 10JMeybohm: custom_deploy: Add istio manifest for main clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/719265 [13:43:16] !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:46] (03PS2) 10JMeybohm: custom_deploy: Add istio manifest for main clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/719265 (https://phabricator.wikimedia.org/T287007) [13:46:00] 10SRE, 10ops-codfw, 10serviceops: mw2264 went down - https://phabricator.wikimedia.org/T290242 (10Papaul) 05Open→03Resolved @Dzahn I checked the server today i have no errors showing on A1 closing this task . IF we have the error again please reopen the task. Thanks [13:46:12] (03PS3) 10JMeybohm: custom_deploy: Add istio manifest for main clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/719265 (https://phabricator.wikimedia.org/T287007) [13:48:19] 10SRE, 10ops-codfw, 10serviceops: mw2264 went down - https://phabricator.wikimedia.org/T290242 (10Dzahn) Thank you @Papaul I will repool the server. [13:49:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2264.codfw.wmnet [13:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:41] !log mw2264 - scap pulled and repooled after T290242 [13:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:46] T290242: mw2264 went down - https://phabricator.wikimedia.org/T290242 [13:50:00] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:58] !log uncordoned kubestage2001 [13:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:05] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [13:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:46] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:55:14] that could be the consequence of my uncordon...maybe [13:56:18] RECOVERY - mediawiki-installation DSH group on mw2264 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:57:04] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2009.codfw.wmnet - https://phabricator.wikimedia.org/T289116 (10Papaul) [13:57:15] !log drain esams-eqiad for circuit maintenance - T288503 [13:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:52] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:58:02] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission pc2008.codfw.wmnet - https://phabricator.wikimedia.org/T289115 (10Papaul) [13:59:51] !log removing pc2010 from tendril and zarcillo T289117 [13:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:56] T289117: decommission pc2010.codfw.wmnet - https://phabricator.wikimedia.org/T289117 [14:00:55] (03PS6) 10Jbond: wmflib: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) [14:01:16] !log removing pc2010 from orchestrator T289117 [14:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:52] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:49] (03CR) 10Jbond: "Example:" [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [14:03:16] (03PS7) 10Jbond: puppetmaster: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) [14:03:49] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1021.eqiad.wmnet ` The log can be found in... 
[14:03:56] (03PS8) 10Jbond: puppetmaster: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) [14:04:36] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [14:04:38] PROBLEM - Check systemd state on logstash2027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:38] PROBLEM - Check systemd state on an-worker1136 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:48] PROBLEM - Check systemd state on db2114 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:52] PROBLEM - Check systemd state on mw1402 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:54] PROBLEM - Check systemd state on mw2290 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:55] PROBLEM - Check systemd state on mw1379 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:58] PROBLEM - Check systemd state on ganeti1009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:00] PROBLEM - Check systemd state on kafka-test1009 is CRITICAL: CRITICAL - degraded: 
The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:06] PROBLEM - Check systemd state on an-worker1110 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:10] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:16] PROBLEM - Check systemd state on mw2274 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:18] jbond: seems like related :) [14:05:18] PROBLEM - Check systemd state on an-worker1096 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:20] PROBLEM - Check systemd state on schema1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:20] PROBLEM - Check systemd state on cp2027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:28] PROBLEM - Check systemd state on sessionstore1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:30] PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:42] PROBLEM - Check systemd 
state on db1113 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:42] PROBLEM - Check systemd state on mw2253 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:46] PROBLEM - Check systemd state on cp2030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:46] PROBLEM - Check systemd state on db1155 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:48] PROBLEM - Check systemd state on cp2032 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:48] PROBLEM - Check systemd state on backup1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:52] PROBLEM - Check systemd state on ml-serve2004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:52] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:52] PROBLEM - Check systemd state on urldownloader1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:52] PROBLEM - Check systemd state on 
db1147 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:54] PROBLEM - Check systemd state on an-conf1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:04] PROBLEM - Check systemd state on mx1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:06] PROBLEM - Check systemd state on ganeti1011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:08] PROBLEM - Check systemd state on db1153 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:10] PROBLEM - Check systemd state on ganeti2014 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:10] PROBLEM - Check systemd state on an-worker1139 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:12] PROBLEM - Check systemd state on cloudelastic1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:12] PROBLEM - Check systemd state on restbase1019 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:13] like in the old days when we had 
non-summarized puppet reports. let me stop the bot [14:06:14] PROBLEM - Check systemd state on mw1394 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:14] PROBLEM - Check systemd state on mw1377 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:16] PROBLEM - Check systemd state on sessionstore1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:20] PROBLEM - Check systemd state on kubestagetcd1004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:25] PROBLEM - Check systemd state on restbase1020 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:26] PROBLEM - Check systemd state on kubernetes1017 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:26] PROBLEM - Check systemd state on mw1374 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:28] PROBLEM - Check systemd state on mw2330 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:30] PROBLEM - Check systemd state on cp5015 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:32] PROBLEM - Check systemd state on es2021 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:32] PROBLEM - Check systemd state on es2025 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:32] PROBLEM - Check systemd state on ms-be1029 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:34] PROBLEM - Check systemd state on mw2327 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:36] PROBLEM - Check systemd state on mw2252 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:36] PROBLEM - Check systemd state on db1096 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:36] PROBLEM - Check systemd state on rdb2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:52] PROBLEM - Check systemd state on wtp1044 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:56] PROBLEM - Check systemd state on cp1075 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:58] PROBLEM - Check systemd state on cp1077 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:00] PROBLEM - Check systemd state on db2089 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:00] PROBLEM - Check systemd state on db1108 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:00] PROBLEM - Check systemd state on db1127 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:02] PROBLEM - Check systemd state on cp3058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:04] PROBLEM - Check systemd state on kafka-main2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
An error occured trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:06] PROBLEM - Check systemd state on mw1353 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:10] PROBLEM - Check systemd state on mc2023 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:10] PROBLEM - Check systemd state on archiva1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:21] !log temp killed icinga-wm because of flooding [14:07:24] 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2010.codfw.wmnet - https://phabricator.wikimedia.org/T289117 (10MatthewVernon) a:05MatthewVernon→03Papaul [14:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:28] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1022.eqiad.wmnet ` The log can be found in... 
[14:07:30] 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission pc2010.codfw.wmnet - https://phabricator.wikimedia.org/T289117 (10MatthewVernon) This host is ready for DC-Ops to decommission [14:07:36] PROBLEM - Check systemd state on logstash2034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:36] PROBLEM - Check systemd state on mw1310 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:36] PROBLEM - Check systemd state on dbproxy1014 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:38] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:40] PROBLEM - Check systemd state on mw1412 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:40] PROBLEM - Check systemd state on mw1450 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:42] PROBLEM - Check systemd state on kafka-main1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:42] PROBLEM - Check systemd state on an-druid1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:44] PROBLEM - Check 
systemd state on ganeti2025 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:44] PROBLEM - Check systemd state on mw2362 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:44] PROBLEM - Check systemd state on ores2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:44] PROBLEM - Check systemd state on mw1453 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:45] PROBLEM - Check systemd state on mc1045 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:46] PROBLEM - Check systemd state on logstash2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:48] PROBLEM - Check systemd state on registry1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:48] PROBLEM - Check systemd state on pybal-test2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:48] PROBLEM - Check systemd state on mw1351 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:48] PROBLEM - Check systemd 
state on registry2004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:54] PROBLEM - Check systemd state on search-loader1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:08] !log alert1001 - temp disabled puppet, stopped icinga-wm [14:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:32] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1023.eqiad.wmnet ` The log can be found in... [14:08:57] jbond: ^ silenced it, can restart when needed [14:09:06] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1024.eqiad.wmnet ` The log can be found in... 
[14:09:22] thanks, mutante I cannot be so fast [14:09:44] mutante: ack thanks [14:15:06] (03PS1) 10Marostegui: check_flags_per_dc.sh: One liner to check a few things [software] - 10https://gerrit.wikimedia.org/r/719270 (https://phabricator.wikimedia.org/T288594) [14:15:36] (03PS1) 10Jbond: prometheus: fix regex when parsing git hash [puppet] - 10https://gerrit.wikimedia.org/r/719271 [14:15:52] (03PS2) 10Marostegui: check_flags_per_dc.sh: One liner to check a few things [software] - 10https://gerrit.wikimedia.org/r/719270 (https://phabricator.wikimedia.org/T288594) [14:16:18] 10SRE, 10SRE-OnFire, 10observability, 10User-jbond: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 (10Volans) >>! In T290425#7336136, @Volans wrote: > @cmooney thanks for looking into this! I'm no Java expert but the reference to BigDecimal:Class seems to be a Java... [14:17:33] !log No more db maintenance on eqiad T288594 [14:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:38] T288594: Pre DC switchover codfw -> eqiad DB work - https://phabricator.wikimedia.org/T288594 [14:17:59] Lumen circuit between eqiad and esams hot cut in progress [14:18:39] 10SRE, 10SRE-OnFire, 10observability, 10User-jbond: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 (10cmooney) Hey @volans nice catch! Let me see how it goes with rounded values. 
[14:19:25] time=81.643ms [14:19:27] (03CR) 10Marostegui: [C: 03+2] check_flags_per_dc.sh: One liner to check a few things [software] - 10https://gerrit.wikimedia.org/r/719270 (https://phabricator.wikimedia.org/T288594) (owner: 10Marostegui) [14:19:29] (03CR) 10Volans: [C: 04-1] "missing parentheses for method call" [puppet] - 10https://gerrit.wikimedia.org/r/719271 (owner: 10Jbond) [14:19:39] better than the 110ms from before [14:20:29] (03PS2) 10Jbond: prometheus: fix regex when parsing git hash [puppet] - 10https://gerrit.wikimedia.org/r/719271 [14:21:57] (03PS3) 10Jbond: prometheus: fix regex when parsing git hash [puppet] - 10https://gerrit.wikimedia.org/r/719271 [14:22:35] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1023.eqiad.wmnet with reason: REIMAGE [14:22:37] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcephosd1023.eqiad.wmnet with reason: REIMAGE [14:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:51] (03CR) 10Jbond: prometheus: fix regex when parsing git hash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719271 (owner: 10Jbond) [14:23:06] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1024.eqiad.wmnet with reason: REIMAGE [14:23:08] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcephosd1024.eqiad.wmnet with reason: REIMAGE [14:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:42] (03CR) 10Jbond: [C: 03+2] prometheus: fix regex when parsing git hash [puppet] - 10https://gerrit.wikimedia.org/r/719271 (owner: 10Jbond) [14:23:54] !log re-pool esams-eqiad - T288503 [14:23:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:28] (03PS4) 10JMeybohm: custom_deploy: Add istio manifest for main clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/719265 (https://phabricator.wikimedia.org/T287007) [14:24:30] (03PS1) 10JMeybohm: Rakefile: Add task validate_istio_config [deployment-charts] - 10https://gerrit.wikimedia.org/r/719272 [14:25:30] (03PS1) 10Jbond: prometheus: fix strip [puppet] - 10https://gerrit.wikimedia.org/r/719273 [14:26:48] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/719273 (owner: 10Jbond) [14:27:03] (03CR) 10Jbond: [C: 03+2] prometheus: fix strip [puppet] - 10https://gerrit.wikimedia.org/r/719273 (owner: 10Jbond) [14:28:16] (03PS3) 10Dzahn: create a generic class to clean the puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) [14:28:40] (03PS4) 10Dzahn: create a generic class to clean the puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) [14:29:57] (03CR) 10Dzahn: "I could either merge it like this or abandon .. hmm..." [puppet] - 10https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [14:32:36] (03CR) 10Dzahn: [C: 03+2] zuul: migrate cron of zuul_repack to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:32:40] 10SRE, 10SRE-OnFire, 10observability, 10User-jbond: statograph_post service fail on alert hosts - https://phabricator.wikimedia.org/T290425 (10cmooney) Spot on @volans works fine now: ` cmooney@wikilap:~/statograph_test$ statograph -v -c config.yaml upload_metrics INFO:statograph.uploader:Querying data for... 
[14:33:11] !log CI - migrating zuul-merger cronjob to systemd timer (contint*) [14:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:23] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1024.eqiad.wmnet'] ` and were **ALL** successful. [14:35:07] (03PS1) 10Jbond: prometheous: use correct variable config_yaml vs config_file [puppet] - 10https://gerrit.wikimedia.org/r/719274 [14:35:24] (03CR) 10Jbond: [V: 03+2 C: 03+2] prometheous: use correct variable config_yaml vs config_file [puppet] - 10https://gerrit.wikimedia.org/r/719274 (owner: 10Jbond) [14:36:10] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1023.eqiad.wmnet'] ` and were **ALL** successful. [14:37:23] (03PS4) 10Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - 10https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) [14:38:09] (03CR) 10Filippo Giunchedi: [C: 03+2] Try restarting rsyslog on package installation [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/719231 (https://phabricator.wikimedia.org/T210137) (owner: 10Filippo Giunchedi) [14:38:24] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1021.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudcephosd1021.eqiad.wmnet'] ` [14:38:30] (03CR) 10Dzahn: "deployed! 
confirmed on contint1001 and contint2001:" [puppet] - 10https://gerrit.wikimedia.org/r/711197 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:38:38] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1021.eqiad.wmnet ` The log can be found in... [14:39:08] (03PS5) 10Elukey: role::ml_k8s::worker: add LVS configuration for the inference svc [puppet] - 10https://gerrit.wikimedia.org/r/719239 (https://phabricator.wikimedia.org/T289835) [14:40:14] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10Dzahn) migrated zuul_repack (zuul::merger) on contint* servers [14:40:31] (03PS1) 10Ladsgroup: zuul: Drop absented cron [puppet] - 10https://gerrit.wikimedia.org/r/719275 [14:41:10] (03CR) 10jerkins-bot: [V: 04-1] zuul: Drop absented cron [puppet] - 10https://gerrit.wikimedia.org/r/719275 (owner: 10Ladsgroup) [14:41:15] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10Ladsgroup) Thanks! I added the patch to drop it. 
[14:41:17] (03CR) 10Zabe: [C: 03+1] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/719275 (owner: 10Ladsgroup) [14:41:52] (03PS2) 10Ladsgroup: zuul: Drop absented cron [puppet] - 10https://gerrit.wikimedia.org/r/719275 (https://phabricator.wikimedia.org/T273673) [14:42:41] (03CR) 10Dzahn: [C: 03+2] "checked if I can see any sign of ci::master being deployed on more than contint, like cloud, but see nothing" [puppet] - 10https://gerrit.wikimedia.org/r/719275 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:46:45] (03CR) 10Jbond: create a generic class to clean the puppet client bucket (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [14:48:10] (03CR) 10Dzahn: create a generic class to clean the puppet client bucket (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [14:48:12] (03CR) 10Dzahn: [C: 03+2] create a generic class to clean the puppet client bucket [puppet] - 10https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [14:52:04] (03CR) 10Filippo Giunchedi: [V: 03+1] sslcert: additional search paths for certificates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: 10Filippo Giunchedi) [14:52:13] (03CR) 10Elukey: [C: 03+1] "Really nice!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/719272 (owner: 10JMeybohm) [14:56:53] (03CR) 10Elukey: [C: 03+1] "Overall it LGTM. 
One thing worth to add as comment may be the possibility to scale the number of pod replicas, but for the basic test/star" [deployment-charts] - 10https://gerrit.wikimedia.org/r/719265 (https://phabricator.wikimedia.org/T287007) (owner: 10JMeybohm) [14:59:46] (03PS1) 10Alexandros Kosiaris: mwdebug: Add IPv6 addresses of etcd servers to egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/719278 [15:00:10] (03CR) 10Alexandros Kosiaris: [C: 03+2] mwdebug: Add IPv6 addresses of etcd servers to egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/719278 (owner: 10Alexandros Kosiaris) [15:02:56] (03Merged) 10jenkins-bot: mwdebug: Add IPv6 addresses of etcd servers to egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/719278 (owner: 10Alexandros Kosiaris) [15:04:09] !log upload python-prometheus-client_0.6.0 to stretch-wikimedia [15:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:14] (03CR) 10Ahmon Dancy: [C: 03+2] mediawiki-dev: Adjustments to allow for clean "rake helm_diff" run [deployment-charts] - 10https://gerrit.wikimedia.org/r/717621 (owner: 10Ahmon Dancy) [15:07:19] 10SRE, 10MW-on-K8s, 10Performance-Team, 10WikimediaDebug, 10serviceops: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (10Krinkle) [15:07:29] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Krinkle) [15:07:43] (03CR) 10Ahmon Dancy: [C: 03+2] check_binary: Improve error message [deployment-charts] - 10https://gerrit.wikimedia.org/r/717605 (owner: 10Ahmon Dancy) [15:07:59] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:12] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1022.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudcephosd1022.eqiad.wmnet'] ` [15:09:05] (03Merged) 10jenkins-bot: mediawiki-dev: Adjustments to allow for clean "rake helm_diff" run [deployment-charts] - 10https://gerrit.wikimedia.org/r/717621 (owner: 10Ahmon Dancy) [15:10:25] 10SRE, 10MW-on-K8s, 10Performance-Team, 10WikimediaDebug, 10serviceops: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (10Krinkle) @dpifke Effie did some benchmarking today for which XHGui was needed. tideways is installed and enabled... [15:10:32] (03Merged) 10jenkins-bot: check_binary: Improve error message [deployment-charts] - 10https://gerrit.wikimedia.org/r/717605 (owner: 10Ahmon Dancy) [15:16:29] (03CR) 10Btullis: [C: 03+1] Update puppetised java.security file for Java 11.0.12 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719064 (owner: 10Muehlenhoff) [15:18:54] !log run_benchmarky.py against mwdebug.svc.codfw.wmnet for performance tests [15:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:47] (03CR) 10Muehlenhoff: Update puppetised java.security file for Java 11.0.12 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/719064 (owner: 10Muehlenhoff) [15:21:19] Hi operations folks. I think I may have gotten the 'jenkins' k8s user auto-banned on the staging cluster. All k8s requests that I'm sending are being rejected with "Forbidden". Can someone have a look? 
[15:22:57] (03CR) 10Ema: [C: 03+1] varnish: Remove Vagrant test scripts [puppet] - 10https://gerrit.wikimedia.org/r/719236 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [15:23:32] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:36] 10SRE, 10Observability-Alerting, 10Traffic: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (10lmata) [15:24:47] (03PS1) 10Cathal Mooney: Added __post_init__ function to Datapoint class to round values to 9 decimal places. This is required to avoid apparent limit on what the statuspage.io API will accept. [software/statograph] - 10https://gerrit.wikimedia.org/r/719281 (https://phabricator.wikimedia.org/T290425) [15:24:50] (03PS1) 10Vgutierrez: haproxy: Allow using a custom systemd::service template [puppet] - 10https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005) [15:25:49] 10SRE, 10Icinga, 10Observability-Alerting, 10observability: Extend dpkg Icinga check to also check for inconsistent apt state - https://phabricator.wikimedia.org/T190693 (10lmata) [15:26:01] (03CR) 10jerkins-bot: [V: 04-1] Added __post_init__ function to Datapoint class to round values to 9 decimal places. This is required to avoid apparent limit on what the statuspage.io API will accept. 
[software/statograph] - 10https://gerrit.wikimedia.org/r/719281 (https://phabricator.wikimedia.org/T290425) (owner: 10Cathal Mooney) [15:28:32] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31018/console" [puppet] - 10https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:30:09] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] "pcc shows the expected DIFF at puppet level (added parameters to the haproxy class) and a NOOP at haproxy level" [puppet] - 10https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:31:14] (03PS1) 10Dzahn: swift: convert dispersion stats cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) [15:32:02] (03CR) 10jerkins-bot: [V: 04-1] swift: convert dispersion stats cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [15:33:47] (03PS2) 10Dzahn: swift: convert dispersion stats cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) [15:34:33] (03CR) 10Cwhite: [C: 03+2] logstash: route alertmanager logs to alerts index [puppet] - 10https://gerrit.wikimedia.org/r/717442 (https://phabricator.wikimedia.org/T289356) (owner: 10Cwhite) [15:35:03] (03PS3) 10Dzahn: swift: convert dispersion stats cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) [15:36:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] haproxy: Allow using a custom systemd::service template [puppet] - 10https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:39:30] (03PS2) 10Cathal Mooney: Added __post_init__ function to Datapoint class to round values to 9 decimal places. 
This is required to avoid apparent limit on what the statuspage.io API will accept. [software/statograph] - 10https://gerrit.wikimedia.org/r/719281 (https://phabricator.wikimedia.org/T290425) [15:40:03] (03CR) 10Filippo Giunchedi: [C: 03+1] haproxy: Allow using a custom systemd::service template [puppet] - 10https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:40:08] 10SRE, 10Wikimedia-Mailing-lists: Upgrade lists.wikimedia.org to next Mailman/hyperkitty/postorius versions - https://phabricator.wikimedia.org/T286217 (10Legoktm) 05Stalled→03Open p:05Lowest→03Medium postorius 1.3.5 was released, in addition to the unsubscribe security fix we already have: https://doc... [15:40:11] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Mailman doesn't replace email in notice when changing subscription email - https://phabricator.wikimedia.org/T286149 (10Legoktm) [15:40:17] 10SRE, 10Wikimedia-Mailing-lists: Poor link parsing in HyperKitty (Mailman 3) web archive - https://phabricator.wikimedia.org/T283909 (10Legoktm) [15:40:23] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: In Mailman3, users cannot change their display name from the web - https://phabricator.wikimedia.org/T283128 (10Legoktm) [15:40:58] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1371.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:44:41] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1021.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudcephosd1021.eqiad.wmnet'] ` [15:46:32] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:49:07] 10Puppet, 
10SRE, 10Cloud-Services, 10Infrastructure-Foundations, and 2 others: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10jbond) >>! In T165885#7314759, @elukey wrote: > @jbond sure! Question - is there a problem with the /var/log/camus directories... [15:49:17] (03CR) 10Volans: [C: 03+1] "LGTM, nit on the commit message" [software/statograph] - 10https://gerrit.wikimedia.org/r/719281 (https://phabricator.wikimedia.org/T290425) (owner: 10Cathal Mooney) [15:56:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission pc1007.eqiad.wmnet. - https://phabricator.wikimedia.org/T289118 (10wiki_willy) a:05wiki_willy→03Cmjohnson [15:56:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission pc1008.eqiad.wmnet - https://phabricator.wikimedia.org/T289119 (10wiki_willy) a:05wiki_willy→03Cmjohnson [15:57:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission pc1009.eqiad.wmnet - https://phabricator.wikimedia.org/T289120 (10wiki_willy) a:05wiki_willy→03Cmjohnson [15:57:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission pc1010.eqiad.wmnet - https://phabricator.wikimedia.org/T289122 (10wiki_willy) a:05wiki_willy→03Cmjohnson [16:00:05] jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210907T1600). [16:00:05] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:19] o/ [16:00:38] tgr: looking now [16:01:40] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:54] (03CR) 10Jbond: [C: 03+2] "LGTM merging" [puppet] - 10https://gerrit.wikimedia.org/r/716755 (https://phabricator.wikimedia.org/T283868) (owner: 10Gergő Tisza) [16:03:29] (03CR) 10Bstorm: [C: 03+1] "Doesn't seem controversial" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/713812 (owner: 10David Caro) [16:05:03] tgr: merged and deployed to mwmaint1002 https://phabricator.wikimedia.org/P17249 [16:05:12] thanks jbond! [16:05:17] np, let me know if there is anything elses you needed [16:09:49] (03PS4) 10Muehlenhoff: os-updates-report: Adapt to new OS tracking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/707371 [16:09:55] 10SRE, 10MW-on-K8s, 10Performance-Team, 10WikimediaDebug, 10serviceops: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (10jijiki) [16:10:14] (03PS1) 10Dzahn: cloud/devtools: set docker::registry to localhost [puppet] - 10https://gerrit.wikimedia.org/r/719292 [16:12:17] (03CR) 10jerkins-bot: [V: 04-1] os-updates-report: Adapt to new OS tracking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/707371 (owner: 10Muehlenhoff) [16:13:08] (03PS3) 10Cathal Mooney: Round float values to a fixed precision [software/statograph] - 10https://gerrit.wikimedia.org/r/719281 (https://phabricator.wikimedia.org/T290425) [16:13:53] (03CR) 10Cathal Mooney: [C: 03+2] "Merging." 
[software/statograph] - 10https://gerrit.wikimedia.org/r/719281 (https://phabricator.wikimedia.org/T290425) (owner: 10Cathal Mooney) [16:14:34] 10SRE, 10MW-on-K8s, 10serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) [16:15:38] (03Merged) 10jenkins-bot: Round float values to a fixed precision [software/statograph] - 10https://gerrit.wikimedia.org/r/719281 (https://phabricator.wikimedia.org/T290425) (owner: 10Cathal Mooney) [16:18:06] (03PS1) 10Jbond: P:puppet: Add alerting for large files in client bucket [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) [16:18:18] (03CR) 10Muehlenhoff: haproxy: Allow using a custom systemd::service template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719282 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [16:19:06] (03CR) 10jerkins-bot: [V: 04-1] P:puppet: Add alerting for large files in client bucket [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond) [16:19:26] (03PS2) 10JMeybohm: Rakefile: Add task validate_istio_config [deployment-charts] - 10https://gerrit.wikimedia.org/r/719272 [16:19:28] (03PS5) 10JMeybohm: custom_deploy: Add istio manifest for main clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/719265 (https://phabricator.wikimedia.org/T287007) [16:19:30] (03PS1) 10JMeybohm: admin_ng: Support managing of system namespaces with helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/719295 [16:19:32] (03PS1) 10JMeybohm: admin_ng/main: Create istio-system namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/719296 [16:20:37] (03PS2) 10Jbond: P:puppet: Add alerting for large files in client bucket [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) [16:21:45] (03PS2) 10JMeybohm: admin_ng/main: Create istio-system namespace [deployment-charts] - 
10https://gerrit.wikimedia.org/r/719296 [16:23:04] PROBLEM - Check systemd state on cp4030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:15] (03PS1) 10Jgiannelos: push-notifications: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/719297 [16:26:28] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:33] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: set docker::registry to localhost [puppet] - 10https://gerrit.wikimedia.org/r/719292 (owner: 10Dzahn) [16:30:23] !log dancy@deploy1002 Synchronized README: testing (duration: 00m 59s) [16:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:49] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [16:30:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue [16:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:00] (03PS3) 10Jbond: P:puppet: Add alerting for large files in client bucket [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) [16:32:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31022/console" [puppet] - 10https://gerrit.wikimedia.org/r/719293 (https://phabricator.wikimedia.org/T165885) (owner: 10Jbond) [16:33:52] (03PS2) 10JMeybohm: admin_ng: Support managing of system namespaces with helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/719295 
[16:33:54] (03PS3) 10JMeybohm: admin_ng/main: Create istio-system namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/719296 [16:36:08] (03CR) 10Jgiannelos: [C: 03+2] push-notifications: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/719297 (owner: 10Jgiannelos) [16:39:23] (03Merged) 10jenkins-bot: push-notifications: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/719297 (owner: 10Jgiannelos) [16:39:42] (03PS1) 10Bstorm: quarry dbbackup: fix the script typo [puppet] - 10https://gerrit.wikimedia.org/r/719301 (https://phabricator.wikimedia.org/T289568) [16:39:44] !log installing jetty9 security updates on buster [16:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:35] (03PS3) 10JMeybohm: admin_ng: Support managing of system namespaces with helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/719295 [16:41:37] (03PS4) 10JMeybohm: admin_ng/main: Create istio-system namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/719296 [16:41:41] (03CR) 10Bstorm: [C: 03+2] quarry dbbackup: fix the script typo [puppet] - 10https://gerrit.wikimedia.org/r/719301 (https://phabricator.wikimedia.org/T289568) (owner: 10Bstorm) [16:42:56] (03PS1) 10Jbond: P:base: drop broad dependency [puppet] - 10https://gerrit.wikimedia.org/r/719302 (https://phabricator.wikimedia.org/T244477) [16:45:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31023/console" [puppet] - 10https://gerrit.wikimedia.org/r/719302 (https://phabricator.wikimedia.org/T244477) (owner: 10Jbond) [16:46:15] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:base: drop broad dependency [puppet] - 10https://gerrit.wikimedia.org/r/719302 (https://phabricator.wikimedia.org/T244477) (owner: 10Jbond) [16:48:18] (03PS9) 10Jbond: puppetmaster: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 
(https://phabricator.wikimedia.org/T283585) [16:49:02] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [16:50:12] (03CR) 10Gergő Tisza: Growth: Remove config that moved on-wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/717039 (https://phabricator.wikimedia.org/T290295) (owner: 10Urbanecm) [16:51:07] (03PS10) 10Jbond: puppetmaster: puppet prometheus reporting [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) [16:51:44] RECOVERY - Check systemd state on cp4030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:05] chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210907T1700). [17:01:28] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [17:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:46] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:44] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [17:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:53] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' . 
[17:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:23] (03PS1) 10Ahmon Dancy: ::profile::mediawiki::common.pp: Allow mwdeploy to run sudo /usr/local/sbin/restart-php7.2-fpm --force [puppet] - 10https://gerrit.wikimedia.org/r/719307 (https://phabricator.wikimedia.org/T290038) [17:29:52] (03CR) 10jerkins-bot: [V: 04-1] ::profile::mediawiki::common.pp: Allow mwdeploy to run sudo /usr/local/sbin/restart-php7.2-fpm --force [puppet] - 10https://gerrit.wikimedia.org/r/719307 (https://phabricator.wikimedia.org/T290038) (owner: 10Ahmon Dancy) [17:31:42] (03PS2) 10Ahmon Dancy: Allow mwdeploy to run sudo /usr/local/sbin/restart-php7.2-fpm --force [puppet] - 10https://gerrit.wikimedia.org/r/719307 (https://phabricator.wikimedia.org/T290038) [17:35:51] (03PS3) 10Ahmon Dancy: Allow mwdeploy to run sudo /usr/local/sbin/restart-php7.2-fpm --force [puppet] - 10https://gerrit.wikimedia.org/r/719307 (https://phabricator.wikimedia.org/T290038) [17:43:04] (03CR) 10Jdlrobson: "To clarify: Should this be merged today or next Monday to make sure Italian is a group 1 wiki for next week's deploy?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson) [17:43:46] (03PS5) 10RLazarus: icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) [17:47:07] (03CR) 10RhinosF1: Italian Wikipedia is now a group 1 wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715571 (https://phabricator.wikimedia.org/T286664) (owner: 10Jdlrobson) [17:48:44] Jdlrobson: hi [17:49:00] I can explain better if you have Qs what I meant [17:49:34] (03CR) 10jerkins-bot: [V: 04-1] icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [17:51:44] (03PS6) 10RLazarus: icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) [17:56:58] (03CR) 10jerkins-bot: [V: 04-1] icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [17:58:35] (03PS7) 10RLazarus: icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) [18:00:04] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210907T1800). [18:00:04] No Gerrit patches in the queue for this window AFAICS. 
[18:02:12] 10SRE, 10MW-on-K8s, 10serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) Our initial benchmarks that @akosiaris showed that k8s was slower than baremetal, while at higher concurrencies the difference between the two was smaller. We have observed our b... [18:04:53] (03CR) 10RLazarus: icinga: Add downtime_services and remove_service_downtimes (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [18:08:02] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 279 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:09:56] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 49 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:17:29] (03PS3) 10RLazarus: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) [18:28:37] (03CR) 10RLazarus: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [18:49:05] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) [18:57:52] 10SRE, 10MW-on-K8s, 10Performance-Team, 10WikimediaDebug, 10serviceops: Ensure WikimediaDebug "log" and "profile" features work with k8s-mwdebug - https://phabricator.wikimedia.org/T288164 (10jijiki) [18:58:02] 10SRE, 
10MW-on-K8s, 10serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) [18:58:06] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) cloudcephosd1023 and 1024 installed and are set to staged. 1021 and 1022 both in C8 get stuck during the partitioning phase of the install. I need to... [19:03:24] (03PS1) 10Bstorm: cloud nfs: Update the drbd config to allow buster+ [puppet] - 10https://gerrit.wikimedia.org/r/719326 (https://phabricator.wikimedia.org/T283385) [19:04:08] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 51.02 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:06:04] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:07:40] (03CR) 10Bstorm: [C: 03+2] "I'm just going to merge this to unblock experiments, let me know if you have any nits to suggest, and I'll add that in another patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/719326 (https://phabricator.wikimedia.org/T283385) (owner: 10Bstorm) [19:18:54] PROBLEM - MariaDB Replica Lag: s4 on db1150 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1128.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:19:08] PROBLEM - MariaDB Replica Lag: s4 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1138.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:27:30] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Create coolest-tool-academy mailing list for Coolest Tool Award - https://phabricator.wikimedia.org/T290511 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Done [19:27:32] (03CR) 10Eevans: [C: 04-1] "See comments inline." [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/719051 (https://phabricator.wikimedia.org/T178169) (owner: 10Hnowlan) [19:31:22] (03CR) 10Eevans: [C: 04-1] "I wonder if T178169 should even be considered valid. The utilities in `cassandra-tools-wmf` were meant to support multi-instance (which h" [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/719051 (https://phabricator.wikimedia.org/T178169) (owner: 10Hnowlan) [20:00:33] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [20:01:17] (03CR) 10RLazarus: [C: 03+2] "Thanks for the review!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [20:02:57] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [20:10:04] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:10:05] (03Merged) 10jenkins-bot: icinga: Add downtime_services and remove_service_downtimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/718935 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [20:15:50] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:27:38] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [20:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:57] (03CR) 10Ladsgroup: [C: 04-1] "Generally looks fine, just this note." 
[puppet] - 10https://gerrit.wikimedia.org/r/719285 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:29:19] (03PS1) 10RLazarus: icinga: Add @services_downtimed decorator [software/spicerack] - 10https://gerrit.wikimedia.org/r/719356 [20:31:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:49] (03PS2) 10Ladsgroup: Set $wgWBRepoSettings['tmpNormalizeDataValues'] on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715018 (https://phabricator.wikimedia.org/T251480) (owner: 10Lucas Werkmeister (WMDE)) [20:35:08] jouncebot: now [20:35:08] No deployments scheduled for the next 2 hour(s) and 24 minute(s) [20:35:17] cool deploying this patch above [20:36:06] (03CR) 10Ladsgroup: [C: 03+2] Set $wgWBRepoSettings['tmpNormalizeDataValues'] on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715018 (https://phabricator.wikimedia.org/T251480) (owner: 10Lucas Werkmeister (WMDE)) [20:37:12] (03Merged) 10jenkins-bot: Set $wgWBRepoSettings['tmpNormalizeDataValues'] on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715018 (https://phabricator.wikimedia.org/T251480) (owner: 10Lucas Werkmeister (WMDE)) [20:40:09] Tested on mwdebug2002, works fine, moving forward [20:41:15] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:715018|Set $wgWBRepoSettings['tmpNormalizeDataValues'] on all wikis (T251480)]] (duration: 00m 59s) [20:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:21] T251480: Normalize pagenames/filenames on save in Wikibase - https://phabricator.wikimedia.org/T251480 [20:52:42] (03PS1) 10Brennen Bearnes: dev-images: migrate repository to gitlab remote [puppet] - 10https://gerrit.wikimedia.org/r/719363 (https://phabricator.wikimedia.org/T290259) [20:57:26] (03CR) 10Ebernhardson: "PCC looks as expected, only concrete 
change on the hosts is adding the rewrite clauses to the conf files. https://puppet-compiler.wmflabs." [puppet] - 10https://gerrit.wikimedia.org/r/714624 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [21:08:41] (03PS1) 10Nikki Nikkhoui: Remove image suggestion api from lookup table [puppet] - 10https://gerrit.wikimedia.org/r/719366 (https://phabricator.wikimedia.org/T288132) [21:09:02] (03CR) 10Cwhite: [C: 03+1] Suspend mmkubernetes on connection errors [debs/rsyslog] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/715227 (https://phabricator.wikimedia.org/T289766) (owner: 10JMeybohm) [21:10:58] (03PS1) 10Jbond: puppetmaster: drop log messages from logstash reporter [puppet] - 10https://gerrit.wikimedia.org/r/719368 [21:26:12] RECOVERY - MariaDB Replica Lag: s4 on db1150 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:33:38] (03PS1) 10Jbond: P:puppetmaster::common: Add back logstash support [puppet] - 10https://gerrit.wikimedia.org/r/719372 [21:39:59] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the follow up, 2 very minor nits on commit/docs, no need to review again" [software/spicerack] - 10https://gerrit.wikimedia.org/r/719356 (owner: 10RLazarus) [21:57:16] RECOVERY - MariaDB Replica Lag: s4 on db2139 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:01:56] (03PS4) 10Legoktm: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [22:03:18] PROBLEM - Check systemd state on ms-be2032 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:09:42] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1051 - 
https://phabricator.wikimedia.org/T290442 (10wiki_willy) a:03Cmjohnson In warranty thru 2022-08-07 [22:10:42] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1062 - https://phabricator.wikimedia.org/T290416 (10wiki_willy) a:03Cmjohnson In warranty thru 2023-10-27 [22:21:52] (03CR) 10Jforrester: [C: 03+1] dev-images: migrate repository to gitlab remote [puppet] - 10https://gerrit.wikimedia.org/r/719363 (https://phabricator.wikimedia.org/T290259) (owner: 10Brennen Bearnes) [22:24:53] (03PS1) 10Andrew Bogott: vendordata: explicitly remove ephemeral0 from cloud-init mounting [puppet] - 10https://gerrit.wikimedia.org/r/719376 (https://phabricator.wikimedia.org/T290372) [22:26:18] (03CR) 10Andrew Bogott: [C: 03+2] vendordata: explicitly remove ephemeral0 from cloud-init mounting [puppet] - 10https://gerrit.wikimedia.org/r/719376 (https://phabricator.wikimedia.org/T290372) (owner: 10Andrew Bogott) [22:28:20] RECOVERY - Check systemd state on ms-be2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:39:26] (03CR) 10Cwhite: [C: 03+1] rsyslog: stop saving trafficserver logs to disk [puppet] - 10https://gerrit.wikimedia.org/r/719052 (https://phabricator.wikimedia.org/T290305) (owner: 10Ema) [22:48:52] (03PS1) 10Ladsgroup: alertmanager: Send email on resolve for wikidata team [puppet] - 10https://gerrit.wikimedia.org/r/719380 [22:51:58] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:53:11] (03CR) 10Cwhite: puppetmaster: puppet prometheus reporting (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/719100 (https://phabricator.wikimedia.org/T283585) (owner: 10Jbond) [22:54:52] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/719107 (https://phabricator.wikimedia.org/T281359) (owner: 10Filippo Giunchedi) [22:55:33] (03CR) 10Cwhite: [C: 03+1] 
o11y: add udp receive errors for statsd [alerts] - 10https://gerrit.wikimedia.org/r/719123 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [22:55:52] (03CR) 10Cwhite: [C: 03+1] statsd: remove statsd_udp_inbound_errors [puppet] - 10https://gerrit.wikimedia.org/r/719124 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [22:56:52] (03CR) 10Cwhite: [C: 03+1] prometheus: add ThanosSidecarUploadFailure to prometheus/ops [puppet] - 10https://gerrit.wikimedia.org/r/719126 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [22:57:22] (03CR) 10Cwhite: [C: 03+2] Remove image suggestion api from lookup table [puppet] - 10https://gerrit.wikimedia.org/r/719366 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui) [22:59:34] (03PS1) 10Ladsgroup: Enable UrlShortener everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719381 (https://phabricator.wikimedia.org/T267925) [23:00:03] jouncebot: now [23:00:03] For the next 0 hour(s) and 59 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210907T2300) [23:00:05] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210907T2300). [23:00:05] dpifke: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:31] Here. Patch is self service, will go ahead with it if noone else has anything that needs to go out. 
[23:00:48] dpifke: please go ahead and let me know once you're done [23:01:09] I'll be deploying url shortener stuff [23:01:15] (03CR) 10Dave Pifke: [C: 03+2] profiler: use seperate pipeline inside k8s pods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716041 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke) [23:01:22] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10Jclark-ctr) ms-be1067 D4 U33 CABLEID#11042 PORT36 [23:01:33] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10Jclark-ctr) [23:02:08] (03Merged) 10jenkins-bot: profiler: use seperate pipeline inside k8s pods [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716041 (https://phabricator.wikimedia.org/T288165) (owner: 10Dave Pifke) [23:07:36] !log dpifke@deploy1002 Synchronized wmf-config/profiler.php: Config: [[gerrit:716041|profiler: use seperate pipeline inside k8s pods (T288165)]] (duration: 00m 58s) [23:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:41] T288165: Create separate ArcLamp pipeline for k8s-mwdebug - https://phabricator.wikimedia.org/T288165 [23:08:41] Amri1: Done. [23:08:48] Amir1: Done [23:08:57] awesome! 
[23:09:02] Thanks [23:09:26] (03CR) 10Ladsgroup: [C: 03+2] Enable UrlShortener everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719381 (https://phabricator.wikimedia.org/T267925) (owner: 10Ladsgroup) [23:10:22] (03Merged) 10jenkins-bot: Enable UrlShortener everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719381 (https://phabricator.wikimedia.org/T267925) (owner: 10Ladsgroup) [23:12:44] looks good on mwdebug2002, moving forward [23:13:54] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:719381|Enable UrlShortener everywhere (T267925)]] (duration: 00m 58s) [23:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:58] T267925: Allow displaying URL shortener link in sidebar for foreign wiki - https://phabricator.wikimedia.org/T267925 [23:15:49] gogogogo [23:16:28] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10Jclark-ctr) [23:16:40] legoktm: wohooo \o/ [23:19:53] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10Jclark-ctr) Finished Provision a server's network attributes script on netbox configured bios handing over to rob for hopefully finishing [23:20:33] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10Jclark-ctr) a:05Jclark-ctr→03RobH [23:20:51] !log robh@cumin1001 START - Cookbook sre.dns.netbox [23:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:49] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:34] (03PS1) 10RobH: ms-be1067 updates [puppet] - 10https://gerrit.wikimedia.org/r/719384 (https://phabricator.wikimedia.org/T285808) [23:38:17] (03CR) 10RobH: [C: 03+2] 
ms-be1067 updates [puppet] - 10https://gerrit.wikimedia.org/r/719384 (https://phabricator.wikimedia.org/T285808) (owner: 10RobH) [23:47:11] (03PS3) 10Legoktm: mailman: Drop listinfo files [puppet] - 10https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [23:50:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10RobH) [23:52:56] (03CR) 10Legoktm: [C: 03+2] mailman: Drop listinfo files [puppet] - 10https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [23:54:30] Amir1: hm, puppet failed [23:54:58] https://phabricator.wikimedia.org/P17251 [23:55:01] legoktm: hmm, how so? [23:55:27] Hmm. Let me check [23:56:39] (03PS1) 10Legoktm: mailman: Try to fix 4869d91b0beb92 [puppet] - 10https://gerrit.wikimedia.org/r/719386 [23:56:45] Amir1: ^ how's that look? [23:57:11] (03CR) 10Ladsgroup: [C: 03+1] mailman: Try to fix 4869d91b0beb92 [puppet] - 10https://gerrit.wikimedia.org/r/719386 (owner: 10Legoktm) [23:57:18] * legoktm waits for pcc [23:57:30] legoktm: yeah much better, if it's one host, we can do this instead [23:57:40] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31025/console" [puppet] - 10https://gerrit.wikimedia.org/r/719386 (owner: 10Legoktm) [23:58:00] (03CR) 10Legoktm: [V: 03+1 C: 03+2] mailman: Try to fix 4869d91b0beb92 [puppet] - 10https://gerrit.wikimedia.org/r/719386 (owner: 10Legoktm) [23:58:18] in an hour, I'll make a patch to remove the stuff [23:58:53] Notice: /Stage[main]/Mailman::Webui/File[/etc/mailman]: Not removing directory; use 'force' to override [23:58:53] Notice: /Stage[main]/Mailman::Webui/File[/etc/mailman]/ensure: removed [23:59:02] let me just do it manually [23:59:17] I already checked the directory just to make sure it had nothing useful, it did not [23:59:29] *pretends to be shocked*