[00:18:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:38:56] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/975935
[00:39:02] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/975935 (owner: TrainBranchBot)
[00:41:26] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1158.eqiad.wmnet with OS bullseye
[00:41:31] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye
[00:42:53] (CR) Tim Starling: [C: +2] Reapply "Enable LoginNotify seen subnets table"" [mediawiki-config] - https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[00:44:02] (PS2) Tim Starling: Reapply "Enable LoginNotify seen subnets table"" [mediawiki-config] - https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989)
[00:44:09] (CR) Tim Starling: Reapply "Enable LoginNotify seen subnets table"" [mediawiki-config] - https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[00:44:52] (Merged) jenkins-bot: Reapply "Enable LoginNotify seen subnets table"" [mediawiki-config] - https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[00:54:09] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/975935 (owner: TrainBranchBot)
[00:55:15] !log tstarling@deploy2002 Synchronized wmf-config/CommonSettings.php: enable LoginNotify seen subnets table g965663 T346989 (duration: 06m 23s)
[00:55:20] T346989: Deploy LoginNotify seen subnets table - https://phabricator.wikimedia.org/T346989
[01:09:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:11:08] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.503 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:21:45] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1158']
[01:27:51] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1158']
[01:28:31] SRE, ops-eqiad: aqs1012: reseat SSD (/dev/sdh)? - https://phabricator.wikimedia.org/T351320 (Eevans) Open→Resolved a:Jclark-ctr This done; Thanks!
[01:30:20] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:31:34] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:35:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:41] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1158.eqiad.wmnet with OS bullseye
[02:01:46] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[02:07:32] PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:08:46] PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:21:06] RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:21:14] RECOVERY - Hadoop NodeManager on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:26:24] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:38:23] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:22] (PS4) Pppery: Run generate.php and arc liberate [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763)
[02:59:46] (PS7) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[02:59:48] (PS5) Pppery: Run generate.php and arc liberate [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763)
[03:02:54] (PS8) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[03:02:56] (PS5) Pppery: Merge in changes to qqq.json rather than overwriting them [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975392 (https://phabricator.wikimedia.org/T351363)
[03:02:58] (PS4) Pppery: Undo qqq.json overwrites [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363)
[03:03:59] (PS6) Pppery: Run generate.php and arc liberate [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763)
[03:08:23] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:18:48] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:18:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:23:23] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:24:10] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:28:32] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:28:40] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:32:40] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:32:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:36:20] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:36:40] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50862 bytes in 2.353 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:36:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.962 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:45:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:45:14] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:46:18] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:50:36] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:04:05] (PS1) KartikMistry: Update cxserver to 2023-11-20-052250-production [deployment-charts] - https://gerrit.wikimedia.org/r/976369 (https://phabricator.wikimedia.org/T341458)
[04:18:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:28:10] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:21:22] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:10:53] (PS1) Marostegui: Revert "db1210: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/976334
[06:13:05] (PS1) Marostegui: ProductionServices.php: Promote pc2014 to pc2 master [mediawiki-config] - https://gerrit.wikimedia.org/r/976383 (https://phabricator.wikimedia.org/T351620)
[06:13:31] (CR) Marostegui: [C: +2] Revert "db1210: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/976334 (owner: Marostegui)
[06:14:02] (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc2014 to pc2 master [mediawiki-config] - https://gerrit.wikimedia.org/r/976383 (https://phabricator.wikimedia.org/T351620) (owner: Marostegui)
[06:14:54] (Merged) jenkins-bot: ProductionServices.php: Promote pc2014 to pc2 master [mediawiki-config] - https://gerrit.wikimedia.org/r/976383 (https://phabricator.wikimedia.org/T351620) (owner: Marostegui)
[06:15:07] (PS1) Marostegui: pc2012: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/976384 (https://phabricator.wikimedia.org/T351620)
[06:15:27] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:976383|ProductionServices.php: Promote pc2014 to pc2 master (T351620)]]
[06:15:33] T351620: Upgrade pc2 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351620
[06:15:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2012,2014].codfw.wmnet,pc[1012,1014].eqiad.wmnet with reason: Switch
[06:15:53] (CR) Marostegui: [C: +2] pc2012: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/976384 (https://phabricator.wikimedia.org/T351620) (owner: Marostegui)
[06:16:16] (PS1) Stevemunene: set druid hosts to use the reuse partman recipe [puppet] - https://gerrit.wikimedia.org/r/976385 (https://phabricator.wikimedia.org/T332589)
[06:16:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2012,2014].codfw.wmnet,pc[1012,1014].eqiad.wmnet with reason: Switch
[06:16:50] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:976383|ProductionServices.php: Promote pc2014 to pc2 master (T351620)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:17:04] !log marostegui@deploy2002 marostegui: Continuing with sync
[06:22:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 10%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53688 and previous config saved to /var/cache/conftool/dbconfig/20231122-062228-root.json
[06:22:34] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 7h 1m 1s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[06:22:56] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:976383|ProductionServices.php: Promote pc2014 to pc2 master (T351620)]] (duration: 07m 28s)
[06:23:01] T351620: Upgrade pc2 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351620
[06:23:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc2012.codfw.wmnet with OS bookworm
[06:25:56] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_purge_parsercache_pc2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:28:52] (CR) Nikerabbit: [C: +1] "Any idea why the Translatewiki files contain language name in addition to the language code in the file names?" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694) (owner: Pppery)
[06:31:20] (CR) Nikerabbit: "There might be a few more on Thursday as then is the next export after I finished importing all Phabricator changes." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363) (owner: Pppery)
[06:37:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 25%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53689 and previous config saved to /var/cache/conftool/dbconfig/20231122-063733-root.json
[06:38:45] (PS1) Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/976337
[06:38:52] (CR) Marostegui: [C: -2] "Not yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/976337 (owner: Marostegui)
[06:40:58] (PS1) Marostegui: Revert "pc2012: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/976338
[06:41:06] (CR) Marostegui: [C: -2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/976338 (owner: Marostegui)
[06:41:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2012.codfw.wmnet with reason: host reimage
[06:44:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2012.codfw.wmnet with reason: host reimage
[06:52:34] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 8h 5m 50s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[06:52:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 50%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53690 and previous config saved to /var/cache/conftool/dbconfig/20231122-065238-root.json
[06:57:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2012.codfw.wmnet with OS bookworm
[06:58:12] (CR) Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/976337 (owner: Marostegui)
[06:58:17] jouncebot: next
[06:58:18] In 0 hour(s) and 1 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T0700)
[06:58:37] (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/976337 (owner: Marostegui)
[06:59:19] (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/976337 (owner: Marostegui)
[06:59:46] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:976337|Revert "ProductionServices.php: Promote pc2014 to pc2 master"]]
[06:59:47] (CR) Marostegui: Revert "pc2012: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/976338 (owner: Marostegui)
[06:59:50] (CR) Marostegui: [C: +2] Revert "pc2012: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/976338 (owner: Marostegui)
[07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T0700)
[07:01:04] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:976337|Revert "ProductionServices.php: Promote pc2014 to pc2 master"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:02:08] !log marostegui@deploy2002 marostegui: Continuing with sync
[07:07:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 75%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53691 and previous config saved to /var/cache/conftool/dbconfig/20231122-070742-root.json
[07:07:57] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:976337|Revert "ProductionServices.php: Promote pc2014 to pc2 master"]] (duration: 08m 10s)
[07:19:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 to test 10.4.32 T351283', diff saved to https://phabricator.wikimedia.org/P53692 and previous config saved to /var/cache/conftool/dbconfig/20231122-071911-root.json
[07:19:17] T351283: Compile and package MariaDB 10.6.16 and 10.4.32 - https://phabricator.wikimedia.org/T351283
[07:22:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 100%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53693 and previous config saved to /var/cache/conftool/dbconfig/20231122-072247-root.json
[07:22:53] (PS1) Marostegui: pc2014: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/976648
[07:23:23] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:25:01] (CR) Marostegui: [C: +2] pc2014: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/976648 (owner: Marostegui)
[07:31:08] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:31:56] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:33:48] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:34:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.218 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:44:08] (PS1) Marostegui: apt_repo.yaml: Do not reimage db1236 [puppet] - https://gerrit.wikimedia.org/r/976649
[07:44:40] (CR) Marostegui: [C: +2] apt_repo.yaml: Do not reimage db1236 [puppet] - https://gerrit.wikimedia.org/r/976649 (owner: Marostegui)
[07:49:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 10%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53694 and previous config saved to /var/cache/conftool/dbconfig/20231122-074923-root.json
[07:50:37] (PS2) KartikMistry: Enable Content/Section translation on some Wikipedias with potential to be supported with MinT [mediawiki-config] - https://gerrit.wikimedia.org/r/975924 (https://phabricator.wikimedia.org/T345267)
[07:53:47] (CR) WMDE-Fisch: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: WMDE-Fisch)
[07:54:53] (CR) Muehlenhoff: [C: +1] "Looks good, one nit inline." [puppet] - https://gerrit.wikimedia.org/r/976312 (https://phabricator.wikimedia.org/T348837) (owner: Ssingh)
[07:56:41] (CR) Muehlenhoff: [C: +1] "Makes sense" [puppet] - https://gerrit.wikimedia.org/r/976198 (owner: Jbond)
[07:58:41] (PS1) Marostegui: apt_repo.yaml: Do not reimage db1238 [puppet] - https://gerrit.wikimedia.org/r/976652
[07:59:19] (CR) Marostegui: [C: +2] apt_repo.yaml: Do not reimage db1238 [puppet] - https://gerrit.wikimedia.org/r/976652 (owner: Marostegui)
[08:00:05] Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T0800).
[08:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:03:52] SRE, ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[08:04:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53695 and previous config saved to /var/cache/conftool/dbconfig/20231122-080428-root.json
[08:04:32] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: logging::mediawiki::udp2log
[08:04:32] * kart_ is here
[08:05:07] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/975924 (https://phabricator.wikimedia.org/T345267) (owner: KartikMistry)
[08:05:51] (Merged) jenkins-bot: Enable Content/Section translation on some Wikipedias with potential to be supported with MinT [mediawiki-config] - https://gerrit.wikimedia.org/r/975924 (https://phabricator.wikimedia.org/T345267) (owner: KartikMistry)
[08:06:05] !log kartik@deploy2002 Started scap: Backport for [[gerrit:975924|Enable Content/Section translation on some Wikipedias with potential to be supported with MinT (T345267)]]
[08:06:21] T345267: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT - https://phabricator.wikimedia.org/T345267
[08:06:51] (PS1) Muehlenhoff: Switch mwlog to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976653 (https://phabricator.wikimedia.org/T349619)
[08:07:19] !log kartik@deploy2002 kartik: Backport for [[gerrit:975924|Enable Content/Section translation on some Wikipedias with potential to be supported with MinT (T345267)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:09:01] !log kartik@deploy2002 kartik: Continuing with sync
[08:10:40] (CR) Muehlenhoff: [C: +2] Switch mwlog to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976653 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:14:51] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:975924|Enable Content/Section translation on some Wikipedias with potential to be supported with MinT (T345267)]] (duration: 08m 46s)
[08:14:56] T345267: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT - https://phabricator.wikimedia.org/T345267
[08:17:21] I'm done with deployment;
[08:18:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: logging::mediawiki::udp2log
[08:18:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:18:53] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[08:19:06] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[08:19:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T348183)', diff saved to https://phabricator.wikimedia.org/P53696 and previous config saved to /var/cache/conftool/dbconfig/20231122-081912-arnaudb.json
[08:19:17] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[08:19:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53697 and previous config saved to /var/cache/conftool/dbconfig/20231122-081933-root.json
[08:19:56] (CR) Mvolz: rest-gateway: add params to config, rework citoid path matching (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/973362 (https://phabricator.wikimedia.org/T329049) (owner: Hnowlan)
[08:22:47] (CR) Filippo Giunchedi: [C: +1] "LGTM, will need deployment to a single host first and make sure everything is working as expected, especially the paging https probes" [puppet] - https://gerrit.wikimedia.org/r/976273 (https://phabricator.wikimedia.org/T351624) (owner: Jbond)
[08:26:31] SRE, Cloud-VPS, observability, Patch-For-Review: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (fgiunchedi) >>! In T351710#9349895, @Vgutierrez wrote: > nice, but please set a sane TLS configuration :) ideally nothing lower than TLSv1.2 and solid ciphersuites Tra...
[08:27:56] RECOVERY - Check systemd state on an-worker1113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:19] SRE-swift-storage, Commons, Data-Persistence-Backup, MediaWiki-File-management, media-backups: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (jcrespo) This is now deployed and media-backups schema is up to date. Media backups are flowing as usual. I am no...
[08:32:09] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: titan
[08:32:39] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:33:58] (PS1) Muehlenhoff: Switch titan to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976655 (https://phabricator.wikimedia.org/T349619)
[08:34:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53698 and previous config saved to /var/cache/conftool/dbconfig/20231122-083438-root.json
[08:35:10] (CR) Muehlenhoff: [C: +2] Switch titan to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976655 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:36:10] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:41:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: titan
[08:44:58] (PS1) Filippo Giunchedi: Revert "centrallog: update tls_netstream_driver to use ossl" [puppet] - https://gerrit.wikimedia.org/r/976656 (https://phabricator.wikimedia.org/T351710)
[08:46:01] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:46:28] (CR) Thiemo Kreuz (WMDE): [C: +1] Remove BetaFeature code related to ReferencePreviews [mediawiki-config] - https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: WMDE-Fisch)
[08:49:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53699 and previous config saved to /var/cache/conftool/dbconfig/20231122-084943-root.json
[08:50:48] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/628/con" [puppet] - https://gerrit.wikimedia.org/r/976656 (https://phabricator.wikimedia.org/T351710) (owner: Filippo Giunchedi)
[08:54:31] (CR) Filippo Giunchedi: Revert "centrallog: update tls_netstream_driver to use ossl" [puppet] - https://gerrit.wikimedia.org/r/976656 (https://phabricator.wikimedia.org/T351710) (owner: Filippo Giunchedi)
[08:56:38] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/976656 (https://phabricator.wikimedia.org/T351710) (owner: Filippo Giunchedi)
[08:57:15] (CR) Filippo Giunchedi: [C: +2] Revert "centrallog: update tls_netstream_driver to use ossl" [puppet] - https://gerrit.wikimedia.org/r/976656 (https://phabricator.wikimedia.org/T351710) (owner: Filippo Giunchedi)
[08:58:36] (CR) MVernon: [C: +2] hiera: move two more swift frontends to envoy [puppet] - https://gerrit.wikimedia.org/r/976229 (https://phabricator.wikimedia.org/T317616) (owner: MVernon)
[08:59:03] !log depool ms-fe2013 to reimage with new envoy TLS setup T317616
[08:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:08] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[08:59:12] !log depool ms-fe1013 to reimage with new envoy TLS setup T317616
[08:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:36] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1013.eqiad.wmnet with OS bullseye
[09:00:52] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2013.codfw.wmnet with OS bullseye
[09:01:03] SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2013.codfw.wmnet with OS bullseye
[09:01:54] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: gerrit
[09:03:15] (PS1) Muehlenhoff: Switch gerrit to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976657 (https://phabricator.wikimedia.org/T349619)
[09:04:57] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes2041.codfw.wmnet
[09:04:58] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2041.codfw.wmnet
[09:05:18] PROBLEM - Check systemd state on kubernetes2041 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:16] (CR) Muehlenhoff: [C: +2] Switch gerrit to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976657 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[09:06:35] (PS1) Elukey: role::ml_cache::storage: remove override [puppet] - https://gerrit.wikimedia.org/r/976658
[09:06:40] RECOVERY - Check systemd state on kubernetes2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:07:29] (CR) Brouberol: [V: +1] "This change is ready for review." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol)
[09:08:11] (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/633/con" [puppet] - https://gerrit.wikimedia.org/r/976658 (owner: Elukey)
[09:09:23] SRE, ops-codfw, Prod-Kubernetes, serviceops: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 (Clement_Goubert) Everything looks good, back in the cluster it goes. ` 09:04 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernete...
[09:09:27] SRE, ops-codfw, Prod-Kubernetes, serviceops: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 (Clement_Goubert) Open→Resolved
[09:09:36] (PS2) Elukey: role::ml_cache::storage: remove override [puppet] - https://gerrit.wikimedia.org/r/976658
[09:10:45] (CR) Elukey: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/634/con" [puppet] - https://gerrit.wikimedia.org/r/976658 (owner: Elukey)
[09:10:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: gerrit
[09:10:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53700 and previous config saved to /var/cache/conftool/dbconfig/20231122-091056-arnaudb.json
[09:11:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53701 and previous config saved to /var/cache/conftool/dbconfig/20231122-091104-arnaudb.json
[09:12:00] (CR) Brouberol: Export the replication factor of kafka topics as a prometheus metric (4 comments) [puppet] - https://gerrit.wikimedia.org/r/975291 (https://phabricator.wikimedia.org/T346887) (owner: Brouberol)
[09:13:00] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[09:14:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2013.codfw.wmnet with reason: host reimage
[09:17:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2013.codfw.wmnet with reason: host reimage
[09:17:34] (PS3) Elukey: role::ml_cache::storage: remove override [puppet] - https://gerrit.wikimedia.org/r/976658
[09:17:37] (PS1) Elukey: profile::base::certificates: rename Puppet's CA file [puppet] - https://gerrit.wikimedia.org/r/976659
[09:21:20] (PS1) Elukey: role::kafka::main: move to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976660 (https://phabricator.wikimedia.org/T349619)
[09:24:00] (PS1) Giuseppe Lavagetto: update-production-images: fix docker-pkg invokation [puppet] - https://gerrit.wikimedia.org/r/976661
[09:25:31] SRE, Traffic, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez)
[09:26:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 20%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53702 and previous config saved to /var/cache/conftool/dbconfig/20231122-092601-arnaudb.json
[09:26:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53703 and previous config saved to /var/cache/conftool/dbconfig/20231122-092609-arnaudb.json
[09:27:05] (CR) Vgutierrez: [C: +1] pybal: do not install from component [puppet] - https://gerrit.wikimedia.org/r/976312 (https://phabricator.wikimedia.org/T348837) (owner: Ssingh)
[09:27:45] (CR) Vgutierrez: [C: +2] pybal,wmflib::service: Add ipip_encapsulation flag on lvs [puppet] - https://gerrit.wikimedia.org/r/974623 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[09:30:10] ^^ that's gonna trigger some pybal config alerts, totally expected
[09:30:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2013.codfw.wmnet with OS bullseye
[09:30:36] SRE, SRE-swift-storage, Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by
mvernon@cumin2002 for host ms-fe2013.codfw.wmnet with OS bullseye completed: - ms-fe2013 (**PASS**) - Downtimed on Ici... [09:31:11] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/976660 (https://phabricator.wikimedia.org/T349619) (owner: 10Elukey) [09:34:00] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1013.eqiad.wmnet with reason: host reimage [09:34:17] !log elukey@cumin1001 START - Cookbook sre.puppet.migrate-role for role: kafka::main [09:34:43] (03CR) 10Elukey: [C: 03+2] role::kafka::main: move to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976660 (https://phabricator.wikimedia.org/T349619) (owner: 10Elukey) [09:35:45] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/976659 (owner: 10Elukey) [09:35:52] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host wcqs2001.codfw.wmnet [09:36:47] (03PS1) 10Giuseppe Lavagetto: weekly-update: skip spark images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/976663 [09:36:54] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1013.eqiad.wmnet with reason: host reimage [09:39:24] (03PS1) 10Muehlenhoff: Switch wcqs2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976664 (https://phabricator.wikimedia.org/T349619) [09:40:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kafka::main [09:41:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53704 and previous config saved to /var/cache/conftool/dbconfig/20231122-094106-arnaudb.json [09:41:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53705 and previous config saved to /var/cache/conftool/dbconfig/20231122-094114-arnaudb.json [09:43:21] 
10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10elukey) [09:44:03] (03CR) 10Muehlenhoff: [C: 03+2] Switch wcqs2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976664 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:46:17] (03CR) 10Elukey: [C: 03+2] profile::base::certificates: rename Puppet's CA file [puppet] - 10https://gerrit.wikimedia.org/r/976659 (owner: 10Elukey) [09:47:13] !log Update of the profile::base::certificate's CA bundle fleet wide (https://gerrit.wikimedia.org/r/c/operations/puppet/+/976659) [09:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:42] (03PS7) 10Vgutierrez: profile: Provide a lvs::realserver::ipip profile [puppet] - 10https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069) [09:48:49] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] interface: Allow creating IPIP interfaces w/o an endpoint [puppet] - 10https://gerrit.wikimedia.org/r/975253 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [09:49:11] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1013.eqiad.wmnet with OS bullseye [09:49:18] 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye completed: - ms-fe1013 (**PASS**) - Downtimed on Ici... [09:49:55] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] interface: Add a clsact helper [puppet] - 10https://gerrit.wikimedia.org/r/975324 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [09:50:42] elukey: we don't have a task for that not scary at all change? 
[09:51:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host wcqs2001.codfw.wmnet [09:51:56] vgutierrez: it is a follow up after some work that John did (upgrade wmf-certificates), I think it is part of the puppet 7's migration. Since the crt content is the same no change is triggered, but I logged it for awareness [09:52:57] (03PS4) 10Elukey: role::ml_cache::storage: remove override [puppet] - 10https://gerrit.wikimedia.org/r/976658 [09:53:27] !log rolling restart of pybal to catch up on a NOOP config update - T351069 [09:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:32] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [09:53:52] !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs (T351069) [09:56:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 40%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53706 and previous config saved to /var/cache/conftool/dbconfig/20231122-095611-arnaudb.json [09:56:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53707 and previous config saved to /var/cache/conftool/dbconfig/20231122-095619-arnaudb.json [09:56:34] (03CR) 10Elukey: [C: 03+2] role::ml_cache::storage: remove override [puppet] - 10https://gerrit.wikimedia.org/r/976658 (owner: 10Elukey) [09:59:33] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:00] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: swift::storage [10:05:07] (03PS1) 10Muehlenhoff: Switch swift::storage to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976665 (https://phabricator.wikimedia.org/T349619) 
[10:07:00] (03CR) 10Muehlenhoff: [C: 03+2] Switch swift::storage to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976665 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:07:29] !log elukey@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Roll restart after change in the CA bundle - elukey@cumin1001 [10:11:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 50%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53708 and previous config saved to /var/cache/conftool/dbconfig/20231122-101116-arnaudb.json [10:11:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 50%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53709 and previous config saved to /var/cache/conftool/dbconfig/20231122-101124-arnaudb.json [10:14:08] PROBLEM - Check systemd state on ganeti1025 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:18] (03PS4) 10Zoranzoki21: Deleting Ns:104 in itwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952817 (https://phabricator.wikimedia.org/T298315) (owner: 10Caenus) [10:21:01] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: switch 15% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976218 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [10:21:46] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs (T351069) [10:21:48] (03CR) 10JMeybohm: [C: 03+1] weekly-update: skip spark images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/976663 (owner: 10Giuseppe Lavagetto) [10:21:51] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [10:22:11] jouncebot: nowandnext [10:22:11] No deployments scheduled for 
the next 0 hour(s) and 37 minute(s) [10:22:11] In 0 hour(s) and 37 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1100) [10:23:34] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@0cca675] (releasing): (no justification provided) [10:24:15] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@0cca675] (releasing): (no justification provided) (duration: 00m 40s) [10:25:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Roll restart after change in the CA bundle - elukey@cumin1001 [10:25:16] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] weekly-update: skip spark images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/976663 (owner: 10Giuseppe Lavagetto) [10:25:36] !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:25:39] (03PS8) 10Vgutierrez: profile: Provide a lvs::realserver::ipip profile [puppet] - 10https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069) [10:25:42] !log elukey@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Roll restart after change in the CA bundle - elukey@cumin1001 [10:25:48] (03PS3) 10Vgutierrez: ncredir: Enable IPIP encapsulation on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069) [10:25:58] !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:26:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53710 and previous config saved to /var/cache/conftool/dbconfig/20231122-102621-arnaudb.json [10:26:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53711 and previous 
config saved to /var/cache/conftool/dbconfig/20231122-102629-arnaudb.json [10:26:52] !log repool ms-fe1013 with new envoy TLS setup T317616 [10:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:58] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 [10:27:28] (03Abandoned) 10Clément Goubert: Revert "mw-on-k8s: Redirect vrt-wiki.wikimedia.org to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/933475 (owner: 10Clément Goubert) [10:27:32] (03Abandoned) 10Clément Goubert: Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/933470 (owner: 10Clément Goubert) [10:27:35] !log repool ms-fe2013 with new envoy TLS setup T317616 [10:27:36] (03Abandoned) 10Clément Goubert: Revert "mw-on-k8s: Redirect office.wikimedia.org to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/933468 (owner: 10Clément Goubert) [10:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:02] (03Abandoned) 10Clément Goubert: mw-api-int: Raise number of replicas to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/941900 (https://phabricator.wikimedia.org/T342252) (owner: 10Clément Goubert) [10:28:37] (03Abandoned) 10Clément Goubert: mw-on-k8s: Revert sending traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/935673 (https://phabricator.wikimedia.org/T341078) (owner: 10Clément Goubert) [10:30:51] 10SRE, 10Cloud-VPS, 10observability, 10Patch-For-Review: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi) I tested a revert to `gtls` for centrallog hosts (the receiver part only), rsyslog now stays silent on centrallog though I still see the (re) connections fr... 
[10:31:10] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [10:32:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: swift::storage [10:33:02] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [10:33:03] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [10:33:35] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [10:34:32] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [10:37:55] (03CR) 10Clément Goubert: sre.discovery.service-route: customize lock args (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [10:38:58] (03CR) 10Clément Goubert: sre.discovery.datacenter: customize lock arguments (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [10:40:01] (03CR) 10JMeybohm: [C: 04-1] "I think you still need to overwrite "command" with an empty value in values.yaml in order to actually use the entrypoint" [deployment-charts] - 10https://gerrit.wikimedia.org/r/976206 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan) [10:40:11] (03PS9) 10Vgutierrez: profile: Provide a lvs::realserver::ipip profile [puppet] - 10https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069) [10:40:33] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [10:40:40] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [10:41:18] (03PS4) 10Vgutierrez: ncredir: Enable IPIP encapsulation on ulsfo 
[puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069) [10:41:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 70%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53712 and previous config saved to /var/cache/conftool/dbconfig/20231122-104126-arnaudb.json [10:41:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 70%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53713 and previous config saved to /var/cache/conftool/dbconfig/20231122-104134-arnaudb.json [10:42:32] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/636/con" [puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:43:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Roll restart after change in the CA bundle - elukey@cumin1001 [10:45:16] (03CR) 10Btullis: [C: 03+1] "The logic looks good to me. Thanks." [cookbooks] - 10https://gerrit.wikimedia.org/r/976163 (owner: 10Volans) [10:46:44] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host an-db1002.eqiad.wmnet [10:48:22] (03PS1) 10Muehlenhoff: Switch an-db1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976667 (https://phabricator.wikimedia.org/T349619) [10:50:41] (03CR) 10Muehlenhoff: [C: 03+2] Switch an-db1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976667 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:52:34] (03CR) 10Fabfur: [C: 03+1] "seems good!" [puppet] - 10https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:52:52] (03CR) 10Fabfur: [C: 03+1] "looks ok, good job!" 
[puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:53:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] update-production-images: fix docker-pkg invokation [puppet] - 10https://gerrit.wikimedia.org/r/976661 (owner: 10Giuseppe Lavagetto) [10:54:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host an-db1002.eqiad.wmnet [10:54:57] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service, kubernetes: mw-jobrunner fixes [puppet] - 10https://gerrit.wikimedia.org/r/976191 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [10:55:06] (03CR) 10Hnowlan: [C: 03+2] service, kubernetes: mw-jobrunner fixes [puppet] - 10https://gerrit.wikimedia.org/r/976191 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [10:55:51] <_joe_> hnowlan: merged your change too [10:55:55] thanks [10:56:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 80%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53714 and previous config saved to /var/cache/conftool/dbconfig/20231122-105631-arnaudb.json [10:56:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 80%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53715 and previous config saved to /var/cache/conftool/dbconfig/20231122-105639-arnaudb.json [10:58:56] ACKNOWLEDGEMENT - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-jobrunner_hourly.service Hnowlan Awaiting discovery records being created https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:56] ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly Hnowlan Awaiting discovery records being created https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:59:04] (03PS3) 
10Hnowlan: wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) [10:59:27] (03CR) 10Clément Goubert: [C: 04-1] "jobrunner is active/passive iirc :)" [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:00:06] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1100) [11:01:23] (03PS4) 10Hnowlan: wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) [11:02:14] (03CR) 10Hnowlan: wmnet: add mw-jobrunner discovery record (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:02:26] (03CR) 10CI reject: [V: 04-1] wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:02:33] (03CR) 10Giuseppe Lavagetto: [C: 04-1] wmnet: add mw-jobrunner discovery record (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:03:38] (03PS5) 10Hnowlan: wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) [11:04:48] (03CR) 10Giuseppe Lavagetto: [C: 03+1] wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:06:20] (03CR) 10Hnowlan: [C: 03+2] wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:06:46] (03PS6) 10Hnowlan: wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) [11:09:03] q/19 [11:10:40] 
RECOVERY - Check systemd state on ganeti1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53716 and previous config saved to /var/cache/conftool/dbconfig/20231122-111136-arnaudb.json [11:11:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53717 and previous config saved to /var/cache/conftool/dbconfig/20231122-111144-arnaudb.json [11:13:56] PROBLEM - Check systemd state on titan1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway) [11:16:54] (03PS6) 10Hnowlan: changeprop: add config support for migration to k8s jobrunners [deployment-charts] - 10https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796) [11:16:56] (03CR) 10Jbond: [C: 03+2] docker::reports: change ownership of base rebuild job [puppet] - 10https://gerrit.wikimedia.org/r/976198 (owner: 10Jbond) [11:18:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [11:21:50] (03PS1) 10MVernon: hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976672 (https://phabricator.wikimedia.org/T317616) [11:21:52] (03PS1) 10MVernon: hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976673 (https://phabricator.wikimedia.org/T317616) [11:21:54] (03PS1) 10MVernon: hiera: move two more swift frontends to envoy [puppet] - 
10https://gerrit.wikimedia.org/r/976674 (https://phabricator.wikimedia.org/T317616) [11:21:56] (03PS1) 10MVernon: hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976675 (https://phabricator.wikimedia.org/T317616) [11:21:58] (03PS1) 10MVernon: hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976676 (https://phabricator.wikimedia.org/T317616) [11:22:00] (03PS1) 10MVernon: hiera: move final swift frontend to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976677 (https://phabricator.wikimedia.org/T317616) [11:23:07] I am going to restart Gerrit [11:23:18] 😱 [11:23:23] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:23:36] best case scenario it comes back after couple minutes [11:23:50] worse case scenario a handful of us cancels our plans for the next few days while bring it back up [11:23:52] (kidding) [11:24:11] I'm off to eat in about 15min, so I hope your worst case scenario is gzip-able :D [11:25:46] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976676 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:25:57] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976677 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:26:07] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976672 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:26:19] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976673 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:26:22] (03CR) 10Arnaudb: [V: 03+1 C: 03+1] "I love that moss stayed!" 
[puppet] - 10https://gerrit.wikimedia.org/r/976676 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:26:31] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976674 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:26:33] (03CR) 10Arnaudb: [V: 03+1 C: 03+1] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976675 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:26:37] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976675 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:26:39] (03CR) 10Arnaudb: [V: 03+1 C: 03+1] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976674 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:26:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53718 and previous config saved to /var/cache/conftool/dbconfig/20231122-112641-arnaudb.json [11:26:44] (03CR) 10Arnaudb: [V: 03+1 C: 03+1] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976673 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:26:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53719 and previous config saved to /var/cache/conftool/dbconfig/20231122-112649-arnaudb.json [11:26:49] (03CR) 10Arnaudb: [V: 03+1 C: 03+1] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976672 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:26:57] (03CR) 10Arnaudb: [V: 03+1 C: 03+1] hiera: move final swift frontend to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976677 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:29:02] (03CR) 
10Jbond: [V: 03+1 C: 03+2] prometheus: update to request testing certs from pki [puppet] - 10https://gerrit.wikimedia.org/r/976273 (https://phabricator.wikimedia.org/T351624) (owner: 10Jbond) [11:30:54] (03CR) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [11:31:44] (03PS2) 10Hnowlan: api-gateway: bump envoy image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/976206 (https://phabricator.wikimedia.org/T324130) [11:33:15] (03PS3) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [11:33:42] !log depool ms-fe1012 to reimage with new envoy TLS setup T317616 [11:33:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::postgresql [11:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:48] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 [11:34:05] !log depool ms-fe2012 to reimage with new envoy TLS setup T317616 [11:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:14] (03CR) 10MVernon: [C: 03+2] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976672 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:35:30] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1012.eqiad.wmnet with OS bullseye [11:35:40] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1012.eqiad.wmnet with OS bullseye [11:35:42] !log mvernon@cumin2002 START - Cookbook 
sre.hosts.reimage for host ms-fe2012.codfw.wmnet with OS bullseye [11:35:54] (03PS1) 10Muehlenhoff: Switch analytics_cluster::postgresql to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976678 (https://phabricator.wikimedia.org/T349619) [11:35:54] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2012.codfw.wmnet with OS bullseye [11:36:36] (03PS1) 10Jbond: Revert "prometheus: update to request testing certs from pki" [puppet] - 10https://gerrit.wikimedia.org/r/976574 [11:36:38] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::postgresql to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976678 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:36:54] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "prometheus: update to request testing certs from pki" [puppet] - 10https://gerrit.wikimedia.org/r/976574 (owner: 10Jbond) [11:37:28] (03PS1) 10Jbond: prometheus: update to request testing certs from pki [puppet] - 10https://gerrit.wikimedia.org/r/976575 (https://phabricator.wikimedia.org/T351624) [11:38:20] (03PS1) 10Hashar: gerrit: accept SIGINT as a valid exit code [puppet] - 10https://gerrit.wikimedia.org/r/976679 [11:42:35] (03PS2) 10Jbond: prometheus: update to request testing certs from pki [puppet] - 10https://gerrit.wikimedia.org/r/976575 (https://phabricator.wikimedia.org/T351624) [11:42:58] (03CR) 10Majavah: [C: 03+2] P:wmcs: wikireplicas: allow access from cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [11:43:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::postgresql [11:44:03] (03CR) 10Jbond: "This did not solve the original issue but it didn't seem to break anything either. 
As such i think it still may be worth considering" [puppet] - 10https://gerrit.wikimedia.org/r/976575 (https://phabricator.wikimedia.org/T351624) (owner: 10Jbond) [11:46:04] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:46:23] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10User-fgiunchedi: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10jbond) >>! In T351624#9350064, @jbond wrote: > @fgiunchedi [[ https://g... [11:47:32] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1012.eqiad.wmnet with reason: host reimage [11:50:05] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1012.eqiad.wmnet with reason: host reimage [11:50:10] (03CR) 10Nikerabbit: [C: 03+1] cxserver: Force 127.0.0.1 instead of localhost (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/975740 (https://phabricator.wikimedia.org/T349118) (owner: 10KartikMistry) [11:50:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2012.codfw.wmnet with reason: host reimage [11:52:26] (03CR) 10KartikMistry: cxserver: Force 127.0.0.1 instead of localhost (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/975740 (https://phabricator.wikimedia.org/T349118) (owner: 10KartikMistry) [11:53:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2012.codfw.wmnet with reason: host reimage [11:55:53] (03CR) 10Hashar: "After a `systemctl restart gerrit` the journal mark a failure due to the JVM exiting with code 130:" [puppet] - 10https://gerrit.wikimedia.org/r/976679 (owner: 10Hashar) [11:56:09] !log Restarting Gerrit [11:56:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:51] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir4001 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.com has 86348 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [12:00:59] (03PS1) 10Btullis: Add another public endpoint to our matomo installation [puppet] - 10https://gerrit.wikimedia.org/r/976686 (https://phabricator.wikimedia.org/T349910) [12:01:24] (03PS14) 10Majavah: Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) [12:01:26] (03PS1) 10Majavah: openstack: update wiki replica DNS to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/976688 (https://phabricator.wikimedia.org/T346947) [12:01:56] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move 25% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T348122 (10Clement_Goubert) [12:02:22] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/637/con" [puppet] - 10https://gerrit.wikimedia.org/r/976686 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [12:03:18] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [12:05:07] (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 25% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/976689 (https://phabricator.wikimedia.org/T348122) [12:05:25] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1012.eqiad.wmnet with OS bullseye [12:05:38] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - 
https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1012.eqiad.wmnet with OS bullseye completed: - ms-fe1012 (**PASS**... [12:08:56] (03CR) 10Brouberol: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/974647 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [12:09:41] PROBLEM - Check systemd state on an-worker1129 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:57] (03CR) 10Brouberol: [C: 03+1] "Looks good! Should we remove all oozie-related jobs from refinery as well?" [puppet] - 10https://gerrit.wikimedia.org/r/974649 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [12:09:59] (03CR) 10Btullis: [C: 03+1] "Looks good to me. Sincere thanks for all of your work on this." [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [12:10:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2012.codfw.wmnet with OS bullseye [12:10:21] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2012.codfw.wmnet with OS bullseye completed: - ms-fe2012 (**PASS**... 
[12:10:57] RECOVERY - Check systemd state on titan1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:00] (03CR) 10Majavah: [V: 03+1 C: 03+2] Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [12:11:23] (03PS1) 10Jbond: puppetserver - wmcs: add post-merge hook to WMCS puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/976690 [12:11:49] (03CR) 10Brouberol: set druid hosts to use the reuse partman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976385 (https://phabricator.wikimedia.org/T332589) (owner: 10Stevemunene) [12:12:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/976690 (owner: 10Jbond) [12:14:38] (03CR) 10Stevemunene: set druid hosts to use the reuse partman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976385 (https://phabricator.wikimedia.org/T332589) (owner: 10Stevemunene) [12:15:08] !log repool ms-fe2012 with new envoy TLS setup T317616 [12:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:17] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 [12:15:30] (03PS1) 10Hnowlan: mw-jobrunner: add vhost for jobrunner.discovery.wmnet [deployment-charts] - 10https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796) [12:15:33] (03CR) 10Jbond: [V: 03+1] "@jesses this relates to the git-sync-upstream in wmcs which" [puppet] - 10https://gerrit.wikimedia.org/r/976690 (owner: 10Jbond) [12:15:52] !log repool ms-fe1012 with new envoy TLS setup T317616 [12:15:52] !log taavi@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudlb1002.eqiad.wmnet [12:15:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:18:51] 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [12:19:15] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:19:25] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:19:42] !log brouberol@cumin1001 START - Cookbook sre.hosts.reimage for host an-druid1001.eqiad.wmnet with OS bullseye [12:20:48] (03PS2) 10Hnowlan: mw-jobrunner: add vhost for jobrunner.discovery.wmnet [deployment-charts] - 10https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796) [12:21:55] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:22:09] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:23:08] (03CR) 10JMeybohm: [C: 04-1] mw-jobrunner: add vhost for jobrunner.discovery.wmnet (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:24:39] (03PS8) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) [12:24:41] (03PS17) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [12:24:41] PROBLEM - 
mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:24:43] (03PS13) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [12:24:45] (03PS16) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [12:24:47] (03PS16) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [12:24:49] (03PS2) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) [12:25:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:25:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:26:19] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:26:41] (03PS42) 10Jbond: prometheus: Add a default rsyslog destination for all sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [12:27:19] !log taavi@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudlb1002.eqiad.wmnet [12:27:45] (03CR) 10Btullis: [V: 03+2 C: 03+2] spark: add support for spark-history on the spark image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896363 (https://phabricator.wikimedia.org/T330176) (owner: 10Nicolas Fraison) [12:29:54] (03PS3) 10Hnowlan: mw-jobrunner: 
add vhost for jobrunner.discovery.wmnet [deployment-charts] - 10https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796) [12:30:30] (03CR) 10Hnowlan: mw-jobrunner: add vhost for jobrunner.discovery.wmnet (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:30:40] (03PS9) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) [12:30:42] (03PS18) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [12:30:44] (03PS14) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [12:30:46] (03PS17) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [12:30:48] (03PS17) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [12:30:50] (03PS3) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) [12:31:50] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc-gp1001.eqiad.wmnet [12:32:19] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1001.eqiad.wmnet with reason: host reimage [12:32:44] (03PS1) 10Muehlenhoff: Switch mc-gp1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976694 (https://phabricator.wikimedia.org/T349619) [12:35:23] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1001.eqiad.wmnet with reason: host reimage [12:35:37] (03CR) 
10Muehlenhoff: [C: 03+2] Switch mc-gp1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976694 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:37:09] (03Abandoned) 10Clément Goubert: mw-api-ext, mw-web: raise replicas for traffic bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/961341 (https://phabricator.wikimedia.org/T346422) (owner: 10Clément Goubert) [12:38:26] (03PS2) 10Clément Goubert: trafficserver: move 25% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/964449 (https://phabricator.wikimedia.org/T348122) [12:38:51] (03PS1) 10Majavah: cloudlb: wikireplicas: fix frontend filename [puppet] - 10https://gerrit.wikimedia.org/r/976695 [12:38:53] (03PS1) 10Majavah: cloudlb: wikireplicas: fix go template syntax [puppet] - 10https://gerrit.wikimedia.org/r/976696 [12:39:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc-gp1001.eqiad.wmnet [12:40:27] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/976696 (owner: 10Majavah) [12:41:07] (03CR) 10Majavah: [C: 03+2] cloudlb: wikireplicas: fix frontend filename [puppet] - 10https://gerrit.wikimedia.org/r/976695 (owner: 10Majavah) [12:41:30] (03CR) 10Majavah: [V: 03+1 C: 03+2] cloudlb: wikireplicas: fix go template syntax [puppet] - 10https://gerrit.wikimedia.org/r/976696 (owner: 10Majavah) [12:42:41] (03PS19) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [12:42:43] (03PS15) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [12:42:45] (03PS18) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 
10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [12:42:47] (03PS18) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [12:42:49] (03PS4) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) [12:43:29] (03Abandoned) 10JMeybohm: api-gateway: specify config path [deployment-charts] - 10https://gerrit.wikimedia.org/r/974609 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan) [12:43:55] (03CR) 10JMeybohm: [C: 03+1] api-gateway: bump envoy image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/976206 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan) [12:44:44] (03CR) 10MVernon: [C: 03+2] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976673 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [12:44:58] (03CR) 10JMeybohm: [C: 03+2] Move mw appservers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/975228 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm) [12:45:02] (03CR) 10CI reject: [V: 04-1] profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:45:32] !log depool ms-fe2011 to reimage with new envoy TLS setup T317616 [12:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:36] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 [12:45:41] !log depool ms-fe1011 to reimage with new envoy TLS setup T317616 [12:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:46] (03CR) 10CI reject: [V: 04-1] profile::syslog::remote: create variables for cert and key [puppet] - 
10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:46:03] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10User-fgiunchedi: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10jbond) I have refreshed [[ https://gerrit.wikimedia.org/r/c/operations/... [12:46:09] (03CR) 10CI reject: [V: 04-1] syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:46:21] jayme: LMK when you're done running puppet-merge? [12:46:35] Emperor: done [12:46:38] ta [12:46:45] (03CR) 10CI reject: [V: 04-1] profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:47:34] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1011.eqiad.wmnet with OS bullseye [12:47:44] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1011.eqiad.wmnet with OS bullseye [12:47:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2011.codfw.wmnet with OS bullseye [12:48:00] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2011.codfw.wmnet with OS bullseye [12:48:06] (03PS43) 10Jbond: prometheus: Add a default rsyslog destination for all sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [12:48:10] (03PS10) 10Jbond: 
syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) [12:48:14] !log taavi@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudlb1001.eqiad.wmnet [12:48:14] (03PS20) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [12:48:22] (03PS16) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [12:48:26] (03PS19) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [12:48:30] (03PS19) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [12:48:34] (03PS5) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) [12:49:30] !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw2420.codfw.wmnet [12:50:50] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host logstash2001.codfw.wmnet [12:51:12] (03CR) 10CI reject: [V: 04-1] profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:51:15] (03CR) 10CI reject: [V: 04-1] profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:51:49] !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw2421.codfw.wmnet [12:52:26] !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw2425.codfw.wmnet [12:52:34] !log 
jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw2431.codfw.wmnet [12:52:43] (03CR) 10CI reject: [V: 04-1] syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:53:03] (03PS1) 10Muehlenhoff: Switch logstash2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976699 (https://phabricator.wikimedia.org/T349619) [12:53:05] (03CR) 10CI reject: [V: 04-1] profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:53:33] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1001.eqiad.wmnet with OS bullseye [12:54:36] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-web, mw-api-ext: Raise replicas for 25% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/976689 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [12:54:37] !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw1472.eqiad.wmnet [12:54:38] (03PS44) 10Jbond: prometheus: Add a default rsyslog destination for all sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [12:54:40] (03PS11) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) [12:54:42] (03PS21) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [12:54:44] (03PS17) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [12:54:46] (03PS20) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 
10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [12:54:46] !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw1473.eqiad.wmnet [12:54:48] (03PS20) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [12:54:50] (03PS6) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) [12:54:58] (03CR) 10Clément Goubert: [C: 03+2] mw-web, mw-api-ext: Raise replicas for 25% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/976689 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [12:55:25] !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw1475.eqiad.wmnet [12:55:39] (03CR) 10CI reject: [V: 04-1] profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:55:46] !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw1474.eqiad.wmnet [12:56:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:56:09] (03CR) 10Muehlenhoff: [C: 03+2] Switch logstash2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976699 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:57:17] (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 25% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/976689 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [12:57:34] (03PS22) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 
10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [12:57:36] (03PS18) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [12:57:38] (03PS21) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [12:57:40] (03PS21) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [12:57:42] (03PS7) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) [12:58:01] (03PS1) 10Btullis: airflow: change max_active_runs_per_dag back to 1 [puppet] - 10https://gerrit.wikimedia.org/r/976700 (https://phabricator.wikimedia.org/T351388) [12:58:33] !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw2420.codfw.wmnet [12:59:05] !log Raising mw-web and mw-api-ext replicas for traffic bump - T348122 [12:59:08] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1001.eqiad.wmnet [12:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:10] T348122: Move 25% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T348122 [12:59:17] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:59:31] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1011.eqiad.wmnet with reason: host reimage [12:59:39] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:59:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/644/console" [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [13:00:28] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/645/con" [puppet] - 10https://gerrit.wikimedia.org/r/976700 (https://phabricator.wikimedia.org/T351388) (owner: 10Btullis) [13:00:29] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [13:00:36] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [13:00:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host logstash2001.codfw.wmnet [13:01:01] (03CR) 10CI reject: [V: 04-1] profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [13:01:23] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [13:01:31] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [13:01:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:01:42] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [13:01:46] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host logstash2023.codfw.wmnet [13:01:48] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [13:01:57] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [13:02:12] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2011.codfw.wmnet with reason: host reimage [13:02:30] !log mvernon@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1011.eqiad.wmnet with reason: host reimage [13:02:54] (03PS1) 10Muehlenhoff: Switch logstash2023 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976701 (https://phabricator.wikimedia.org/T349619) [13:05:04] (03PS45) 10Jbond: prometheus: Add a default rsyslog destination for all sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [13:05:06] (03PS12) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) [13:05:08] (03PS23) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [13:05:10] (03PS19) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [13:05:12] (03PS22) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [13:05:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2011.codfw.wmnet with reason: host reimage [13:05:14] (03PS22) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [13:05:16] (03PS8) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) [13:07:13] (03CR) 10Muehlenhoff: [C: 03+2] Switch logstash2023 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976701 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:12:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host logstash2023.codfw.wmnet 
[13:13:30] PROBLEM - Check systemd state on kubernetes2017 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:42] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2017 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:17:01] (03CR) 10Majavah: [C: 03+1] puppetserver - wmcs: add post-merge hook to WMCS puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/976690 (owner: 10Jbond) [13:17:25] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1011.eqiad.wmnet with OS bullseye [13:17:35] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1011.eqiad.wmnet with OS bullseye completed: - ms-fe1011 (**PASS**... [13:19:09] (03PS1) 10Brouberol: Define the spark-history/spark-history-test k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713) [13:21:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2011.codfw.wmnet with OS bullseye [13:21:33] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2011.codfw.wmnet with OS bullseye completed: - ms-fe2011 (**PASS**... 
[13:22:51] !log repool ms-fe1011 with new envoy TLS setup T317616 [13:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:55] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 [13:23:44] !log repool ms-fe2011 with new envoy TLS setup T317616 [13:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:01] (03PS2) 10Brouberol: Define the spark-history/spark-history-test k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713) [13:24:20] (03PS1) 10JMeybohm: sre.hosts.reimage: Allow to skip puppet migration [cookbooks] - 10https://gerrit.wikimedia.org/r/976732 [13:24:36] (03PS2) 10JMeybohm: sre.hosts.reimage: Allow to skip puppet migration [cookbooks] - 10https://gerrit.wikimedia.org/r/976732 (https://phabricator.wikimedia.org/T351074) [13:25:08] RECOVERY - Check systemd state on an-worker1129 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:28] (03PS3) 10Brouberol: Define the spark-history/spark-history-test k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713) [13:27:08] (03CR) 10MVernon: [C: 03+2] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976674 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [13:27:42] !log depool ms-fe1010 to reimage with new envoy TLS setup T317616 [13:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:53] !log depool ms-fe2010 to reimage with new envoy TLS setup T317616 [13:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:05] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 [13:28:38] PROBLEM - Check systemd state on ganeti1032 is CRITICAL: CRITICAL - 
degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:21] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1010.eqiad.wmnet with OS bullseye [13:29:33] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1010.eqiad.wmnet with OS bullseye [13:29:36] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2010.codfw.wmnet with OS bullseye [13:29:47] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2010.codfw.wmnet with OS bullseye [13:30:40] (03PS1) 10Brouberol: Setup kubeconfigs for spark-history/spark-history-test on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976733 (https://phabricator.wikimedia.org/T351711) [13:37:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T348183)', diff saved to https://phabricator.wikimedia.org/P53722 and previous config saved to /var/cache/conftool/dbconfig/20231122-133741-arnaudb.json [13:37:47] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [13:38:46] PROBLEM - Check systemd state on ml-serve2006 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:22] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1010.eqiad.wmnet with reason: host reimage [13:42:47] (03PS1) 10Majavah: interface: new define for managing routing table names [puppet] - 
10https://gerrit.wikimedia.org/r/976734 [13:42:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host logstash2023.codfw.wmnet [13:43:36] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2010.codfw.wmnet with reason: host reimage [13:44:06] (03PS1) 10Majavah: P:etcd: generate wiki replica pool accounts [puppet] - 10https://gerrit.wikimedia.org/r/976735 (https://phabricator.wikimedia.org/T346947) [13:44:06] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1010.eqiad.wmnet with reason: host reimage [13:44:09] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/646/con" [puppet] - 10https://gerrit.wikimedia.org/r/976734 (owner: 10Majavah) [13:45:03] (03CR) 10CI reject: [V: 04-1] interface: new define for managing routing table names [puppet] - 10https://gerrit.wikimedia.org/r/976734 (owner: 10Majavah) [13:45:25] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/647/con" [puppet] - 10https://gerrit.wikimedia.org/r/976735 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [13:46:06] (03PS5) 10Volans: puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [13:46:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/976686 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [13:47:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2010.codfw.wmnet with reason: host reimage [13:47:09] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add another public endpoint to our matomo installation [puppet] - 10https://gerrit.wikimedia.org/r/976686 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) 
[13:47:12] (CR) Jbond: [V: +1 C: +2] puppetserver - wmcs: add post-merge hook to WMCS puppetserver [puppet] - https://gerrit.wikimedia.org/r/976690 (owner: Jbond) [13:47:14] !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw2421.codfw.wmnet [13:47:15] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/648/con" [puppet] - https://gerrit.wikimedia.org/r/976735 (https://phabricator.wikimedia.org/T346947) (owner: Majavah) [13:47:19] !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw2431.codfw.wmnet [13:47:21] !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw2425.codfw.wmnet [13:47:24] !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw1473.eqiad.wmnet [13:47:26] !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw1472.eqiad.wmnet [13:47:53] !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw1474.eqiad.wmnet [13:47:56] !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw1475.eqiad.wmnet [13:48:09] btullis: happy for me to merge your cr [13:48:23] Yes please.
[13:48:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2023.codfw.wmnet [13:49:08] (PS2) Majavah: interface: new define for managing routing table names [puppet] - https://gerrit.wikimedia.org/r/976734 [13:50:28] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/649/con" [puppet] - https://gerrit.wikimedia.org/r/976734 (owner: Majavah) [13:51:25] (CR) CI reject: [V: -1] interface: new define for managing routing table names [puppet] - https://gerrit.wikimedia.org/r/976734 (owner: Majavah) [13:52:06] (CR) Jbond: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: Jbond) [13:52:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host logstash2001.codfw.wmnet [13:52:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P53723 and previous config saved to /var/cache/conftool/dbconfig/20231122-135248-arnaudb.json [13:53:12] (CR) Jbond: [C: +2] puppet: add hiera_lookup function [software/spicerack] - https://gerrit.wikimedia.org/r/972459 (owner: Jbond) [13:54:11] (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/650/con" [puppet] - https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) (owner: Jbond) [13:56:34] (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/651/con" [puppet] - https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) (owner: Jbond) [13:56:49] (CR) BBlack: "On the topic of ferm::service changes: IMHO, this isn't the place
to do those refactors/upgrades of the existing ferm puppetization. That" [puppet] - https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh) [13:57:55] (CR) Jbond: "I think we probably want to do this fleet wide. or consider if we want to for some time have one central log with the new certs and one " [puppet] - https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) (owner: Jbond) [13:57:58] (PS1) Majavah: cloudlb: explicitely bind openstack mysql to ip [puppet] - https://gerrit.wikimedia.org/r/976736 [13:58:58] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1010.eqiad.wmnet with OS bullseye [13:59:07] (PS1) Vgutierrez: lvs: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers [puppet] - https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) [13:59:08] SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1010.eqiad.wmnet with OS bullseye completed: - ms-fe1010 (**PASS**...
[13:59:17] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/976736 (owner: 10Majavah) [14:00:00] (03PS1) 10Filippo Giunchedi: pontoon: don't use srv records [puppet] - 10https://gerrit.wikimedia.org/r/976738 [14:00:02] (03PS1) 10Filippo Giunchedi: pontoon: add pontoon log bullseye [puppet] - 10https://gerrit.wikimedia.org/r/976739 [14:00:05] (03PS1) 10Filippo Giunchedi: varnishkafka: move to rsyslog::conf [puppet] - 10https://gerrit.wikimedia.org/r/976740 (https://phabricator.wikimedia.org/T351799) [14:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1400). [14:00:06] No Gerrit patches in the queue for this window AFAICS. [14:00:07] (03PS1) 10Filippo Giunchedi: rsyslog: support alternative base in ::conf [puppet] - 10https://gerrit.wikimedia.org/r/976741 (https://phabricator.wikimedia.org/T351799) [14:00:09] (03PS1) 10Filippo Giunchedi: rsyslog_exporter: move to a define [puppet] - 10https://gerrit.wikimedia.org/r/976742 (https://phabricator.wikimedia.org/T351799) [14:00:11] (03PS1) 10Filippo Giunchedi: WIP separate receiver rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/976743 (https://phabricator.wikimedia.org/T351799) [14:00:12] (03PS1) 10Filippo Giunchedi: prometheus: fetch from rsyslog-receiver exporter [puppet] - 10https://gerrit.wikimedia.org/r/976744 (https://phabricator.wikimedia.org/T351799) [14:00:28] standing by for jenkins -1s [14:00:54] (03CR) 10Volans: [C: 04-1] "I'm not sure it works as-is, that's also inherited by PuppetMaster so it should work there too or we should override it and raise" [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [14:01:24] (03CR) 
10Btullis: [C: 03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713) (owner: 10Brouberol) [14:01:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2001.codfw.wmnet [14:02:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2010.codfw.wmnet with OS bullseye [14:03:13] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2010.codfw.wmnet with OS bullseye completed: - ms-fe2010 (**PASS**... [14:06:22] (03PS1) 10Jbond: puppet:agent: change error to warning [puppet] - 10https://gerrit.wikimedia.org/r/976745 [14:06:59] (03CR) 10Filippo Giunchedi: [C: 03+1] syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [14:07:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P53724 and previous config saved to /var/cache/conftool/dbconfig/20231122-140754-arnaudb.json [14:08:05] (03CR) 10Filippo Giunchedi: "See inline, idea LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/976575 (https://phabricator.wikimedia.org/T351624) (owner: 10Jbond) [14:08:45] (03CR) 10Muehlenhoff: "Looks good, one typo inline" [puppet] - 10https://gerrit.wikimedia.org/r/976745 (owner: 10Jbond) [14:09:27] (03CR) 10Btullis: "You could add a PCC run for `Hosts: P:kubernetes::deployment_server or similar." 
[puppet] - 10https://gerrit.wikimedia.org/r/976733 (https://phabricator.wikimedia.org/T351711) (owner: 10Brouberol) [14:09:45] (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [14:09:59] (PuppetZeroResources) firing: Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:10:11] (03PS2) 10Brouberol: Setup kubeconfigs for spark-history/spark-history-test on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976733 (https://phabricator.wikimedia.org/T351711) [14:10:19] (03PS2) 10Jbond: puppet:agent: change error to warning [puppet] - 10https://gerrit.wikimedia.org/r/976745 [14:10:30] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976733 (https://phabricator.wikimedia.org/T351711) (owner: 10Brouberol) [14:11:11] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [14:11:40] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [14:11:57] (03CR) 10Hnowlan: [C: 03+2] api-gateway: bump envoy image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/976206 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan) [14:12:03] (03CR) 10Filippo Giunchedi: [C: 03+1] sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [14:12:20] (03CR) 10Btullis: [C: 03+1] "Looks good to me in principle, but I haven't ever touched this code before." 
[puppet] - 10https://gerrit.wikimedia.org/r/976735 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [14:12:22] (03CR) 10Volans: [C: 04-1] puppet: add hiera_lookup function (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [14:12:32] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host sessionstore2001.codfw.wmnet [14:12:53] (03Merged) 10jenkins-bot: api-gateway: bump envoy image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/976206 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan) [14:14:16] !log repool ms-fe1010 with new envoy TLS setup T317616 [14:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:23] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 [14:14:35] (03PS3) 10Jbond: puppet:agent: change error to warning [puppet] - 10https://gerrit.wikimedia.org/r/976745 [14:14:52] !log repool ms-fe2010 with new envoy TLS setup T317616 [14:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:19] (03CR) 10MVernon: [C: 03+2] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976675 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [14:18:46] (03PS1) 10Muehlenhoff: Switch sessionstore2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976747 (https://phabricator.wikimedia.org/T349619) [14:19:03] !log depool ms-fe1009 to reimage with new envoy TLS setup T317616 [14:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:12] !log depool ms-fe2009 to reimage with new envoy TLS setup T317616 [14:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:32] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2001.codfw.wmnet [14:19:32] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 
aqs2001.codfw.wmnet [14:19:51] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2001.codfw.wmnet with OS bullseye [14:20:10] PROBLEM - Check systemd state on an-worker1087 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:28] PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:20:39] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1009.eqiad.wmnet with OS bullseye [14:20:46] (03CR) 10Muehlenhoff: [C: 03+2] Switch sessionstore2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976747 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:20:49] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1009.eqiad.wmnet with OS bullseye [14:21:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2009.codfw.wmnet with OS bullseye [14:21:13] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2009.codfw.wmnet with OS bullseye [14:21:14] PROBLEM - Hadoop NodeManager on an-worker1106 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:21:34] (03CR) 10Jbond: [C: 03+2] puppet:agent: change error to 
warning [puppet] - 10https://gerrit.wikimedia.org/r/976745 (owner: 10Jbond) [14:21:38] PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:38] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:21:42] (03CR) 10Jbond: [C: 03+2] puppet:agent: change error to warning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976745 (owner: 10Jbond) [14:21:54] PROBLEM - Hadoop NodeManager on an-worker1141 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:21:56] PROBLEM - Check systemd state on an-worker1141 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:28] PROBLEM - Check systemd state on an-worker1121 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:51] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: don't use srv records [puppet] - 10https://gerrit.wikimedia.org/r/976738 (owner: 10Filippo Giunchedi) [14:22:54] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:58] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add pontoon log bullseye [puppet] - 10https://gerrit.wikimedia.org/r/976739 (owner: 10Filippo Giunchedi) 
[14:23:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T348183)', diff saved to https://phabricator.wikimedia.org/P53725 and previous config saved to /var/cache/conftool/dbconfig/20231122-142301-arnaudb.json [14:23:03] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [14:23:06] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance [14:23:07] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:23:10] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:23:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T348183)', diff saved to https://phabricator.wikimedia.org/P53726 and previous config saved to /var/cache/conftool/dbconfig/20231122-142312-arnaudb.json [14:23:15] jbond: I'll merge your patch too [14:24:16] RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:34] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:24:42] godog: please [14:24:45] thanks [14:24:59] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:25:12] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10MatthewVernon) I hit this problem when re-imaging `ms-fe*` nodes (for T317616). Most of them PXE booted fine, but two didn't - ms-fe2014.codfw.wmnet needed one further reboot (which I... [14:25:30] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10MatthewVernon) [14:25:33] (03PS6) 10Jbond: puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 [14:25:58] (03CR) 10Jbond: "thanks updated" [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [14:26:00] RECOVERY - Check systemd state on ganeti1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host sessionstore2001.codfw.wmnet [14:26:48] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/653/con" [puppet] - 10https://gerrit.wikimedia.org/r/976740 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [14:27:22] (03Abandoned) 10Jgiannelos: changeprop: Disable restbase pregeneration on PCS [deployment-charts] - 10https://gerrit.wikimedia.org/r/840097 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [14:27:44] (03CR) 10Filippo Giunchedi: [V: 03+1] "See PCC, this is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/976740 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [14:28:01] (03PS7) 10Jbond: puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 [14:29:39] 
(03PS8) 10Volans: puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [14:29:51] (03PS1) 10Ilias Sarantopoulos: ml-services: update docker images to latest versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/976748 (https://phabricator.wikimedia.org/T347551) [14:29:55] RECOVERY - Hadoop NodeManager on an-worker1087 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:30:07] PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:30:31] (03CR) 10Jbond: puppet: add hiera_lookup function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [14:30:59] !log restarting Cassandra, sessionstore2001 (post-Puppet 7 migration) [14:31:02] (03PS2) 10Filippo Giunchedi: varnishkafka: move to rsyslog::conf [puppet] - 10https://gerrit.wikimedia.org/r/976740 (https://phabricator.wikimedia.org/T351799) [14:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:04] (03PS2) 10Filippo Giunchedi: rsyslog: support alternative base in ::conf [puppet] - 10https://gerrit.wikimedia.org/r/976741 (https://phabricator.wikimedia.org/T351799) [14:31:06] (03PS2) 10Filippo Giunchedi: rsyslog_exporter: move to a define [puppet] - 10https://gerrit.wikimedia.org/r/976742 (https://phabricator.wikimedia.org/T351799) [14:31:08] (03PS2) 10Filippo Giunchedi: rsyslog: ship a separate 'receiver' instance [puppet] - 10https://gerrit.wikimedia.org/r/976743 (https://phabricator.wikimedia.org/T351799) [14:31:10] (03PS2) 10Filippo Giunchedi: prometheus: fetch from rsyslog-receiver exporter 
[puppet] - 10https://gerrit.wikimedia.org/r/976744 (https://phabricator.wikimedia.org/T351799) [14:31:43] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:32:01] PROBLEM - Check systemd state on an-worker1119 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:21] PROBLEM - Check systemd state on an-worker1112 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:28] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1009.eqiad.wmnet with reason: host reimage [14:32:29] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:32:37] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:32:37] RECOVERY - Check systemd state on an-worker1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:40] jouncebot: nowandnext [14:32:40] For the next 0 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1400) [14:32:40] In 0 hour(s) and 27 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1500) [14:33:05] (03Abandoned) 10Ilias Sarantopoulos: ml-services: rollback xgboost/catboost models to kserve 0.10 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/975205 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos) [14:33:46] (03CR) 10Clément Goubert: [C: 03+2] trafficserver: move 25% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/964449 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [14:34:07] RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:34:10] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2001.codfw.wmnet with reason: host reimage [14:34:40] !log start re-provisioning and re-imaging cp1113 to fix wrong subnet (T342159) [14:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:44] T342159: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 [14:35:00] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2009.codfw.wmnet with reason: host reimage [14:35:13] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-fe2009.codfw.wmnet with reason: host reimage [14:35:32] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1009.eqiad.wmnet with reason: host reimage [14:36:18] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10MatthewVernon) 05Stalled→03In progress [14:37:20] (03CR) 10CI reject: [V: 04-1] puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [14:38:03] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2001.codfw.wmnet with reason: host reimage [14:38:19] RECOVERY - Hadoop NodeManager on an-worker1141 
is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:38:21] RECOVERY - Check systemd state on an-worker1141 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:23] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:31] (03PS9) 10Volans: puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [14:39:59] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:41:07] RECOVERY - Check systemd state on an-worker1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:24] (03Abandoned) 10Jgiannelos: tegola: Enable structured logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/963289 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos) [14:41:55] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:42:09] PROBLEM - MD RAID on ms-fe2009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.139: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:43:23] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:53] (03PS1) 10Ayounsi: Expose Netbox's BGP servers to Homer [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) [14:44:15] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:44:45] PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:51] (03Abandoned) 10Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [14:47:12] (03CR) 10Jbond: [C: 03+2] puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [14:47:21] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: fetch-rings-codfw.service,fetch-rings-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:35] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [14:48:13] PROBLEM - Host ms-fe2009 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:37] (03PS1) 10Btullis: Fix Matomo TagManager functionality [puppet] - 10https://gerrit.wikimedia.org/r/976750 (https://phabricator.wikimedia.org/T349910) [14:49:05] (03CR) 10CI reject: [V: 04-1] Fix Matomo TagManager functionality [puppet] - 10https://gerrit.wikimedia.org/r/976750 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [14:49:45] RECOVERY - Host ms-fe2009 is UP: PING OK - Packet loss = 0%, RTA = 31.86 ms 
[14:49:59] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:50:02] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/654/con" [puppet] - 10https://gerrit.wikimedia.org/r/976750 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [14:50:45] !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts cp1113.eqiad.wmnet [14:51:50] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ayounsi) Can you try this? T348119#9224341 Fun fact, I found that task on Google after starting to look for that specific Broadcom PXE string. [14:52:19] RECOVERY - MD RAID on ms-fe2009 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:52:49] RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:07] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:53:13] RECOVERY - Check systemd state on ml-serve2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:21] PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: fetch-rings-codfw.service,fetch-rings-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:23] (JobUnavailable) 
resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:45] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:54:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:54:29] RECOVERY - Check systemd state on an-worker1119 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:31] (03Merged) 10jenkins-bot: puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [14:54:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:54:47] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1009.eqiad.wmnet with OS bullseye [14:54:58] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1009.eqiad.wmnet with OS bullseye completed: - ms-fe1009 (**PASS**... [14:55:14] (03CR) 10Jbond: "@andrea, you may have noticed that i have based a change set of min on top of yours. 
I plan to merge that change set on Tuesday with fili" [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [14:56:04] (03PS2) 10Btullis: Fix Matomo TagManager functionality [puppet] - 10https://gerrit.wikimedia.org/r/976750 (https://phabricator.wikimedia.org/T349910) [14:56:11] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:38] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2001.codfw.wmnet with OS bullseye [14:56:43] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:57:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2009.codfw.wmnet with OS bullseye [14:57:17] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2009.codfw.wmnet with OS bullseye completed: - ms-fe2009 (**WARN**... 
[14:57:44] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2002.codfw.wmnet with OS bullseye [14:58:09] PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:58:29] RECOVERY - Check systemd state on an-worker1112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:45] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:58:48] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/976752 [14:59:21] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host wdqs2008.codfw.wmnet [14:59:43] RECOVERY - Hadoop NodeManager on an-worker1106 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:43] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:49] !log repool ms-fe2009 with new envoy TLS setup T317616 [14:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:53] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 [15:00:02] !log repool ms-fe1009 with new envoy TLS setup T317616 [15:00:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:07] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1500) [15:00:07] PROBLEM - Check systemd state on an-worker1147 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:17] PROBLEM - Hadoop NodeManager on an-worker1122 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:21] PROBLEM - Check systemd state on an-worker1122 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:37] PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:40] !log uncordoned and repooled kubernetes1013 [15:00:41] (03PS1) 10Muehlenhoff: Switch wdqs2008 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976753 (https://phabricator.wikimedia.org/T349619) [15:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:05] PROBLEM - Check systemd state on an-worker1107 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:34] !log fabfur@cumin1001 START - Cookbook sre.dns.netbox [15:01:42] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/976740 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [15:01:43] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at 
PXE booting - https://phabricator.wikimedia.org/T350179 (10ssingh) @ayounsi: Just as another data point, I did check this (twice for many cp hosts) and all had the correct boot order. Someone should confir... [15:01:45] PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:02:16] (03CR) 10Muehlenhoff: [C: 03+2] Switch wdqs2008 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976753 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:02:46] !log depool moss-fe2001 to reimage with new envoy TLS setup T317616 [15:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:52] (03CR) 10JHathaway: [V: 03+2] rsync: ensure daemon is started after config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway) [15:02:55] (03CR) 10JHathaway: [V: 03+2 C: 03+2] rsync: ensure daemon is started after config [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway) [15:02:55] PROBLEM - Check systemd state on an-worker1143 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:03] PROBLEM - Hadoop NodeManager on an-worker1143 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:03:05] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:03:11] !log depool moss-fe1001 to reimage with new envoy TLS setup T317616 [15:03:13] PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:20] (03CR) 10MVernon: [C: 03+2] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976676 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [15:03:57] PROBLEM - Check systemd state on an-worker1150 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:09] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp1113.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001" [15:04:35] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-fe1001.eqiad.wmnet with OS bullseye [15:04:42] (03CR) 10Jbond: [C: 03+1] "LGTM (comments are just fyi's)" [puppet] - 10https://gerrit.wikimedia.org/r/976741 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [15:04:45] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-fe1001.eqiad.wmnet with OS bullseye [15:04:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2001.codfw.wmnet with OS bullseye [15:05:02] 10SRE, 10SRE-swift-storage, 10Traffic, 
10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye [15:05:13] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:13] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:29] RECOVERY - Hadoop NodeManager on an-worker1122 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:05:33] RECOVERY - Check systemd state on an-worker1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:35] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:05:45] RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:53] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v8.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/976752 (owner: 10Volans) [15:06:29] (PuppetZeroResources) firing: Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:06:34] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/976742 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [15:06:54] !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp1113.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001" [15:06:55] !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:06:55] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp1113.eqiad.wmnet [15:06:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host wdqs2008.codfw.wmnet [15:07:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `cp1113.eqiad.wmnet` - cp1113.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanag... 
[15:08:05] (03PS2) 10Dr0ptp4kt: wikireplicas: add user_is_temp column to user view [puppet] - 10https://gerrit.wikimedia.org/r/958543 (https://phabricator.wikimedia.org/T346679) (owner: 10Milimetric) [15:08:35] (03Abandoned) 10JMeybohm: sre.hosts.reimage: Allow to skip puppet migration [cookbooks] - 10https://gerrit.wikimedia.org/r/976732 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm) [15:08:42] (03CR) 10Jbond: [C: 03+1] "puppet wise lgtm ill leave it for someone else to review the rsyslog stuff" [puppet] - 10https://gerrit.wikimedia.org/r/976743 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [15:08:51] 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [15:08:53] 10SRE, 10Cloud-VPS, 10observability, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q2): ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10lmata) [15:09:37] (03CR) 10Btullis: [C: 03+1] wikireplicas: add user_is_temp column to user view [puppet] - 10https://gerrit.wikimedia.org/r/958543 (https://phabricator.wikimedia.org/T346679) (owner: 10Milimetric) [15:09:37] RECOVERY - Check systemd state on an-worker1107 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:24] (03CR) 10Majavah: [C: 03+2] wikireplicas: add user_is_temp column to user view [puppet] - 10https://gerrit.wikimedia.org/r/958543 (https://phabricator.wikimedia.org/T346679) (owner: 10Milimetric) [15:11:55] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2002.codfw.wmnet with reason: host reimage [15:12:38] (03PS1) 10Hnowlan: api-gateway: use enovy.yaml in place of config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/976757 (https://phabricator.wikimedia.org/T300033) [15:12:41] (03CR) 10Kamila Součková: [C: 03+2] mobileapps: 20% to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976219 
(https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [15:13:18] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/976752 (owner: 10Volans) [15:14:29] (03PS1) 10Volans: Upstream release v8.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/976758 [15:14:38] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2002.codfw.wmnet with reason: host reimage [15:15:11] (03CR) 10JMeybohm: [C: 04-1] api-gateway: use enovy.yaml in place of config.yaml (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/976757 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan) [15:15:21] !log installing python3.7 security updates [15:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:27] (03PS2) 10Kamila Součková: mobileapps: 20% to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976219 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [15:16:13] PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:29] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:17:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: host reimage [15:17:31] PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process 
[15:18:57] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:19:27] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage [15:19:35] RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:20:19] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: host reimage [15:22:06] (03CR) 10Joal: [C: 03+1] "Thanks Ben :)" [puppet] - 10https://gerrit.wikimedia.org/r/976700 (https://phabricator.wikimedia.org/T351388) (owner: 10Btullis) [15:22:11] RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:22:14] (03CR) 10Hashar: "> $ sudo journalctl -u gerrit|grep systemd.*exited" [puppet] - 10https://gerrit.wikimedia.org/r/976679 (owner: 10Hashar) [15:22:51] PROBLEM - Check systemd state on an-worker1120 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:53] RECOVERY - Check systemd state on an-worker1147 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage [15:23:11] RECOVERY - Hadoop 
NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:23:24] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:23:29] RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:29] RECOVERY - Check systemd state on an-worker1143 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:30] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10MatthewVernon) I can try if/when I get another one that fails (I'd be surprised if that were the solution, given "enough reboots" seems to have wo... 
[15:24:30] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10MatthewVernon) [15:24:51] Emperor: ping, puppet-merge is stuck on your patch, see -sre [15:25:54] (03PS1) 10Majavah: Revert "hiera: move two more swift frontends to envoy" [puppet] - 10https://gerrit.wikimedia.org/r/976775 [15:26:40] (03CR) 10JHathaway: [C: 03+1] Revert "hiera: move two more swift frontends to envoy" [puppet] - 10https://gerrit.wikimedia.org/r/976775 (owner: 10Majavah) [15:26:59] (03CR) 10Volans: [C: 03+2] Upstream release v8.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/976758 (owner: 10Volans) [15:27:57] (03Abandoned) 10Majavah: Revert "hiera: move two more swift frontends to envoy" [puppet] - 10https://gerrit.wikimedia.org/r/976775 (owner: 10Majavah) [15:28:16] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe2001.codfw.wmnet with OS bullseye [15:28:26] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye executed with errors: - moss-f... 
[15:28:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2001.codfw.wmnet with OS bullseye [15:28:29] !log mvernon@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host moss-fe1001.eqiad.wmnet with OS bullseye [15:28:39] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-fe1001.eqiad.wmnet with OS bullseye [15:28:40] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye [15:28:43] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [15:28:48] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-fe1001.eqiad.wmnet with OS bullseye executed with errors: - moss-f... 
[15:28:59] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-fe1001.eqiad.wmnet with OS bullseye [15:29:43] RECOVERY - Check systemd state on an-worker1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:03] RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:30:06] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: backup::databases [15:30:58] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1175 - jclark@cumin1001" [15:31:21] (03PS1) 10Jbond: backup::databases: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976761 (https://phabricator.wikimedia.org/T349619) [15:31:49] (03CR) 10Jbond: [C: 03+2] backup::databases: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976761 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:31:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1175 - jclark@cumin1001" [15:31:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:32:21] (03PS10) 10Vgutierrez: profile: Provide a lvs::realserver::ipip profile [puppet] - 10https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069) [15:33:03] RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:03] (03CR) 10Vgutierrez: "adding Filippo to get his take on the prometheus::ops stuff :)" [puppet] - 10https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:33:21] RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:33:45] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host testreduce1002.eqiad.wmnet [15:34:15] (03CR) 10BBlack: [C: 03+1] "LGTM fundamentally, but it's hard to know the outcome in these cases until we try on a real host!" [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:34:20] going to make an alter table in s8 in cloud replicas [15:34:24] (03CR) 10BBlack: [C: 03+1] "LGTM fundamentally, but it's hard to know the outcome in these cases until we try on a real host!" 
[puppet] - 10https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:35:01] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: stunnel4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:26] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: backup::databases [15:35:54] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:36:01] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2002.codfw.wmnet with OS bullseye [15:36:49] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: backup::es [15:37:18] (03CR) 10Btullis: [V: 03+1 C: 03+2] airflow: change max_active_runs_per_dag back to 1 [puppet] - 10https://gerrit.wikimedia.org/r/976700 (https://phabricator.wikimedia.org/T351388) (owner: 10Btullis) [15:38:17] (03PS1) 10Jbond: backup::es: migrat to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976762 (https://phabricator.wikimedia.org/T349619) [15:38:34] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1175.mgmt.eqiad.wmnet with reboot policy FORCED [15:38:45] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:38:46] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:38:56] (03Merged) 10jenkins-bot: Upstream release v8.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/976758 (owner: 10Volans) [15:39:42] (03PS1) 10Majavah: hieradata: depool web wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/976763 [15:39:49] (03CR) 10Jbond: [C: 03+2] backup::es: migrat to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976762 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:39:54] (03CR) 10Brouberol: Define the spark-history/spark-history-test k8s namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713) (owner: 10Brouberol) [15:40:01] (03PS1) 10Muehlenhoff: Switch testreduce1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976764 (https://phabricator.wikimedia.org/T349619) [15:40:15] RECOVERY - Hadoop NodeManager on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:40:28] (03CR) 10Btullis: [C: 03+1] hieradata: depool web wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/976763 (owner: 10Majavah) [15:40:33] (03CR) 10Majavah: [C: 03+2] hieradata: depool web wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/976763 (owner: 10Majavah) [15:40:56] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: host reimage [15:41:17] (03CR) 10Muehlenhoff: [C: 03+2] Switch testreduce1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976764 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:41:59] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage [15:42:49] PROBLEM - Hadoop NodeManager on an-worker1120 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:42:50] !log uploaded spicerack_8.2.0 to apt.wikimedia.org bullseye-wikimedia [15:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:08] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:43:35] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: backup::es [15:43:40] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: host reimage [15:43:57] RECOVERY - Hadoop NodeManager on an-worker1120 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:44:03] RECOVERY - Check systemd state on an-worker1120 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:11] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:44:49] (03PS1) 10Majavah: Revert "hieradata: depool web wiki replicas" [puppet] - 10https://gerrit.wikimedia.org/r/976777 [15:45:03] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2002.codfw.wmnet [15:45:03] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: backup::production [15:45:04] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 
aqs2002.codfw.wmnet [15:45:27] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2003.codfw.wmnet with OS bullseye [15:45:48] (03CR) 10Majavah: [C: 03+2] Revert "hieradata: depool web wiki replicas" [puppet] - 10https://gerrit.wikimedia.org/r/976777 (owner: 10Majavah) [15:45:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [15:46:11] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:46:12] (03CR) 10Hashar: [C: 04-1] "`/robots.txt` is indeed shared and it is more or less obsolete or at least a remnant of the past." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971281 (https://phabricator.wikimedia.org/T254808) (owner: 10Ejegg) [15:46:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage [15:46:38] (03PS1) 10Majavah: hieradata: depool analytics wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/976786 [15:46:48] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:46:53] (03PS1) 10Ladsgroup: Add virtual domain for botpasswords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976787 (https://phabricator.wikimedia.org/T351559) [15:47:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host testreduce1002.eqiad.wmnet [15:47:39] (03CR) 10Btullis: [C: 03+1] hieradata: depool analytics wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/976786 (owner: 10Majavah) [15:47:51] (03CR) 10Majavah: [C: 03+2] hieradata: depool analytics wiki replicas [puppet] - 10https://gerrit.wikimedia.org/r/976786 (owner: 10Majavah) [15:47:54] (03PS1) 10Jbond: backup::production: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976788 (https://phabricator.wikimedia.org/T349619) [15:48:30] !log fabfur@cumin1001 START - Cookbook 
sre.dns.netbox [15:48:35] (03CR) 10Jbond: [C: 03+2] backup::production: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976788 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:49:25] (03PS2) 10Hnowlan: api-gateway: use enovy.yaml in place of config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/976757 (https://phabricator.wikimedia.org/T300033) [15:49:33] taavi: feel free to mrge mine if promted [15:49:40] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [15:49:45] jbond: I already merged mine, try again? [15:49:57] RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:52] (03CR) 10Jelto: [C: 03+2] gerrit: accept SIGINT as a valid exit code [puppet] - 10https://gerrit.wikimedia.org/r/976679 (owner: 10Hashar) [15:51:50] taavi: ck cheers [15:52:05] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding cp1113 back with correct VLAN - fabfur@cumin1001" [15:52:35] RECOVERY - Hadoop NodeManager on an-worker1143 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:52:45] (03PS1) 10Majavah: Revert "hieradata: depool analytics wiki replicas" [puppet] - 10https://gerrit.wikimedia.org/r/976778 [15:52:59] !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding cp1113 back with correct VLAN - fabfur@cumin1001" [15:52:59] !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:53:34] (03CR) 10Majavah: [C: 03+2] Revert 
"hieradata: depool analytics wiki replicas" [puppet] - 10https://gerrit.wikimedia.org/r/976778 (owner: 10Majavah) [15:53:43] !log fabfur@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1113 [15:54:29] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe2001.codfw.wmnet with OS bullseye [15:54:40] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye executed with errors: - moss-f... [15:55:18] !log fabfur@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1113 [15:55:29] !log installing dpkg bugfix updates on bullseye [15:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2001.codfw.wmnet with OS bullseye [15:56:01] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye [15:56:43] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:06] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: backup::production [15:57:55] PROBLEM - Check systemd state on an-worker1104 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:17] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe1001.eqiad.wmnet 
with OS bullseye [15:58:27] PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:58:27] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-fe1001.eqiad.wmnet with OS bullseye completed: - moss-fe1001 (**WA... [15:58:31] PROBLEM - Check systemd state on an-presto1010 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:32] (03CR) 10Jbond: [C: 03+1] P:dns::auth::update: add support for setting ferm rules via confd (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:59:01] (03CR) 10Jbond: [C: 03+1] P:dns::auth::update: add support for authdns-update hosts via confd [puppet] - 10https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:59:13] PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:37] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: dbbackups::content [16:00:14] jouncebot: nowandnext [16:00:14] No deployments scheduled for the next 1 hour(s) and 59 minute(s) [16:00:14] In 1 hour(s) and 59 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1800) [16:00:21] (03PS2) 10Jforrester: wikifunctions: Roll back Python evaluator to working version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/975882 [16:00:26] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Roll back Python evaluator to working version [deployment-charts] - 10https://gerrit.wikimedia.org/r/975882 (owner: 10Jforrester) [16:00:42] (03CR) 10JMeybohm: Expose Netbox's BGP servers to Homer (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [16:01:19] (03Merged) 10jenkins-bot: wikifunctions: Roll back Python evaluator to working version [deployment-charts] - 10https://gerrit.wikimedia.org/r/975882 (owner: 10Jforrester) [16:01:23] (03PS1) 10Jbond: dbbackups::content: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976790 (https://phabricator.wikimedia.org/T349619) [16:01:48] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs2003.codfw.wmnet with OS bullseye [16:01:53] (03CR) 10Jbond: [C: 03+2] dbbackups::content: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976790 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [16:02:09] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2003.codfw.wmnet with OS bullseye [16:02:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "Prometheus bits LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:02:16] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:02:20] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:05:20] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:05:30] !log disable Puppet on A:lvs to merge CR 976312 [16:05:31] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dbbackups::content [16:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:38] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-jobrunner: add vhost for jobrunner.discovery.wmnet [deployment-charts] - 10https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:05:52] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: dbbackups::metadata [16:06:00] (03CR) 10Ssingh: [V: 03+1 C: 03+2] pybal: do not install from component [puppet] - 10https://gerrit.wikimedia.org/r/976312 (https://phabricator.wikimedia.org/T348837) (owner: 10Ssingh) [16:06:33] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [16:07:24] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:07:39] (03PS1) 10Jbond: dbbackups::metadata: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976791 (https://phabricator.wikimedia.org/T349619) [16:08:02] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:08:11] (03CR) 10Jbond: [C: 03+2] dbbackups::metadata: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976791 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [16:08:54] !log enable Puppet on A:lvs to merge CR 
976312 and run agent [16:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:00] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:09:40] RECOVERY - Check systemd state on an-worker1104 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:56] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage [16:09:59] (PuppetZeroResources) firing: Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:10:02] RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:10:14] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:21] (03PS1) 10Jforrester: Revert "wikifunctions: Bump evaluators to 2023-11-20-171133" [deployment-charts] - 10https://gerrit.wikimedia.org/r/976779 [16:10:31] (03CR) 10CI reject: [V: 04-1] Revert "wikifunctions: Bump evaluators to 2023-11-20-171133" [deployment-charts] - 10https://gerrit.wikimedia.org/r/976779 (owner: 10Jforrester) [16:10:42] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:11:02] (03PS2) 10Jforrester: Revert "wikifunctions: Bump evaluators to 2023-11-20-171133" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/976779 [16:11:10] (03CR) 10Jforrester: [C: 03+2] Revert "wikifunctions: Bump evaluators to 2023-11-20-171133" [deployment-charts] - 10https://gerrit.wikimedia.org/r/976779 (owner: 10Jforrester) [16:11:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1113.eqiad.wmnet with OS bullseye [16:11:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (10VRiley-WMF) @Volans Thank you for the information! I have ran through these again and with the help @RobH these should be corrected. Also, virtualization has been... [16:12:09] (03Merged) 10jenkins-bot: Revert "wikifunctions: Bump evaluators to 2023-11-20-171133" [deployment-charts] - 10https://gerrit.wikimedia.org/r/976779 (owner: 10Jforrester) [16:12:51] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dbbackups::metadata [16:12:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage [16:13:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (10VRiley-WMF) [16:13:13] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:13:18] (03CR) 10Hnowlan: [C: 03+2] mw-jobrunner: add vhost for jobrunner.discovery.wmnet [deployment-charts] - 10https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:13:32] PROBLEM - Check systemd state on kubernetes2028 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:35] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:13:45] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:14:01] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: mediabackup::storage [16:14:02] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10MatthewVernon) [16:14:25] !log repool moss-fe1001 with new envoy TLS setup T317616 [16:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:30] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 [16:14:46] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:14:48] (03Merged) 10jenkins-bot: mw-jobrunner: add vhost for jobrunner.discovery.wmnet [deployment-charts] - 10https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:15:27] !log hnowlan@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [16:15:27] !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [16:15:28] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:15:35] Betawikis seem to be broken - Cannot log into an account. 
[16:15:46] (03PS1) 10Jbond: dbbackups::metadata: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976793 (https://phabricator.wikimedia.org/T349619) [16:15:51] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2003.codfw.wmnet with reason: host reimage [16:15:52] Error is Error 1146: Table 'wikishared.loginnotify_seen_net' doesn't exist [16:15:52] Function: LoginNotify\LoginNotify::userIsInCurrentSeenBucket [16:15:53] Query: SELECT 1 FROM `loginnotify_seen_net` WHERE lsn_user = 184252 AND lsn_subnet = -6951683680560312271 AND lsn_time_bucket = 2460 LIMIT 1 [16:16:05] !log depool ms-fe1014 to reimage with new envoy TLS setup T317616 [16:16:06] !log hnowlan@deploy2002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [16:16:07] !log hnowlan@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [16:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:09] !log hnowlan@deploy2002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [16:16:09] !log hnowlan@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [16:16:20] !log hnowlan@deploy2002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [16:16:20] !log hnowlan@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [16:16:28] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:16:29] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:16:55] (03CR) 10MVernon: [C: 03+2] hiera: move final swift frontend to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976677 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [16:18:04] !log mvernon@cumin1001 
START - Cookbook sre.hosts.reimage for host ms-fe1014.eqiad.wmnet with OS bullseye [16:18:14] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1014.eqiad.wmnet with OS bullseye [16:18:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:18:37] (03CR) 10Majavah: [V: 03+1 C: 03+2] cloudlb: explicitely bind openstack mysql to ip [puppet] - 10https://gerrit.wikimedia.org/r/976736 (owner: 10Majavah) [16:18:52] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2003.codfw.wmnet with reason: host reimage [16:19:09] (03CR) 10Jbond: [C: 03+2] dbbackups::metadata: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976793 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [16:20:42] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe2001.codfw.wmnet with OS bullseye [16:20:51] 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye executed with errors: - moss-f... 
[16:21:18] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:21:36] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:47] (03CR) 10JMeybohm: [C: 03+1] api-gateway: use enovy.yaml in place of config.yaml (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/976757 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan) [16:23:59] (PuppetFailure) firing: Puppet has failed on kubernetes1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:23:59] (PuppetZeroResources) firing: Puppet has failed generate resources on aux-k8s-worker1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:24:15] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mediabackup::storage [16:24:47] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: mediabackup::worker [16:25:13] (03PS1) 10Hnowlan: jobrunner: remove php version related checks from httpbb [puppet] - 10https://gerrit.wikimedia.org/r/976794 (https://phabricator.wikimedia.org/T349796) [16:25:45] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [16:25:54] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1113.eqiad.wmnet with reason: host reimage [16:25:55] (03CR) 10Hnowlan: [C: 03+2] api-gateway: use enovy.yaml in place of config.yaml 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/976757 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan) [16:26:29] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:26:42] (03PS1) 10Jbond: mediabackup::worker: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976795 (https://phabricator.wikimedia.org/T349619) [16:26:51] (03Merged) 10jenkins-bot: api-gateway: use enovy.yaml in place of config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/976757 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan) [16:27:13] (03CR) 10Jbond: [C: 03+2] mediabackup::worker: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976795 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [16:28:51] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1113.eqiad.wmnet with reason: host reimage [16:29:57] (03CR) 10Dreamy Jazz: "This caused https://phabricator.wikimedia.org/T351828" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [16:29:59] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:30:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2001.codfw.wmnet with OS bullseye [16:30:56] 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye [16:31:29] 
(PuppetZeroResources) firing: (6) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:31:32] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mediabackup::worker [16:31:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] jobrunner: remove php version related checks from httpbb [puppet] - 10https://gerrit.wikimedia.org/r/976794 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:31:47] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [16:31:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage [16:32:24] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:33:59] (PuppetFailure) resolved: Puppet has failed on kubernetes1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:34:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage [16:34:33] 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10MatthewVernon) (perhaps the moss-fe2001 puppet failures are due to T350809 ) [16:34:59] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:35:11] !log mvernon@cumin1001 START - 
Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1014.eqiad.wmnet with reason: host reimage [16:36:29] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:38:01] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Add a default rsyslog destination for all sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [16:38:18] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1014.eqiad.wmnet with reason: host reimage [16:38:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on aux-k8s-worker1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:39:16] (03CR) 10Andrea Denisse: [C: 03+2] prometheus: Add a default rsyslog destination for all sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [16:40:20] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2003.codfw.wmnet with OS bullseye [16:41:59] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [16:42:09] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2003.codfw.wmnet [16:42:09] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs2003.codfw.wmnet [16:42:28] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:42:42] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:43:11] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host 
aqs2004.codfw.wmnet with OS bullseye [16:44:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe2001.codfw.wmnet with OS bullseye [16:44:52] 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye completed: - moss-fe2001 (**PASS**) - Downtimed on... [16:45:59] !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [16:46:53] !log repool moss-fe2001 with new envoy TLS setup T317616 [16:46:56] (03PS1) 10JHathaway: puppetserver: fix java_start_mem in template [puppet] - 10https://gerrit.wikimedia.org/r/976799 [16:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:57] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 [16:47:23] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/976799 (owner: 10JHathaway) [16:47:29] 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10MatthewVernon) [16:47:41] !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001" [16:47:43] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1113.eqiad.wmnet with OS bullseye [16:47:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1113.eqiad.wmnet with OS bullseye completed: - cp1113 (**PASS**) - Remo... 
[16:48:15] (03CR) 10Pppery: "No idea. I just followed the convention of the existing files." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694) (owner: 10Pppery) [16:48:51] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [16:49:54] (03PS3) 10Majavah: interface: new define for managing routing table names [puppet] - 10https://gerrit.wikimedia.org/r/976734 [16:49:56] (03PS1) 10Majavah: interface::route: add support for cloud-private bgp routes [puppet] - 10https://gerrit.wikimedia.org/r/976800 [16:50:07] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/976799 (owner: 10JHathaway) [16:50:31] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976799 (owner: 10JHathaway) [16:51:03] (03CR) 10CI reject: [V: 04-1] interface::route: add support for cloud-private bgp routes [puppet] - 10https://gerrit.wikimedia.org/r/976800 (owner: 10Majavah) [16:51:46] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/655/con" [puppet] - 10https://gerrit.wikimedia.org/r/976734 (owner: 10Majavah) [16:52:42] (03PS2) 10Majavah: interface::route: add support for cloud-private bgp routes [puppet] - 10https://gerrit.wikimedia.org/r/976800 [16:53:59] (PuppetFailure) firing: (2) Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:55:09] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1014.eqiad.wmnet with OS bullseye [16:55:16] 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
mvernon@cumin1001 for host ms-fe1014.eqiad.wmnet with OS bullseye completed: - ms-fe1014 (**PASS**) - Downtimed on Ici... [16:55:53] !log installed spicerack v8.2.0 to the cumin hosts [16:55:54] (03CR) 10Hnowlan: [C: 03+2] jobrunner: remove php version related checks from httpbb [puppet] - 10https://gerrit.wikimedia.org/r/976794 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:22] RECOVERY - Check systemd state on an-presto1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:36] !log repool ms-fe1014 with new envoy TLS setup T317616 [16:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:41] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 [16:56:44] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 13 CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/976800 (owner: 10Majavah) [16:56:48] RECOVERY - Check systemd state on an-worker1128 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:20] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2004.codfw.wmnet with reason: host reimage [16:57:26] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1113.eqiad.wmnet [16:57:27] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1113.eqiad.wmnet [16:57:48] 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10MatthewVernon) [16:59:03] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [16:59:31] (03PS1) 10Hnowlan: api-gateway: correct config mount path 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/976801 [17:01:04] !log swapped cp1113 <-> cp1088 (T349244) [17:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:16] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [17:02:04] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2004.codfw.wmnet with reason: host reimage [17:02:07] (03CR) 10Hnowlan: [C: 03+2] api-gateway: correct config mount path [deployment-charts] - 10https://gerrit.wikimedia.org/r/976801 (owner: 10Hnowlan) [17:03:01] (03Merged) 10jenkins-bot: api-gateway: correct config mount path [deployment-charts] - 10https://gerrit.wikimedia.org/r/976801 (owner: 10Hnowlan) [17:06:53] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [17:07:09] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [17:10:58] (03CR) 10Pppery: "Noted. I'll update this patch (and the related one elsewhere in the tree that updates the files actually read by Phabricator) then" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363) (owner: 10Pppery) [17:11:59] RECOVERY - Check systemd state on kubernetes2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:40] (03CR) 10Btullis: [C: 03+1] "Looks good to me, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/976385 (https://phabricator.wikimedia.org/T332589) (owner: 10Stevemunene) [17:21:19] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [17:21:46] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [17:23:34] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [17:23:35] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2004.codfw.wmnet with OS bullseye [17:23:55] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [17:24:45] (03PS1) 10Samtar: Revert "InitialiseSettings-labs: Enable AbuseFilterBlockedExternalDomainsNotifications on enwiki.beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976814 [17:24:51] (03PS2) 10Samtar: Revert "InitialiseSettings-labs: Enable AbuseFilterBlockedExternalDomainsNotifications on enwiki.beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976814 [17:25:00] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:25:10] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:25:15] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:17] jouncebot: nowandnext [17:25:17] No deployments scheduled for the next 0 hour(s) and 34 minute(s) [17:25:17] In 0 hour(s) and 34 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1800) [17:26:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976814 (owner: 10Samtar) [17:26:23] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2004.codfw.wmnet [17:26:23] !log eevans@cumin1001 END (PASS) - Cookbook 
sre.hosts.remove-downtime (exit_code=0) for aqs2004.codfw.wmnet [17:26:50] (03Merged) 10jenkins-bot: Revert "InitialiseSettings-labs: Enable AbuseFilterBlockedExternalDomainsNotifications on enwiki.beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976814 (owner: 10Samtar) [17:27:08] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:27:21] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:27:36] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2005.codfw.wmnet with OS bullseye [17:27:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1175.mgmt.eqiad.wmnet with reboot policy FORCED [17:28:02] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [17:28:15] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [17:29:29] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1175'] [17:30:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:36:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1175'] [17:37:24] (03CR) 10Dreamy Jazz: Reapply "Enable LoginNotify seen subnets table"" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [17:39:37] PROBLEM - Disk space on build2001 is CRITICAL: DISK CRITICAL - free space: / 10021 MB (4% inode=66%): /tmp 10021 MB (4% inode=66%): /var/tmp 10021 MB (4% inode=66%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops [17:42:50] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2005.codfw.wmnet with reason: host reimage [17:44:55] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [17:45:29] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2005.codfw.wmnet with reason: host reimage [17:45:55] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Move 25% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T348122 (10Clement_Goubert) 05In progress→03Resolved a:03Clement_Goubert [17:51:03] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable AddLink frontend for 16,17th rounds of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976804 (https://phabricator.wikimedia.org/T308142) [17:55:27] (03PS1) 10Fabfur: conftool-data: (temporary) remove cp1113 [puppet] - 10https://gerrit.wikimedia.org/r/976805 (https://phabricator.wikimedia.org/T349244) [17:58:41] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:30] (03PS2) 10Fabfur: conftool-data: (temporary) remove cp1113 [puppet] - 10https://gerrit.wikimedia.org/r/976805 (https://phabricator.wikimedia.org/T349244) [18:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1800) [18:02:29] (03CR) 10Ssingh: [C: 03+1] conftool-data: (temporary) remove cp1113 [puppet] - 10https://gerrit.wikimedia.org/r/976805 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [18:03:16] (03CR) 10Fabfur: [C: 03+2] conftool-data: (temporary) remove cp1113 [puppet] - 
10https://gerrit.wikimedia.org/r/976805 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [18:03:45] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2005.codfw.wmnet with OS bullseye [18:04:51] RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:06:35] RECOVERY - Check systemd state on kubernetes2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:51] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:06:59] (03PS1) 10Fabfur: conftool-data: re-added cp1113 [puppet] - 10https://gerrit.wikimedia.org/r/976826 (https://phabricator.wikimedia.org/T349244) [18:07:15] (03CR) 10Ssingh: [C: 03+1] conftool-data: re-added cp1113 [puppet] - 10https://gerrit.wikimedia.org/r/976826 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [18:12:39] PROBLEM - Check systemd state on kubernetes1005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:15:01] (03CR) 10Fabfur: [C: 03+2] conftool-data: re-added cp1113 [puppet] - 10https://gerrit.wikimedia.org/r/976826 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [18:16:15] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 337 probes of 735 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:16:34] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1160.eqiad.wmnet with OS bullseye [18:16:41] 10SRE, 
10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye [18:17:17] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 53 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:18:17] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2017 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:19:46] (03CR) 10Dzahn: "could this have caused https://phabricator.wikimedia.org/T351832 ?" [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway) [18:20:21] PROBLEM - Disk space on build2001 is CRITICAL: DISK CRITICAL - free space: / 10800 MB (4% inode=65%): /tmp 10800 MB (4% inode=65%): /var/tmp 10800 MB (4% inode=65%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops [18:21:53] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 80 probes of 735 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:22:37] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 19 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:25:55] PROBLEM - Check systemd state on kubernetes2007 is CRITICAL: CRITICAL - 
degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10RobH) attempting to reimage an-worker1160 it sticks at requesting a lease for boot, host shows the MAC of the eth0 attempting to request a dhcp lease for boot. on insta... [18:29:37] PROBLEM - Hadoop NodeManager on an-worker1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:29:45] PROBLEM - Check systemd state on an-worker1152 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:32:21] RECOVERY - Hadoop NodeManager on an-worker1152 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:32:29] RECOVERY - Check systemd state on an-worker1152 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:36:44] (03PS1) 10Dzahn: doc: move rsync auth secrets to new location to unbreak puppet [puppet] - 10https://gerrit.wikimedia.org/r/976830 (https://phabricator.wikimedia.org/T351832) [18:37:36] (03CR) 10JHathaway: [C: 03+1] "thanks, looks good" [puppet] - 10https://gerrit.wikimedia.org/r/976830 (https://phabricator.wikimedia.org/T351832) (owner: 10Dzahn) [18:38:40] (03CR) 10Dzahn: [C: 03+2] doc: move rsync auth secrets to new location to unbreak puppet [puppet] - 10https://gerrit.wikimedia.org/r/976830 (https://phabricator.wikimedia.org/T351832) (owner: 10Dzahn) [18:40:43] RECOVERY - Disk space on 
build2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops [18:48:31] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:39] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:49:35] (03CR) 10Dzahn: "fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/976830" [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway) [18:50:59] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [18:55:36] (03PS1) 10Majavah: rsync: do not included config for absented modules [puppet] - 10https://gerrit.wikimedia.org/r/976835 [18:56:32] (03PS27) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [18:59:46] (03CR) 10Muehlenhoff: Initial checkin of community_civicrm module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [19:00:06] Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1900) [19:03:23] (03PS1) 10DDesouza: Update Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353) [19:04:02] (03PS2) 10DDesouza: Undeploy Reader Demographics 2 survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976325 (https://phabricator.wikimedia.org/T344393) [19:04:19] 
PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 41 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:04:29] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 141 probes of 737 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:05:30] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2005.codfw.wmnet [19:05:31] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs2005.codfw.wmnet [19:06:35] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2006.codfw.wmnet with OS bullseye [19:09:59] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 63 probes of 737 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:14:39] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:03] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 24 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:18:57] (03PS2) 10DDesouza: Update Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353) [19:22:57] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2006.codfw.wmnet with reason: host reimage [19:23:11] 
PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 43 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:24:50] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:25:23] (03PS1) 10DDesouza: [beta] Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976842 (https://phabricator.wikimedia.org/T351353) [19:25:37] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2006.codfw.wmnet with reason: host reimage [19:26:17] 10SRE, 10Data-Engineering, 10Event-Platform: Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10brouberol) I wonder if something as simple as round robin DNS implemented with multiple A records with the same subdomain would suffice to substantially improve the situation. In... 
[19:28:33] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 21 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:30:46] (03PS1) 10DDesouza: [beta] Clean residuals from Research Incentive survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976843 (https://phabricator.wikimedia.org/T316464) [19:33:51] (03PS1) 10DDesouza: Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976844 (https://phabricator.wikimedia.org/T351353) [19:36:41] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 49 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:36:51] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1160.eqiad.wmnet with OS bullseye [19:36:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye executed with errors: - an-worker1... 
[19:42:03] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 19 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:42:12] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/976835 (owner: 10Majavah) [19:44:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T348183)', diff saved to https://phabricator.wikimedia.org/P53732 and previous config saved to /var/cache/conftool/dbconfig/20231122-194428-arnaudb.json [19:44:36] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [19:47:45] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:08] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2006.codfw.wmnet with OS bullseye [19:55:27] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2006.codfw.wmnet [19:55:27] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs2006.codfw.wmnet [19:56:32] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2007.codfw.wmnet with OS bullseye [19:59:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P53733 and previous config saved to /var/cache/conftool/dbconfig/20231122-195934-arnaudb.json [20:09:55] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10wiki_willy) a:03Jclark-ctr [20:10:46] (03PS5) 10Jforrester: wikifunctions: Switch orchestrator to 
2023-11-06-172159 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971999 (https://phabricator.wikimedia.org/T297509) [20:10:48] (03PS1) 10Jforrester: wikifunctions: Switch JavaScript evaluator to 2023-11-22-195017 [deployment-charts] - 10https://gerrit.wikimedia.org/r/976846 (https://phabricator.wikimedia.org/T349385) [20:10:55] (03PS1) 10Jforrester: wikifunctions: Switch Python evaluator to 2023-11-22-195017 [deployment-charts] - 10https://gerrit.wikimedia.org/r/976847 (https://phabricator.wikimedia.org/T281500) [20:11:08] (03PS1) 10Jforrester: wikifunctions: Switch orchestrator to 2023-11-17-200241 [deployment-charts] - 10https://gerrit.wikimedia.org/r/976848 (https://phabricator.wikimedia.org/T297509) [20:11:33] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2007.codfw.wmnet with reason: host reimage [20:13:03] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 62 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:14:14] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2007.codfw.wmnet with reason: host reimage [20:14:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P53734 and previous config saved to /var/cache/conftool/dbconfig/20231122-201441-arnaudb.json [20:15:21] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:23] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:19:43] PROBLEM - BGP status on cr1-esams is 
CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:27:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:27:13] (03PS2) 10DDesouza: [beta] Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976842 (https://phabricator.wikimedia.org/T351353) [20:28:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.313 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:29:13] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 26 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:29:29] (03PS3) 10DDesouza: Update Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353) [20:29:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T348183)', diff saved to https://phabricator.wikimedia.org/P53735 and previous config saved to /var/cache/conftool/dbconfig/20231122-202947-arnaudb.json [20:29:52] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [20:30:05] (03PS2) 10DDesouza: Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976844 (https://phabricator.wikimedia.org/T351353) [20:33:51] PROBLEM - Check systemd state on an-worker1156 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:54] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host aqs2007.codfw.wmnet with OS bullseye [20:34:21] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:34:59] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:35:10] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2007.codfw.wmnet [20:35:11] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs2007.codfw.wmnet [20:35:48] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2008.codfw.wmnet with OS bullseye [20:36:29] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:37:07] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:37:45] (03PS3) 10DDesouza: [beta] Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976842 (https://phabricator.wikimedia.org/T351353) [20:38:01] RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:40:14] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10Dzahn) 
[20:41:05] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10Dzahn) [20:41:14] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10Dzahn) [20:41:53] (03CR) 10JHathaway: [C: 03+2] puppetserver: fix java_start_mem in template [puppet] - 10https://gerrit.wikimedia.org/r/976799 (owner: 10JHathaway) [20:43:37] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10Dzahn) ` cookbook [GLOBAL_ARGS] sre.ganeti.makevm: error: argument --memory: Memory must be at least 1.5G ` Oh really? Well then 1.5G. But we used to have VMs with 256MB, didn't we [20:43:54] (03CR) 10Krinkle: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [20:44:56] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10Dzahn) ` sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 1.5G ... .. error: argument --memory: invalid validate_memory value: '1.5G' ` ` sudo cookbook sre.ganeti.makevm --vc... 
[20:45:30] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1175.eqiad.wmnet with OS bullseye [20:45:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1175.eqiad.wmnet with OS bullseye [20:47:54] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host planet1003.eqiad.wmnet [20:47:55] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [20:50:04] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM planet1003.eqiad.wmnet - dzahn@cumin1001" [20:50:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM planet1003.eqiad.wmnet - dzahn@cumin1001" [20:50:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:50:52] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache planet1003.eqiad.wmnet on all recursors [20:50:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) planet1003.eqiad.wmnet on all recursors [20:51:29] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM planet1003.eqiad.wmnet - dzahn@cumin1001" [20:52:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM planet1003.eqiad.wmnet - dzahn@cumin1001" [20:53:46] (03PS1) 10Dzahn: site: add planet[12]003 to role [puppet] - 10https://gerrit.wikimedia.org/r/976854 (https://phabricator.wikimedia.org/T348392) [20:53:48] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/975832 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [20:53:59] (PuppetFailure) firing: (2) Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:58:09] (03PS1) 10JHathaway: g10k: spelling [puppet] - 10https://gerrit.wikimedia.org/r/976856 [20:58:11] (03PS1) 10JHathaway: puppetserver: use a symlink to swap in new code [puppet] - 10https://gerrit.wikimedia.org/r/976857 (https://phabricator.wikimedia.org/T350809) [20:58:35] (03PS1) 10Dzahn: site: add planet[12]003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/976858 (https://phabricator.wikimedia.org/T351849) [20:58:57] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976857 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway) [20:59:03] (03CR) 10Dzahn: [C: 03+2] site: add planet[12]003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/976858 (https://phabricator.wikimedia.org/T351849) (owner: 10Dzahn) [20:59:05] (03CR) 10JHathaway: [C: 03+2] g10k: spelling [puppet] - 10https://gerrit.wikimedia.org/r/976856 (owner: 10JHathaway) [21:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T2100). [21:00:06] danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:27] o/ [21:00:41] I can deploy [21:00:47] (03PS2) 10Dzahn: site: add planet[12]003 to planet role [puppet] - 10https://gerrit.wikimedia.org/r/976854 (https://phabricator.wikimedia.org/T348392) [21:00:53] (03PS3) 10Dzahn: site: add planet[12]003 to planet role [puppet] - 10https://gerrit.wikimedia.org/r/976854 (https://phabricator.wikimedia.org/T348392) [21:01:18] (03CR) 10Dzahn: [C: 04-1] site: add planet[12]003 to planet role [puppet] - 10https://gerrit.wikimedia.org/r/976854 (https://phabricator.wikimedia.org/T348392) (owner: 10Dzahn) [21:01:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976325 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:02:49] (03Merged) 10jenkins-bot: Undeploy Reader Demographics 2 survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976325 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza) [21:02:56] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/976857 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway) [21:03:06] !log catrope@deploy2002 Started scap: Backport for [[gerrit:976325|Undeploy Reader Demographics 2 survey on enwiki (T344393)]] [21:03:13] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [21:03:15] (03PS28) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [21:04:28] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2008.codfw.wmnet with reason: host reimage [21:07:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host planet1003.eqiad.wmnet with OS bookworm [21:07:25] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2008.codfw.wmnet with reason: host reimage [21:07:31] 10SRE, 
10Infrastructure-Foundations, 10vm-requests, 10Patch-For-Review: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm [21:11:43] !log catrope@deploy2002 catrope and dani: Backport for [[gerrit:976325|Undeploy Reader Demographics 2 survey on enwiki (T344393)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:12:02] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [21:12:21] danisztls: Your first change (undeploy Reader Demographics 2 on enwiki) is now ready for testing on the test servers, please test and ping me when you've confirmed it works [21:14:14] RoanKattouw: looks good [21:16:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on planet1003.eqiad.wmnet with reason: host reimage [21:18:47] !log catrope@deploy2002 catrope and dani: Continuing with sync [21:19:05] (03PS1) 10JHathaway: apt-staging: unbreak rsync puppetry [puppet] - 10https://gerrit.wikimedia.org/r/976863 (https://phabricator.wikimedia.org/T345830) [21:19:42] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976863 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway) [21:19:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on planet1003.eqiad.wmnet with reason: host reimage [21:22:15] (03PS1) 10Gergő Tisza: CentralAuth: Fix wikisource.org cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976864 (https://phabricator.wikimedia.org/T351685) [21:22:48] (03CR) 10CI reject: [V: 04-1] CentralAuth: Fix wikisource.org cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976864 (https://phabricator.wikimedia.org/T351685) (owner: 10Gergő Tisza) [21:24:44] (03PS29) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] 
- 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [21:24:48] (03PS2) 10Gergő Tisza: CentralAuth: Fix wikisource.org cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976864 (https://phabricator.wikimedia.org/T351685) [21:24:50] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:976325|Undeploy Reader Demographics 2 survey on enwiki (T344393)]] (duration: 21m 43s) [21:24:54] T344393: Quicksurvey deployment for readers survey - https://phabricator.wikimedia.org/T344393 [21:25:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:26:21] Alright the Reader Demographics change is deployed, the Core Metrics one is next [21:26:57] (03PS4) 10Catrope: Update Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:27:01] RoanKattouw: thanks! 
[21:27:02] (03CR) 10TrainBranchBot: "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:27:14] (03CR) 10Catrope: [C: 03+2] Update Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:27:53] (03Merged) 10jenkins-bot: Update Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:28:10] !log catrope@deploy2002 Started scap: Backport for [[gerrit:976839|Update Annual Plan Core Metrics survey (T351353)]] [21:28:15] T351353: Deploy survey (Community Feedback on Core Metrics Reports) on Meta-Wiki - https://phabricator.wikimedia.org/T351353 [21:28:53] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2008.codfw.wmnet with OS bullseye [21:29:28] !log catrope@deploy2002 catrope and dani: Backport for [[gerrit:976839|Update Annual Plan Core Metrics survey (T351353)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:29:41] (03CR) 10Dzahn: [C: 03+1] apt-staging: unbreak rsync puppetry [puppet] - 10https://gerrit.wikimedia.org/r/976863 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway) [21:29:45] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2008.codfw.wmnet [21:29:45] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs2008.codfw.wmnet [21:29:54] danisztls: The Core Metrics patch is on the test servers, please test [21:30:04] (03CR) 10JHathaway: [C: 03+2] apt-staging: unbreak rsync puppetry [puppet] - 10https://gerrit.wikimedia.org/r/976863 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway) [21:30:06] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2009.codfw.wmnet 
with OS bullseye [21:30:39] RoanKattouw: looks good [21:30:55] !log catrope@deploy2002 catrope and dani: Continuing with sync [21:31:16] (03CR) 10Krinkle: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [21:34:22] (03CR) 10Dwisehaupt: "Thanks for the suggestions. Updates made and changeset rebased to pull in the latest repo updates." [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [21:37:15] !log catrope@deploy2002 Finished scap: Backport for [[gerrit:976839|Update Annual Plan Core Metrics survey (T351353)]] (duration: 09m 04s) [21:37:20] T351353: Deploy survey (Community Feedback on Core Metrics Reports) on Meta-Wiki - https://phabricator.wikimedia.org/T351353 [21:37:43] (03PS4) 10Catrope: [beta] Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976842 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:37:48] (03CR) 10Catrope: [C: 03+2] [beta] Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976842 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:37:59] (03PS2) 10Catrope: [beta] Clean residuals from Research Incentive survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976843 (https://phabricator.wikimedia.org/T316464) (owner: 10DDesouza) [21:38:03] (03CR) 10Catrope: [C: 03+2] [beta] Clean residuals from Research Incentive survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976843 (https://phabricator.wikimedia.org/T316464) (owner: 10DDesouza) [21:39:03] RoanKattouw: thanks! 
[21:39:08] (03Merged) 10jenkins-bot: [beta] Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976842 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza) [21:39:16] (03Merged) 10jenkins-bot: [beta] Clean residuals from Research Incentive survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976843 (https://phabricator.wikimedia.org/T316464) (owner: 10DDesouza) [21:39:37] danisztls: Now that these beta patches are merged, there's no manual deployment process, they're automatically deployed to beta labs but it can take ~15 minutes [21:42:27] (03CR) 10Jbond: [C: 03+1] "not tested but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/976857 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway) [21:44:00] RoanKattouw: no problem, I will check them later [21:44:07] thanks, again! [21:44:12] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2009.codfw.wmnet with reason: host reimage [21:44:21] Great! 
And I think that's everything for today [21:46:44] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2009.codfw.wmnet with reason: host reimage [21:56:54] (03PS1) 10Dzahn: hieradata: set planet[12]003 to use puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976867 (https://phabricator.wikimedia.org/T351849) [22:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T2200) [22:00:05] (03CR) 10Jgreen: [C: 03+1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:00:09] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir4001 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikimedia.is has 86391 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [22:00:18] (03CR) 10Dzahn: [C: 03+2] hieradata: set planet[12]003 to use puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976867 (https://phabricator.wikimedia.org/T351849) (owner: 10Dzahn) [22:00:43] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 86356 seconds left https://wikitech.wikimedia.org/wiki/Ncredir [22:04:16] (03PS1) 10Ebernhardson: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/976868 [22:05:32] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2009.codfw.wmnet with OS bullseye [22:06:27] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2010.codfw.wmnet with OS bullseye [22:07:13] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1175.eqiad.wmnet with OS bullseye [22:07:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1175.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [22:08:47] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host planet1003.eqiad.wmnet with OS bookworm [22:08:47] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host planet1003.eqiad.wmnet [22:08:57] 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10Patch-For-Review: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm executed with errors: -... [22:08:59] (PuppetFailure) firing: (2) Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:09:37] jhathaway: ^ something went wrong with the patch maybe? [22:10:18] hmm, strange, a manual puppet run was successful, let me check, thanks mutante [22:10:29] is the 2001 vs 1001? 
[22:11:15] there is only 2001, to my knowledge [22:11:29] also it doesn't show up on the alerts dashboard, hmm [22:11:48] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host planet1003.eqiad.wmnet with OS bookworm [22:11:53] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm [22:16:02] apt-staging is a single host, more of an initial PoC which will either get extended with a second staging host or folded into the main apt servers, TBD [22:16:10] its rsync endpoints are the gitlab runners [22:17:09] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/976868 (owner: 10Ebernhardson) [22:17:29] nod, thanks moritzm [22:17:56] (03Merged) 10jenkins-bot: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/976868 (owner: 10Ebernhardson) [22:18:27] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:18:34] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:18:43] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:19:27] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:20:18] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs2010.codfw.wmnet with OS bullseye [22:20:34] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2010.codfw.wmnet with OS bullseye [22:21:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
planet1003.eqiad.wmnet with reason: host reimage [22:23:43] (03CR) 10JHathaway: [C: 03+2] puppetserver: use a symlink to swap in new code [puppet] - 10https://gerrit.wikimedia.org/r/976857 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway) [22:24:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on planet1003.eqiad.wmnet with reason: host reimage [22:25:40] !log start cirrus updater backfilling into relforge [22:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:11] (03PS1) 10JHathaway: puppet-merge: test, no changes [puppet] - 10https://gerrit.wikimedia.org/r/976871 [22:33:09] (03CR) 10JHathaway: [C: 03+2] puppet-merge: test, no changes [puppet] - 10https://gerrit.wikimedia.org/r/976871 (owner: 10JHathaway) [22:34:29] !log puppetserver1001 - manually signed puppet cert request for planet1003 [22:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:26] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host planet2003.codfw.wmnet [22:35:28] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [22:38:59] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM planet2003.codfw.wmnet - dzahn@cumin1001" [22:40:19] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM planet2003.codfw.wmnet - dzahn@cumin1001" [22:40:19] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:40:19] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache planet2003.codfw.wmnet on all recursors [22:40:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) planet2003.codfw.wmnet on all recursors [22:40:49] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox 
hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM planet2003.codfw.wmnet - dzahn@cumin1001" [22:41:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM planet2003.codfw.wmnet - dzahn@cumin1001" [22:41:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host planet2003.codfw.wmnet with OS bookworm [22:42:02] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet2003.codfw.wmnet with OS bookworm [22:43:01] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs2010.codfw.wmnet with OS bullseye [22:43:19] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2010.codfw.wmnet with OS bullseye [22:49:09] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:52:38] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating ganeti servers in codfw - jhancock@cumin2002" [22:53:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating ganeti servers in codfw - jhancock@cumin2002" [22:53:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:57:19] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2010.codfw.wmnet with reason: host reimage [22:57:34] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 7h 44m 37s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [22:58:08] (03PS1) 10JHathaway: Revert "puppet-merge: test, no changes" [puppet] - 10https://gerrit.wikimedia.org/r/976818 [23:00:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on planet2003.codfw.wmnet with reason: host reimage [23:01:15] (03CR) 10JHathaway: [C: 03+2] Revert "puppet-merge: test, no changes" [puppet] - 10https://gerrit.wikimedia.org/r/976818 (owner: 10JHathaway) [23:02:34] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 7h 30m 48s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [23:02:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2028.mgmt.codfw.wmnet with reboot policy FORCED [23:02:47] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2010.codfw.wmnet with reason: host reimage [23:05:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on planet2003.codfw.wmnet with reason: host reimage [23:06:34] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 6h 14m 24s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [23:09:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2029.mgmt.codfw.wmnet with reboot policy FORCED [23:10:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2030.mgmt.codfw.wmnet with reboot policy FORCED [23:10:54] PROBLEM - Check systemd state on logstash2001 is CRITICAL: CRITICAL - degraded: The following units failed: 
export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:11:34] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 6h 40m 25s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [23:13:05] (03PS4) 10JHathaway: dev env: PS1 function for to show the puppet env [puppet] - 10https://gerrit.wikimedia.org/r/965536 (https://phabricator.wikimedia.org/T337970) [23:13:20] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965536 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [23:13:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2028.mgmt.codfw.wmnet with reboot policy FORCED [23:15:02] PROBLEM - Check systemd state on ms-be2057 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:15:40] PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:02] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:18:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2031.mgmt.codfw.wmnet with reboot policy FORCED [23:18:38] PROBLEM - Check systemd state on bast2003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:21:12] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The 
following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:21:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2029.mgmt.codfw.wmnet with reboot policy FORCED [23:21:36] PROBLEM - Check systemd state on krb2002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:22:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2030.mgmt.codfw.wmnet with reboot policy FORCED [23:22:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2032.mgmt.codfw.wmnet with reboot policy FORCED [23:22:54] PROBLEM - Check systemd state on ms-be2062 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:23:02] PROBLEM - Check systemd state on kubernetes2052 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:23:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2033.mgmt.codfw.wmnet with reboot policy FORCED [23:26:07] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2010.codfw.wmnet with OS bullseye [23:26:10] PROBLEM - Check systemd state on ganeti-test1002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase2033.mgmt.codfw.wmnet with reboot policy FORCED [23:26:26] PROBLEM - Check systemd state on kubernetes2032 is CRITICAL: CRITICAL - degraded: The following units failed: 
export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:28] PROBLEM - Check systemd state on an-presto1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2033.mgmt.codfw.wmnet with reboot policy FORCED [23:27:50] PROBLEM - Check systemd state on kubernetes1039 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:28:23] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:29:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2031.mgmt.codfw.wmnet with reboot policy FORCED [23:30:12] PROBLEM - Check systemd state on kubernetes1017 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:22] PROBLEM - Check systemd state on kafka-main2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:24] PROBLEM - Check systemd state on ganeti1011 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:31:48] PROBLEM - Check systemd state on kubernetes2037 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:32:21] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2034.mgmt.codfw.wmnet with reboot policy FORCED [23:32:50] (03PS30) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [23:32:56] PROBLEM - Check systemd state on kubestage1004 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:33:18] PROBLEM - Check systemd state on an-worker1081 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:33:18] PROBLEM - Check systemd state on graphite2004 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:33:19] (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [23:33:20] PROBLEM - Check systemd state on an-worker1135 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:33:54] PROBLEM - Check systemd state on kubernetes1037 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2032.mgmt.codfw.wmnet with reboot policy FORCED [23:34:26] PROBLEM - Check systemd state on kafka-main2005 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:26] PROBLEM - Check systemd 
state on ganeti4005 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:35:32] PROBLEM - Check systemd state on backup1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:35:36] PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:35:46] PROBLEM - Check systemd state on dbprov2003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2035.mgmt.codfw.wmnet with reboot policy FORCED [23:36:14] PROBLEM - Check systemd state on ml-serve1006 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:32] PROBLEM - Check systemd state on dumpsdata1007 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:42] PROBLEM - Check systemd state on sessionstore2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:48] PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:52] PROBLEM - Check systemd state on ganeti5005 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:36:52] PROBLEM - Check systemd state on backup2007 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:08] PROBLEM - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:26] PROBLEM - Check systemd state on ml-cache1003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:12] PROBLEM - Check systemd state on ml-serve2005 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:20] PROBLEM - Check systemd state on kubernetes2012 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:22] PROBLEM - Check systemd state on ganeti-test2002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:42] (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:39:06] PROBLEM - Check systemd state on ganeti-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:39:36] PROBLEM - Check systemd state on an-worker1142 is CRITICAL: CRITICAL - degraded: The following units 
failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:39:42] PROBLEM - Check systemd state on kafka-jumbo1012 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:40:56] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:41:04] PROBLEM - Check systemd state on ganeti5007 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:41:04] PROBLEM - Check systemd state on puppetdb2003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:41:10] PROBLEM - Check systemd state on an-worker1109 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:42:14] PROBLEM - Check systemd state on an-worker1134 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:42:26] PROBLEM - Check systemd state on an-presto1012 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:43:42] (SystemdUnitFailed) firing: (2) export_smart_data_dump.service Failed on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:43:46] PROBLEM - Check systemd state on relforge1003 is CRITICAL: 
CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:43:52] PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:43:56] PROBLEM - Check systemd state on kafka-jumbo1011 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:44:02] PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:44:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2034.mgmt.codfw.wmnet with reboot policy FORCED [23:44:22] PROBLEM - Check systemd state on an-presto1015 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:44:34] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:44:52] PROBLEM - Check systemd state on ganeti2010 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:02] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase2033.mgmt.codfw.wmnet with reboot policy FORCED [23:45:04] PROBLEM - Check systemd state on ms-be2066 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:04] PROBLEM - 
Check systemd state on ganeti1023 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:39] (03PS31) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) [23:45:52] PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:46:56] PROBLEM - Check systemd state on analytics1075 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2035.mgmt.codfw.wmnet with reboot policy FORCED [23:47:14] PROBLEM - Check systemd state on ms-backup2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:26] PROBLEM - Check systemd state on kubernetes1024 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:26] PROBLEM - Check systemd state on kubernetes2035 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:48:02] PROBLEM - Check systemd state on ganeti1021 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:48:54] PROBLEM - Check systemd state on backup2006 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:49:26] PROBLEM - Check systemd state on ml-serve1002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:49:35] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2028'] [23:50:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2028'] [23:50:08] PROBLEM - Check systemd state on cp4037 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:50:22] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2028'] [23:50:38] (03CR) 10Dwisehaupt: "Minor changes to not show diff with db password on the grants file. And update grants to the current grants used." 
[puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [23:50:50] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2028'] [23:50:58] PROBLEM - Check systemd state on kubernetes1045 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:50:59] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2028'] [23:51:04] PROBLEM - Check systemd state on kubernetes2046 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:51:08] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase2028'] [23:51:22] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2028'] [23:51:50] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase2028'] [23:52:04] PROBLEM - Check systemd state on an-conf1003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:52:50] PROBLEM - Check systemd state on an-presto1011 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:40] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2029'] [23:53:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase2029'] [23:54:00] !log jhancock@cumin2002 START - 
Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2029'] [23:54:26] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2029'] [23:54:39] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2030'] [23:54:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2033.mgmt.codfw.wmnet with reboot policy FORCED [23:55:15] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2030'] [23:55:36] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2031'] [23:55:59] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2031'] [23:56:09] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2032'] [23:56:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2032'] [23:56:51] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2033'] [23:57:42] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase2033'] [23:58:16] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2034'] [23:58:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2034'] [23:58:54] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2035'] [23:59:09] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2035']