[00:18:23] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:38:56] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/975935
[00:39:02] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/975935 (owner: TrainBranchBot)
[00:41:26] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1158.eqiad.wmnet with OS bullseye
[00:41:31] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye
[00:42:53] <wikibugs>	 (CR) Tim Starling: [C: +2] Reapply "Enable LoginNotify seen subnets table"" [mediawiki-config] - https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[00:44:02] <wikibugs>	 (PS2) Tim Starling: Reapply "Enable LoginNotify seen subnets table"" [mediawiki-config] - https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989)
[00:44:09] <wikibugs>	 (CR) Tim Starling: Reapply "Enable LoginNotify seen subnets table"" [mediawiki-config] - https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[00:44:52] <wikibugs>	 (Merged) jenkins-bot: Reapply "Enable LoginNotify seen subnets table"" [mediawiki-config] - https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[00:54:09] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/975935 (owner: TrainBranchBot)
[00:55:15] <logmsgbot>	 !log tstarling@deploy2002 Synchronized wmf-config/CommonSettings.php: enable LoginNotify seen subnets table g965663 T346989 (duration: 06m 23s)
[00:55:20] <stashbot>	 T346989: Deploy LoginNotify seen subnets table - https://phabricator.wikimedia.org/T346989
[01:09:48] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:11:08] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.503 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:21:45] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1158']
[01:27:51] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1158']
[01:28:31] <wikibugs>	 SRE, ops-eqiad: aqs1012: reseat SSD (/dev/sdh)? - https://phabricator.wikimedia.org/T351320 (Eevans) Open→Resolved a:Jclark-ctr This done; Thanks!
[01:30:20] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:31:34] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:35:48] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:41] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1158.eqiad.wmnet with OS bullseye
[02:01:46] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[02:07:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:08:46] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:21:06] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:21:14] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:26:24] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:38:23] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:22] <wikibugs>	 (PS4) Pppery: Run generate.php and arc liberate [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763)
[02:59:46] <wikibugs>	 (PS7) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[02:59:48] <wikibugs>	 (PS5) Pppery: Run generate.php and arc liberate [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763)
[03:02:54] <wikibugs>	 (PS8) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[03:02:56] <wikibugs>	 (PS5) Pppery: Merge in changes to qqq.json rather than overwriting them [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975392 (https://phabricator.wikimedia.org/T351363)
[03:02:58] <wikibugs>	 (PS4) Pppery: Undo qqq.json overwrites [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363)
[03:03:59] <wikibugs>	 (PS6) Pppery: Run generate.php and arc liberate [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975413 (https://phabricator.wikimedia.org/T318763)
[03:08:23] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:18:48] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:18:56] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:23:23] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:24:10] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:28:32] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:28:40] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:32:40] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:32:48] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:36:20] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:36:40] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50862 bytes in 2.353 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:36:46] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.962 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:45:04] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:45:14] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:46:18] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:50:36] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:04:05] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2023-11-20-052250-production [deployment-charts] - https://gerrit.wikimedia.org/r/976369 (https://phabricator.wikimedia.org/T341458)
[04:18:23] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:28:10] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:21:22] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:10:53] <wikibugs>	 (PS1) Marostegui: Revert "db1210: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/976334
[06:13:05] <wikibugs>	 (PS1) Marostegui: ProductionServices.php: Promote pc2014 to pc2 master [mediawiki-config] - https://gerrit.wikimedia.org/r/976383 (https://phabricator.wikimedia.org/T351620)
[06:13:31] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1210: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/976334 (owner: Marostegui)
[06:14:02] <wikibugs>	 (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc2014 to pc2 master [mediawiki-config] - https://gerrit.wikimedia.org/r/976383 (https://phabricator.wikimedia.org/T351620) (owner: Marostegui)
[06:14:54] <wikibugs>	 (Merged) jenkins-bot: ProductionServices.php: Promote pc2014 to pc2 master [mediawiki-config] - https://gerrit.wikimedia.org/r/976383 (https://phabricator.wikimedia.org/T351620) (owner: Marostegui)
[06:15:07] <wikibugs>	 (PS1) Marostegui: pc2012: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/976384 (https://phabricator.wikimedia.org/T351620)
[06:15:27] <logmsgbot>	 !log marostegui@deploy2002 Started scap: Backport for [[gerrit:976383|ProductionServices.php: Promote pc2014 to pc2 master (T351620)]]
[06:15:33] <stashbot>	 T351620: Upgrade pc2 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351620
[06:15:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2012,2014].codfw.wmnet,pc[1012,1014].eqiad.wmnet with reason: Switch
[06:15:53] <wikibugs>	 (CR) Marostegui: [C: +2] pc2012: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/976384 (https://phabricator.wikimedia.org/T351620) (owner: Marostegui)
[06:16:16] <wikibugs>	 (PS1) Stevemunene: set druid hosts to use the reuse partman recipe [puppet] - https://gerrit.wikimedia.org/r/976385 (https://phabricator.wikimedia.org/T332589)
[06:16:17] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2012,2014].codfw.wmnet,pc[1012,1014].eqiad.wmnet with reason: Switch
[06:16:50] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:976383|ProductionServices.php: Promote pc2014 to pc2 master (T351620)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:17:04] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[06:22:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 10%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53688 and previous config saved to /var/cache/conftool/dbconfig/20231122-062228-root.json
[06:22:34] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 7h 1m 1s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[06:22:56] <logmsgbot>	 !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:976383|ProductionServices.php: Promote pc2014 to pc2 master (T351620)]] (duration: 07m 28s)
[06:23:01] <stashbot>	 T351620: Upgrade pc2 to Debian Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351620
[06:23:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc2012.codfw.wmnet with OS bookworm
[06:25:56] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_purge_parsercache_pc2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:28:52] <wikibugs>	 (CR) Nikerabbit: [C: +1] "Any idea why the Translatewiki files contain language name in addition to the language code in the file names?" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694) (owner: Pppery)
[06:31:20] <wikibugs>	 (CR) Nikerabbit: "There might be a few more on Thursday as then is the next export after I finished importing all Phabricator changes." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363) (owner: Pppery)
[06:37:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 25%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53689 and previous config saved to /var/cache/conftool/dbconfig/20231122-063733-root.json
[06:38:45] <wikibugs>	 (PS1) Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/976337
[06:38:52] <wikibugs>	 (CR) Marostegui: [C: -2] "Not yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/976337 (owner: Marostegui)
[06:40:58] <wikibugs>	 (PS1) Marostegui: Revert "pc2012: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/976338
[06:41:06] <wikibugs>	 (CR) Marostegui: [C: -2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/976338 (owner: Marostegui)
[06:41:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2012.codfw.wmnet with reason: host reimage
[06:44:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2012.codfw.wmnet with reason: host reimage
[06:52:34] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 8h 5m 50s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[06:52:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 50%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53690 and previous config saved to /var/cache/conftool/dbconfig/20231122-065238-root.json
[06:57:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2012.codfw.wmnet with OS bookworm
[06:58:12] <wikibugs>	 (CR) Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/976337 (owner: Marostegui)
[06:58:17] <marostegui>	 jouncebot: next
[06:58:18] <jouncebot>	 In 0 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T0700)
[06:58:37] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/976337 (owner: Marostegui)
[06:59:19] <wikibugs>	 (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc2014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/976337 (owner: Marostegui)
[06:59:46] <logmsgbot>	 !log marostegui@deploy2002 Started scap: Backport for [[gerrit:976337|Revert "ProductionServices.php: Promote pc2014 to pc2 master"]]
[06:59:47] <wikibugs>	 (CR) Marostegui: Revert "pc2012: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/976338 (owner: Marostegui)
[06:59:50] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "pc2012: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/976338 (owner: Marostegui)
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T0700)
[07:01:04] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:976337|Revert "ProductionServices.php: Promote pc2014 to pc2 master"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:02:08] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[07:07:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 75%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53691 and previous config saved to /var/cache/conftool/dbconfig/20231122-070742-root.json
[07:07:57] <logmsgbot>	 !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:976337|Revert "ProductionServices.php: Promote pc2014 to pc2 master"]] (duration: 08m 10s)
[07:19:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 to test 10.4.32 T351283', diff saved to https://phabricator.wikimedia.org/P53692 and previous config saved to /var/cache/conftool/dbconfig/20231122-071911-root.json
[07:19:17] <stashbot>	 T351283: Compile and package MariaDB 10.6.16 and 10.4.32 - https://phabricator.wikimedia.org/T351283
[07:22:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 100%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53693 and previous config saved to /var/cache/conftool/dbconfig/20231122-072247-root.json
[07:22:53] <wikibugs>	 (PS1) Marostegui: pc2014: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/976648
[07:23:23] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:25:01] <wikibugs>	 (CR) Marostegui: [C: +2] pc2014: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/976648 (owner: Marostegui)
[07:31:08] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:31:56] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:33:48] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:34:02] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.218 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:44:08] <wikibugs>	 (PS1) Marostegui: apt_repo.yaml: Do not reimage db1236 [puppet] - https://gerrit.wikimedia.org/r/976649
[07:44:40] <wikibugs>	 (CR) Marostegui: [C: +2] apt_repo.yaml: Do not reimage db1236 [puppet] - https://gerrit.wikimedia.org/r/976649 (owner: Marostegui)
[07:49:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 10%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53694 and previous config saved to /var/cache/conftool/dbconfig/20231122-074923-root.json
[07:50:37] <wikibugs>	 (PS2) KartikMistry: Enable Content/Section translation on some Wikipedias with potential to be supported with MinT [mediawiki-config] - https://gerrit.wikimedia.org/r/975924 (https://phabricator.wikimedia.org/T345267)
[07:53:47] <wikibugs>	 (CR) WMDE-Fisch: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: WMDE-Fisch)
[07:54:53] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, one nit inline." [puppet] - https://gerrit.wikimedia.org/r/976312 (https://phabricator.wikimedia.org/T348837) (owner: Ssingh)
[07:56:41] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Makes sense" [puppet] - https://gerrit.wikimedia.org/r/976198 (owner: Jbond)
[07:58:41] <wikibugs>	 (PS1) Marostegui: apt_repo.yaml: Do not reimage db1238 [puppet] - https://gerrit.wikimedia.org/r/976652
[07:59:19] <wikibugs>	 (CR) Marostegui: [C: +2] apt_repo.yaml: Do not reimage db1238 [puppet] - https://gerrit.wikimedia.org/r/976652 (owner: Marostegui)
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T0800).
[08:00:05] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:03:52] <wikibugs>	 SRE, ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[08:04:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53695 and previous config saved to /var/cache/conftool/dbconfig/20231122-080428-root.json
[08:04:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: logging::mediawiki::udp2log
[08:04:32] * kart_ is here
[08:05:07] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/975924 (https://phabricator.wikimedia.org/T345267) (owner: KartikMistry)
[08:05:51] <wikibugs>	 (Merged) jenkins-bot: Enable Content/Section translation on some Wikipedias with potential to be supported with MinT [mediawiki-config] - https://gerrit.wikimedia.org/r/975924 (https://phabricator.wikimedia.org/T345267) (owner: KartikMistry)
[08:06:05] <logmsgbot>	 !log kartik@deploy2002 Started scap: Backport for [[gerrit:975924|Enable Content/Section translation on some Wikipedias with potential to be supported with MinT (T345267)]]
[08:06:21] <stashbot>	 T345267: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT - https://phabricator.wikimedia.org/T345267
[08:06:51] <wikibugs>	 (PS1) Muehlenhoff: Switch mwlog to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976653 (https://phabricator.wikimedia.org/T349619)
[08:07:19] <logmsgbot>	 !log kartik@deploy2002 kartik: Backport for [[gerrit:975924|Enable Content/Section translation on some Wikipedias with potential to be supported with MinT (T345267)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:09:01] <logmsgbot>	 !log kartik@deploy2002 kartik: Continuing with sync
[08:10:40] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch mwlog to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976653 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:14:51] <logmsgbot>	 !log kartik@deploy2002 Finished scap: Backport for [[gerrit:975924|Enable Content/Section translation on some Wikipedias with potential to be supported with MinT (T345267)]] (duration: 08m 46s)
[08:14:56] <stashbot>	 T345267: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT - https://phabricator.wikimedia.org/T345267
[08:17:21] <kart_>	 I'm done with deployment;
[08:18:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: logging::mediawiki::udp2log
[08:18:23] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:18:53] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[08:19:06] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance
[08:19:12] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T348183)', diff saved to https://phabricator.wikimedia.org/P53696 and previous config saved to /var/cache/conftool/dbconfig/20231122-081912-arnaudb.json
[08:19:17] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[08:19:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53697 and previous config saved to /var/cache/conftool/dbconfig/20231122-081933-root.json
[08:19:56] <wikibugs>	 (CR) Mvolz: rest-gateway: add params to config, rework citoid path matching (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/973362 (https://phabricator.wikimedia.org/T329049) (owner: Hnowlan)
[08:22:47] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, will need deployment to a single host first and make sure everything is working as expected, especially the paging https probes" [puppet] - https://gerrit.wikimedia.org/r/976273 (https://phabricator.wikimedia.org/T351624) (owner: Jbond)
[08:26:31] <wikibugs>	 SRE, Cloud-VPS, observability, Patch-For-Review: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (fgiunchedi) >>! In T351710#9349895, @Vgutierrez wrote: > nice, but please set a sane TLS configuration :) ideally nothing lower than TLSv1.2 and solid ciphersuites  Tra...
[08:27:56] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:19] <wikibugs>	 SRE-swift-storage, Commons, Data-Persistence-Backup, MediaWiki-File-management, media-backups: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (jcrespo) This is now deployed and media-backups schema is up to date. Media backups are flowing as usual. I am no...
[08:32:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: titan
[08:32:39] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:33:58] <wikibugs>	 (PS1) Muehlenhoff: Switch titan to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976655 (https://phabricator.wikimedia.org/T349619)
[08:34:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53698 and previous config saved to /var/cache/conftool/dbconfig/20231122-083438-root.json
[08:35:10] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch titan to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976655 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:36:10] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:41:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: titan
[08:44:58] <wikibugs>	 (PS1) Filippo Giunchedi: Revert "centrallog: update tls_netstream_driver to use ossl" [puppet] - https://gerrit.wikimedia.org/r/976656 (https://phabricator.wikimedia.org/T351710)
[08:46:01] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:46:28] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] Remove BetaFeature code related to ReferencePreviews [mediawiki-config] - https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: WMDE-Fisch)
[08:49:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: Testing 10.4.32', diff saved to https://phabricator.wikimedia.org/P53699 and previous config saved to /var/cache/conftool/dbconfig/20231122-084943-root.json
[08:50:48] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/628/con" [puppet] - https://gerrit.wikimedia.org/r/976656 (https://phabricator.wikimedia.org/T351710) (owner: Filippo Giunchedi)
[08:54:31] <wikibugs>	 (CR) Filippo Giunchedi: Revert "centrallog: update tls_netstream_driver to use ossl" [puppet] - https://gerrit.wikimedia.org/r/976656 (https://phabricator.wikimedia.org/T351710) (owner: Filippo Giunchedi)
[08:56:38] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/976656 (https://phabricator.wikimedia.org/T351710) (owner: Filippo Giunchedi)
[08:57:15] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Revert "centrallog: update tls_netstream_driver to use ossl" [puppet] - https://gerrit.wikimedia.org/r/976656 (https://phabricator.wikimedia.org/T351710) (owner: Filippo Giunchedi)
[08:58:36] <wikibugs>	 (CR) MVernon: [C: +2] hiera: move two more swift frontends to envoy [puppet] - https://gerrit.wikimedia.org/r/976229 (https://phabricator.wikimedia.org/T317616) (owner: MVernon)
[08:59:03] <Emperor>	 !log depool ms-fe2013 to reimage with new envoy TLS setup T317616
[08:59:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:08] <stashbot>	 T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[08:59:12] <Emperor>	 !log depool ms-fe1013 to reimage with new envoy TLS setup T317616
[08:59:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:36] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1013.eqiad.wmnet with OS bullseye
[09:00:52] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2013.codfw.wmnet with OS bullseye
[09:01:03] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2013.codfw.wmnet with OS bullseye
[09:01:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: gerrit
[09:03:15] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch gerrit to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976657 (https://phabricator.wikimedia.org/T349619)
[09:04:57] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes2041.codfw.wmnet
[09:04:58] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2041.codfw.wmnet
[09:05:18] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2041 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch gerrit to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976657 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[09:06:35] <wikibugs>	 (03PS1) 10Elukey: role::ml_cache::storage: remove override [puppet] - 10https://gerrit.wikimedia.org/r/976658
[09:06:40] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:07:29] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "This change is ready for review." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[09:08:11] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/633/con" [puppet] - 10https://gerrit.wikimedia.org/r/976658 (owner: 10Elukey)
[09:09:23] <wikibugs>	 10SRE, 10ops-codfw, 10Prod-Kubernetes, 10serviceops: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 (10Clement_Goubert) Everything looks good, back in the cluster it goes.  ` 09:04 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernete...
[09:09:27] <wikibugs>	 10SRE, 10ops-codfw, 10Prod-Kubernetes, 10serviceops: kubernetes2041.codfw.wmnet NotReady - https://phabricator.wikimedia.org/T351704 (10Clement_Goubert) 05Open→03Resolved
[09:09:36] <wikibugs>	 (03PS2) 10Elukey: role::ml_cache::storage: remove override [puppet] - 10https://gerrit.wikimedia.org/r/976658
[09:10:45] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/634/con" [puppet] - 10https://gerrit.wikimedia.org/r/976658 (owner: 10Elukey)
[09:10:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: gerrit
[09:10:57] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53700 and previous config saved to /var/cache/conftool/dbconfig/20231122-091056-arnaudb.json
[09:11:05] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53701 and previous config saved to /var/cache/conftool/dbconfig/20231122-091104-arnaudb.json
[09:12:00] <wikibugs>	 (03CR) 10Brouberol: Export the replication factor of kafka topics as a prometheus metric (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/975291 (https://phabricator.wikimedia.org/T346887) (owner: 10Brouberol)
[09:13:00] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[09:14:42] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2013.codfw.wmnet with reason: host reimage
[09:17:18] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2013.codfw.wmnet with reason: host reimage
[09:17:34] <wikibugs>	 (03PS3) 10Elukey: role::ml_cache::storage: remove override [puppet] - 10https://gerrit.wikimedia.org/r/976658
[09:17:37] <wikibugs>	 (03PS1) 10Elukey: profile::base::certificates: rename Puppet's CA file [puppet] - 10https://gerrit.wikimedia.org/r/976659
[09:21:20] <wikibugs>	 (03PS1) 10Elukey: role::kafka::main: move to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976660 (https://phabricator.wikimedia.org/T349619)
[09:24:00] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: update-production-images: fix docker-pkg invokation [puppet] - 10https://gerrit.wikimedia.org/r/976661
[09:25:31] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez)
[09:26:02] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 20%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53702 and previous config saved to /var/cache/conftool/dbconfig/20231122-092601-arnaudb.json
[09:26:10] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53703 and previous config saved to /var/cache/conftool/dbconfig/20231122-092609-arnaudb.json
[09:27:05] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] pybal: do not install from component [puppet] - 10https://gerrit.wikimedia.org/r/976312 (https://phabricator.wikimedia.org/T348837) (owner: 10Ssingh)
[09:27:45] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] pybal,wmflib::service: Add ipip_encapsulation flag on lvs [puppet] - 10https://gerrit.wikimedia.org/r/974623 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[09:30:10] <vgutierrez>	 ^^ that's gonna trigger some pybal config alerts, totally expected
[09:30:29] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2013.codfw.wmnet with OS bullseye
[09:30:36] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2013.codfw.wmnet with OS bullseye completed: - ms-fe2013 (**PASS**)   - Downtimed on Ici...
[09:31:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/976660 (https://phabricator.wikimedia.org/T349619) (owner: 10Elukey)
[09:34:00] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1013.eqiad.wmnet with reason: host reimage
[09:34:17] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.puppet.migrate-role for role: kafka::main
[09:34:43] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::kafka::main: move to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976660 (https://phabricator.wikimedia.org/T349619) (owner: 10Elukey)
[09:35:45] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/976659 (owner: 10Elukey)
[09:35:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host wcqs2001.codfw.wmnet
[09:36:47] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: weekly-update: skip spark images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/976663
[09:36:54] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1013.eqiad.wmnet with reason: host reimage
[09:39:24] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch wcqs2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976664 (https://phabricator.wikimedia.org/T349619)
[09:40:12] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kafka::main
[09:41:07] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53704 and previous config saved to /var/cache/conftool/dbconfig/20231122-094106-arnaudb.json
[09:41:15] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53705 and previous config saved to /var/cache/conftool/dbconfig/20231122-094114-arnaudb.json
[09:43:21] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10elukey)
[09:44:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch wcqs2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976664 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[09:46:17] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::base::certificates: rename Puppet's CA file [puppet] - 10https://gerrit.wikimedia.org/r/976659 (owner: 10Elukey)
[09:47:13] <elukey>	 !log Update of the profile::base::certificate's CA bundle fleet wide (https://gerrit.wikimedia.org/r/c/operations/puppet/+/976659)
[09:47:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:42] <wikibugs>	 (03PS7) 10Vgutierrez: profile: Provide a lvs::realserver::ipip profile [puppet] - 10https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069)
[09:48:49] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] interface: Allow creating IPIP interfaces w/o an endpoint [puppet] - 10https://gerrit.wikimedia.org/r/975253 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[09:49:11] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1013.eqiad.wmnet with OS bullseye
[09:49:18] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye completed: - ms-fe1013 (**PASS**)   - Downtimed on Ici...
[09:49:55] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] interface: Add a clsact helper [puppet] - 10https://gerrit.wikimedia.org/r/975324 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[09:50:42] <vgutierrez>	 elukey: don't we have a task for that not-scary-at-all change?
[09:51:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host wcqs2001.codfw.wmnet
[09:51:56] <elukey>	 vgutierrez: it is a follow-up to some work that John did (upgrading wmf-certificates); I think it is part of the Puppet 7 migration. Since the crt content is the same, no change is triggered, but I logged it for awareness
[09:52:57] <wikibugs>	 (03PS4) 10Elukey: role::ml_cache::storage: remove override [puppet] - 10https://gerrit.wikimedia.org/r/976658
[09:53:27] <vgutierrez>	 !log rolling restart of pybal to catch up on a NOOP config update - T351069
[09:53:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:32] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[09:53:52] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs (T351069)
[09:56:12] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 40%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53706 and previous config saved to /var/cache/conftool/dbconfig/20231122-095611-arnaudb.json
[09:56:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53707 and previous config saved to /var/cache/conftool/dbconfig/20231122-095619-arnaudb.json
[09:56:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::ml_cache::storage: remove override [puppet] - 10https://gerrit.wikimedia.org/r/976658 (owner: 10Elukey)
[09:59:33] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: swift::storage
[10:05:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch swift::storage to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976665 (https://phabricator.wikimedia.org/T349619)
[10:07:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch swift::storage to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976665 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:07:29] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Roll restart after change in the CA bundle - elukey@cumin1001
[10:11:17] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 50%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53708 and previous config saved to /var/cache/conftool/dbconfig/20231122-101116-arnaudb.json
[10:11:25] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 50%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53709 and previous config saved to /var/cache/conftool/dbconfig/20231122-101124-arnaudb.json
[10:14:08] <icinga-wm>	 PROBLEM - Check systemd state on ganeti1025 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:18] <wikibugs>	 (03PS4) 10Zoranzoki21: Deleting Ns:104 in itwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952817 (https://phabricator.wikimedia.org/T298315) (owner: 10Caenus)
[10:21:01] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: switch 15% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976218 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto)
[10:21:46] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs (T351069)
[10:21:48] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] weekly-update: skip spark images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/976663 (owner: 10Giuseppe Lavagetto)
[10:21:51] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[10:22:11] <jnuche>	 jouncebot: nowandnext
[10:22:11] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 37 minute(s)
[10:22:11] <jouncebot>	 In 0 hour(s) and 37 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1100)
[10:23:34] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@0cca675] (releasing): (no justification provided)
[10:24:15] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@0cca675] (releasing): (no justification provided) (duration: 00m 40s)
[10:25:06] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Roll restart after change in the CA bundle - elukey@cumin1001
[10:25:16] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] weekly-update: skip spark images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/976663 (owner: 10Giuseppe Lavagetto)
[10:25:36] <logmsgbot>	 !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[10:25:39] <wikibugs>	 (03PS8) 10Vgutierrez: profile: Provide a lvs::realserver::ipip profile [puppet] - 10https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069)
[10:25:42] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Roll restart after change in the CA bundle - elukey@cumin1001
[10:25:48] <wikibugs>	 (03PS3) 10Vgutierrez: ncredir: Enable IPIP encapsulation on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069)
[10:25:58] <logmsgbot>	 !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[10:26:21] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53710 and previous config saved to /var/cache/conftool/dbconfig/20231122-102621-arnaudb.json
[10:26:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53711 and previous config saved to /var/cache/conftool/dbconfig/20231122-102629-arnaudb.json
[10:26:52] <Emperor>	 !log repool ms-fe1013 with new envoy TLS setup T317616
[10:26:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:58] <stashbot>	 T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[10:27:28] <wikibugs>	 (03Abandoned) 10Clément Goubert: Revert "mw-on-k8s: Redirect vrt-wiki.wikimedia.org to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/933475 (owner: 10Clément Goubert)
[10:27:32] <wikibugs>	 (03Abandoned) 10Clément Goubert: Revert "mw-on-k8s: Redirect www.mediawiki.org to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/933470 (owner: 10Clément Goubert)
[10:27:35] <Emperor>	 !log repool ms-fe2013 with new envoy TLS setup T317616
[10:27:36] <wikibugs>	 (03Abandoned) 10Clément Goubert: Revert "mw-on-k8s: Redirect office.wikimedia.org to mw-on-k8s" [puppet] - 10https://gerrit.wikimedia.org/r/933468 (owner: 10Clément Goubert)
[10:27:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:02] <wikibugs>	 (03Abandoned) 10Clément Goubert: mw-api-int: Raise number of replicas to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/941900 (https://phabricator.wikimedia.org/T342252) (owner: 10Clément Goubert)
[10:28:37] <wikibugs>	 (03Abandoned) 10Clément Goubert: mw-on-k8s: Revert sending traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/935673 (https://phabricator.wikimedia.org/T341078) (owner: 10Clément Goubert)
[10:30:51] <wikibugs>	 10SRE, 10Cloud-VPS, 10observability, 10Patch-For-Review: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi) I tested a revert to `gtls` for centrallog hosts (the receiver part only), rsyslog now stays silent on centrallog though I still see the (re) connections fr...
[10:31:10] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[10:32:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: swift::storage
[10:33:02] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[10:33:03] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[10:33:35] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[10:34:32] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[10:37:55] <wikibugs>	 (03CR) 10Clément Goubert: sre.discovery.service-route: customize lock args (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[10:38:58] <wikibugs>	 (03CR) 10Clément Goubert: sre.discovery.datacenter: customize lock arguments (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[10:40:01] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "I think you still need to overwrite "command" with an empty value in values.yaml in order to actually use the entrypoint" [deployment-charts] - 10https://gerrit.wikimedia.org/r/976206 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan)
[10:40:11] <wikibugs>	 (03PS9) 10Vgutierrez: profile: Provide a lvs::realserver::ipip profile [puppet] - 10https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069)
[10:40:33] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[10:40:40] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[10:41:18] <wikibugs>	 (03PS4) 10Vgutierrez: ncredir: Enable IPIP encapsulation on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069)
[10:41:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 70%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53712 and previous config saved to /var/cache/conftool/dbconfig/20231122-104126-arnaudb.json
[10:41:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 70%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53713 and previous config saved to /var/cache/conftool/dbconfig/20231122-104134-arnaudb.json
[10:42:32] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/636/con" [puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:43:20] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Roll restart after change in the CA bundle - elukey@cumin1001
[10:45:16] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "The logic looks good to me. Thanks." [cookbooks] - 10https://gerrit.wikimedia.org/r/976163 (owner: 10Volans)
[10:46:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host an-db1002.eqiad.wmnet
[10:48:22] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch an-db1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976667 (https://phabricator.wikimedia.org/T349619)
[10:50:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch an-db1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976667 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:52:34] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "seems good!" [puppet] - 10https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:52:52] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "looks ok, good job!" [puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:53:39] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] update-production-images: fix docker-pkg invokation [puppet] - 10https://gerrit.wikimedia.org/r/976661 (owner: 10Giuseppe Lavagetto)
[10:54:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host an-db1002.eqiad.wmnet
[10:54:57] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] service, kubernetes: mw-jobrunner fixes [puppet] - 10https://gerrit.wikimedia.org/r/976191 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[10:55:06] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] service, kubernetes: mw-jobrunner fixes [puppet] - 10https://gerrit.wikimedia.org/r/976191 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[10:55:51] <_joe_>	 hnowlan: merged your change too
[10:55:55] <hnowlan>	 thanks
[10:56:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 80%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53714 and previous config saved to /var/cache/conftool/dbconfig/20231122-105631-arnaudb.json
[10:56:39] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 80%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53715 and previous config saved to /var/cache/conftool/dbconfig/20231122-105639-arnaudb.json
[10:58:56] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-jobrunner_hourly.service Hnowlan Awaiting discovery records being created https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:58:56] <icinga-wm>	 ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly Hnowlan Awaiting discovery records being created https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:59:04] <wikibugs>	 (03PS3) 10Hnowlan: wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796)
[10:59:27] <wikibugs>	 (03CR) 10Clément Goubert: [C: 04-1] "jobrunner is active/passive iirc :)" [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[11:00:06] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1100)
[11:01:23] <wikibugs>	 (03PS4) 10Hnowlan: wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796)
[11:02:14] <wikibugs>	 (03CR) 10Hnowlan: wmnet: add mw-jobrunner discovery record (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[11:02:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[11:02:33] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] wmnet: add mw-jobrunner discovery record (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[11:03:38] <wikibugs>	 (03PS5) 10Hnowlan: wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796)
[11:04:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[11:06:20] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[11:06:46] <wikibugs>	 (03PS6) 10Hnowlan: wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796)
[11:09:03] <claime>	 q/19
[11:10:40] <icinga-wm>	 RECOVERY - Check systemd state on ganeti1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:11:36] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53716 and previous config saved to /var/cache/conftool/dbconfig/20231122-111136-arnaudb.json
[11:11:44] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53717 and previous config saved to /var/cache/conftool/dbconfig/20231122-111144-arnaudb.json
[11:13:56] <icinga-wm>	 PROBLEM - Check systemd state on titan1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:15:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway)
[11:16:54] <wikibugs>	 (03PS6) 10Hnowlan: changeprop: add config support for migration to k8s jobrunners [deployment-charts] - 10https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796)
[11:16:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] docker::reports: change ownership of base rebuild job [puppet] - 10https://gerrit.wikimedia.org/r/976198 (owner: 10Jbond)
[11:18:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse)
[11:21:50] <wikibugs>	 (03PS1) 10MVernon: hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976672 (https://phabricator.wikimedia.org/T317616)
[11:21:52] <wikibugs>	 (03PS1) 10MVernon: hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976673 (https://phabricator.wikimedia.org/T317616)
[11:21:54] <wikibugs>	 (03PS1) 10MVernon: hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976674 (https://phabricator.wikimedia.org/T317616)
[11:21:56] <wikibugs>	 (03PS1) 10MVernon: hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976675 (https://phabricator.wikimedia.org/T317616)
[11:21:58] <wikibugs>	 (03PS1) 10MVernon: hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976676 (https://phabricator.wikimedia.org/T317616)
[11:22:00] <wikibugs>	 (03PS1) 10MVernon: hiera: move final swift frontend to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976677 (https://phabricator.wikimedia.org/T317616)
[11:23:07] <hashar>	 I am going to restart Gerrit
[11:23:18] <arnaudb>	 😱
[11:23:23] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:23:36] <hashar>	 best case scenario it comes back after a couple of minutes
[11:23:50] <hashar>	 worst case scenario a handful of us cancel our plans for the next few days while we bring it back up
[11:23:52] <hashar>	 (kidding)
[11:24:11] <arnaudb>	 I'm off to eat in about 15min, so I hope your worst case scenario is gzip-able :D
[11:25:46] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976676 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:25:57] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976677 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:26:07] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976672 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:26:19] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976673 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:26:22] <wikibugs>	 (03CR) 10Arnaudb: [V: 03+1 C: 03+1] "I love that moss stayed!" [puppet] - 10https://gerrit.wikimedia.org/r/976676 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:26:31] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976674 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:26:33] <wikibugs>	 (03CR) 10Arnaudb: [V: 03+1 C: 03+1] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976675 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:26:37] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976675 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:26:39] <wikibugs>	 (03CR) 10Arnaudb: [V: 03+1 C: 03+1] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976674 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:26:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2192 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53718 and previous config saved to /var/cache/conftool/dbconfig/20231122-112641-arnaudb.json
[11:26:44] <wikibugs>	 (03CR) 10Arnaudb: [V: 03+1 C: 03+1] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976673 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:26:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53719 and previous config saved to /var/cache/conftool/dbconfig/20231122-112649-arnaudb.json
[11:26:49] <wikibugs>	 (03CR) 10Arnaudb: [V: 03+1 C: 03+1] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976672 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:26:57] <wikibugs>	 (03CR) 10Arnaudb: [V: 03+1 C: 03+1] hiera: move final swift frontend to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976677 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:29:02] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] prometheus: update to request testing certs from pki [puppet] - 10https://gerrit.wikimedia.org/r/976273 (https://phabricator.wikimedia.org/T351624) (owner: 10Jbond)
[11:30:54] <wikibugs>	 (03CR) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[11:31:44] <wikibugs>	 (03PS2) 10Hnowlan: api-gateway: bump envoy image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/976206 (https://phabricator.wikimedia.org/T324130)
[11:33:15] <wikibugs>	 (03PS3) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[11:33:42] <Emperor>	 !log depool ms-fe1012 to reimage with new envoy TLS setup T317616
[11:33:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::postgresql
[11:33:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:48] <stashbot>	 T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[11:34:05] <Emperor>	 !log depool ms-fe2012 to reimage with new envoy TLS setup T317616
[11:34:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:14] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976672 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[11:35:30] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1012.eqiad.wmnet with OS bullseye
[11:35:40] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1012.eqiad.wmnet with OS bullseye
[11:35:42] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2012.codfw.wmnet with OS bullseye
[11:35:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch analytics_cluster::postgresql to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976678 (https://phabricator.wikimedia.org/T349619)
[11:35:54] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2012.codfw.wmnet with OS bullseye
[11:36:36] <wikibugs>	 (03PS1) 10Jbond: Revert "prometheus: update to request testing certs from pki" [puppet] - 10https://gerrit.wikimedia.org/r/976574
[11:36:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::postgresql to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976678 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[11:36:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "prometheus: update to request testing certs from pki" [puppet] - 10https://gerrit.wikimedia.org/r/976574 (owner: 10Jbond)
[11:37:28] <wikibugs>	 (03PS1) 10Jbond: prometheus: update to request testing certs from pki [puppet] - 10https://gerrit.wikimedia.org/r/976575 (https://phabricator.wikimedia.org/T351624)
[11:38:20] <wikibugs>	 (03PS1) 10Hashar: gerrit: accept SIGINT as a valid exit code [puppet] - 10https://gerrit.wikimedia.org/r/976679
[11:42:35] <wikibugs>	 (03PS2) 10Jbond: prometheus: update to request testing certs from pki [puppet] - 10https://gerrit.wikimedia.org/r/976575 (https://phabricator.wikimedia.org/T351624)
[11:42:58] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] P:wmcs: wikireplicas: allow access from cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah)
[11:43:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::postgresql
[11:44:03] <wikibugs>	 (03CR) 10Jbond: "This did not solve the original issue but it didn't seem to break anything either. As such I think it still may be worth considering" [puppet] - 10https://gerrit.wikimedia.org/r/976575 (https://phabricator.wikimedia.org/T351624) (owner: 10Jbond)
[11:46:04] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[11:46:23] <wikibugs>	 10SRE, 10Observability-Logging, 10Patch-For-Review, 10User-fgiunchedi: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10jbond) >>! In T351624#9350064, @jbond wrote: > @fgiunchedi [[ https://g...
[11:47:32] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1012.eqiad.wmnet with reason: host reimage
[11:50:05] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1012.eqiad.wmnet with reason: host reimage
[11:50:10] <wikibugs>	 (03CR) 10Nikerabbit: [C: 03+1] cxserver: Force 127.0.0.1 instead of localhost (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/975740 (https://phabricator.wikimedia.org/T349118) (owner: 10KartikMistry)
[11:50:21] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2012.codfw.wmnet with reason: host reimage
[11:52:26] <wikibugs>	 (03CR) 10KartikMistry: cxserver: Force 127.0.0.1 instead of localhost (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/975740 (https://phabricator.wikimedia.org/T349118) (owner: 10KartikMistry)
[11:53:17] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2012.codfw.wmnet with reason: host reimage
[11:55:53] <wikibugs>	 (03CR) 10Hashar: "After a `systemctl restart gerrit` the journal marks a failure due to the JVM exiting with code 130:" [puppet] - 10https://gerrit.wikimedia.org/r/976679 (owner: 10Hashar)
[11:56:09] <hashar>	 !log Restarting Gerrit
[11:56:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:51] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir4001 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikipedia.com has 86348 seconds left https://wikitech.wikimedia.org/wiki/Ncredir
[12:00:59] <wikibugs>	 (03PS1) 10Btullis: Add another public endpoint to our matomo installation [puppet] - 10https://gerrit.wikimedia.org/r/976686 (https://phabricator.wikimedia.org/T349910)
[12:01:24] <wikibugs>	 (03PS14) 10Majavah: Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427)
[12:01:26] <wikibugs>	 (03PS1) 10Majavah: openstack: update wiki replica DNS to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/976688 (https://phabricator.wikimedia.org/T346947)
[12:01:56] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move 25% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T348122 (10Clement_Goubert)
[12:02:22] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/637/con" [puppet] - 10https://gerrit.wikimedia.org/r/976686 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis)
[12:03:18] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah)
[12:05:07] <wikibugs>	 (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 25% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/976689 (https://phabricator.wikimedia.org/T348122)
[12:05:25] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1012.eqiad.wmnet with OS bullseye
[12:05:38] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1012.eqiad.wmnet with OS bullseye completed: - ms-fe1012 (**PASS**...
[12:08:56] <wikibugs>	 (03CR) 10Brouberol: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/974647 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis)
[12:09:41] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1129 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:09:57] <wikibugs>	 (03CR) 10Brouberol: [C: 03+1] "Looks good! Should we remove all oozie-related jobs from refinery as well?" [puppet] - 10https://gerrit.wikimedia.org/r/974649 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis)
[12:09:59] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me. Sincere thanks for all of your work on this." [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah)
[12:10:11] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2012.codfw.wmnet with OS bullseye
[12:10:21] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2012.codfw.wmnet with OS bullseye completed: - ms-fe2012 (**PASS**...
[12:10:57] <icinga-wm>	 RECOVERY - Check systemd state on titan1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:00] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah)
[12:11:23] <wikibugs>	 (03PS1) 10Jbond: puppetserver - wmcs: add post-merge hook to WMCS puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/976690
[12:11:49] <wikibugs>	 (03CR) 10Brouberol: set druid hosts to use the reuse partman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976385 (https://phabricator.wikimedia.org/T332589) (owner: 10Stevemunene)
[12:12:56] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/976690 (owner: 10Jbond)
[12:14:38] <wikibugs>	 (03CR) 10Stevemunene: set druid hosts to use the reuse partman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976385 (https://phabricator.wikimedia.org/T332589) (owner: 10Stevemunene)
[12:15:08] <Emperor>	 !log repool ms-fe2012 with new envoy TLS setup T317616
[12:15:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:17] <stashbot>	 T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[12:15:30] <wikibugs>	 (03PS1) 10Hnowlan: mw-jobrunner: add vhost for jobrunner.discovery.wmnet [deployment-charts] - 10https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796)
[12:15:33] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "@jesses this relates to the git-sync-upstream in wmcs which" [puppet] - 10https://gerrit.wikimedia.org/r/976690 (owner: 10Jbond)
[12:15:52] <Emperor>	 !log repool ms-fe1012 with new envoy TLS setup T317616
[12:15:52] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudlb1002.eqiad.wmnet
[12:15:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:23] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:18:51] <wikibugs>	 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder)
[12:19:15] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:19:25] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:19:42] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hosts.reimage for host an-druid1001.eqiad.wmnet with OS bullseye
[12:20:48] <wikibugs>	 (03PS2) 10Hnowlan: mw-jobrunner: add vhost for jobrunner.discovery.wmnet [deployment-charts] - 10https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796)
[12:21:55] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:22:09] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:23:08] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] mw-jobrunner: add vhost for jobrunner.discovery.wmnet (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[12:24:39] <wikibugs>	 (03PS8) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565)
[12:24:41] <wikibugs>	 (03PS17) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565)
[12:24:41] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:24:43] <wikibugs>	 (03PS13) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565)
[12:24:45] <wikibugs>	 (03PS16) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565)
[12:24:47] <wikibugs>	 (03PS16) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565)
[12:24:49] <wikibugs>	 (03PS2) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565)
[12:25:05] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:25:57] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:26:19] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:26:41] <wikibugs>	 (03PS42) 10Jbond: prometheus: Add a default rsyslog destination for all sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse)
[12:27:19] <logmsgbot>	 !log taavi@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudlb1002.eqiad.wmnet
[12:27:45] <wikibugs>	 (03CR) 10Btullis: [V: 03+2 C: 03+2] spark: add support for spark-history on the spark image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896363 (https://phabricator.wikimedia.org/T330176) (owner: 10Nicolas Fraison)
[12:29:54] <wikibugs>	 (03PS3) 10Hnowlan: mw-jobrunner: add vhost for jobrunner.discovery.wmnet [deployment-charts] - 10https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796)
[12:30:30] <wikibugs>	 (03CR) 10Hnowlan: mw-jobrunner: add vhost for jobrunner.discovery.wmnet (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[12:30:40] <wikibugs>	 (03PS9) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565)
[12:30:42] <wikibugs>	 (03PS18) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565)
[12:30:44] <wikibugs>	 (03PS14) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565)
[12:30:46] <wikibugs>	 (03PS17) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565)
[12:30:48] <wikibugs>	 (03PS17) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565)
[12:30:50] <wikibugs>	 (03PS3) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565)
[12:31:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc-gp1001.eqiad.wmnet
[12:32:19] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1001.eqiad.wmnet with reason: host reimage
[12:32:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch mc-gp1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976694 (https://phabricator.wikimedia.org/T349619)
[12:35:23] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1001.eqiad.wmnet with reason: host reimage
[12:35:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch mc-gp1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976694 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:37:09] <wikibugs>	 (03Abandoned) 10Clément Goubert: mw-api-ext, mw-web: raise replicas for traffic bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/961341 (https://phabricator.wikimedia.org/T346422) (owner: 10Clément Goubert)
[12:38:26] <wikibugs>	 (03PS2) 10Clément Goubert: trafficserver: move 25% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/964449 (https://phabricator.wikimedia.org/T348122)
[12:38:51] <wikibugs>	 (03PS1) 10Majavah: cloudlb: wikireplicas: fix frontend filename [puppet] - 10https://gerrit.wikimedia.org/r/976695
[12:38:53] <wikibugs>	 (03PS1) 10Majavah: cloudlb: wikireplicas: fix go template syntax [puppet] - 10https://gerrit.wikimedia.org/r/976696
[12:39:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc-gp1001.eqiad.wmnet
[12:40:27] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/976696 (owner: 10Majavah)
[12:41:07] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] cloudlb: wikireplicas: fix frontend filename [puppet] - 10https://gerrit.wikimedia.org/r/976695 (owner: 10Majavah)
[12:41:30] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] cloudlb: wikireplicas: fix go template syntax [puppet] - 10https://gerrit.wikimedia.org/r/976696 (owner: 10Majavah)
[12:42:41] <wikibugs>	 (03PS19) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565)
[12:42:43] <wikibugs>	 (03PS15) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565)
[12:42:45] <wikibugs>	 (03PS18) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565)
[12:42:47] <wikibugs>	 (03PS18) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565)
[12:42:49] <wikibugs>	 (03PS4) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565)
[12:43:29] <wikibugs>	 (03Abandoned) 10JMeybohm: api-gateway: specify config path [deployment-charts] - 10https://gerrit.wikimedia.org/r/974609 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan)
[12:43:55] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] api-gateway: bump envoy image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/976206 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan)
[12:44:44] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976673 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[12:44:58] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Move mw appservers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/975228 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm)
[12:45:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[12:45:32] <Emperor>	 !log depool ms-fe2011 to reimage with new envoy TLS setup T317616
[12:45:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:36] <stashbot>	 T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[12:45:41] <Emperor>	 !log depool ms-fe1011 to reimage with new envoy TLS setup T317616
[12:45:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[12:46:03] <wikibugs>	 10SRE, 10Observability-Logging, 10Patch-For-Review, 10User-fgiunchedi: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (10jbond) I have refreshed [[ https://gerrit.wikimedia.org/r/c/operations/...
[12:46:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[12:46:21] <Emperor>	 jayme: LMK when you're done running puppet-merge?
[12:46:35] <jayme>	 Emperor: done
[12:46:38] <Emperor>	 ta
[12:46:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[12:47:34] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1011.eqiad.wmnet with OS bullseye
[12:47:44] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1011.eqiad.wmnet with OS bullseye
[12:47:50] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2011.codfw.wmnet with OS bullseye
[12:48:00] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2011.codfw.wmnet with OS bullseye
[12:48:06] <wikibugs>	 (03PS43) 10Jbond: prometheus: Add a default rsyslog destination for all sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse)
[12:48:10] <wikibugs>	 (03PS10) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565)
[12:48:14] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudlb1001.eqiad.wmnet
[12:48:14] <wikibugs>	 (03PS20) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565)
[12:48:22] <wikibugs>	 (03PS16) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565)
[12:48:26] <wikibugs>	 (03PS19) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565)
[12:48:30] <wikibugs>	 (03PS19) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565)
[12:48:34] <wikibugs>	 (03PS5) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565)
[12:49:30] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw2420.codfw.wmnet
[12:50:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host logstash2001.codfw.wmnet
[12:51:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[12:51:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[12:51:49] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw2421.codfw.wmnet
[12:52:26] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw2425.codfw.wmnet
[12:52:34] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw2431.codfw.wmnet
[12:52:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[12:53:03] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch logstash2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976699 (https://phabricator.wikimedia.org/T349619)
[12:53:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[12:53:33] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1001.eqiad.wmnet with OS bullseye
[12:54:36] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-web, mw-api-ext: Raise replicas for 25% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/976689 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert)
[12:54:37] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw1472.eqiad.wmnet
[12:54:38] <wikibugs>	 (03PS44) 10Jbond: prometheus: Add a default rsyslog destination for all sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse)
[12:54:40] <wikibugs>	 (03PS11) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565)
[12:54:42] <wikibugs>	 (03PS21) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565)
[12:54:44] <wikibugs>	 (03PS17) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565)
[12:54:46] <wikibugs>	 (03PS20) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565)
[12:54:46] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw1473.eqiad.wmnet
[12:54:48] <wikibugs>	 (03PS20) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565)
[12:54:50] <wikibugs>	 (03PS6) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565)
[12:54:58] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-web, mw-api-ext: Raise replicas for 25% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/976689 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert)
[12:55:25] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw1475.eqiad.wmnet
[12:55:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[12:55:46] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.puppet.migrate-host for host mw1474.eqiad.wmnet
[12:56:01] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[12:56:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch logstash2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976699 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:57:17] <wikibugs>	 (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 25% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/976689 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert)
[12:57:34] <wikibugs>	 (03PS22) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565)
[12:57:36] <wikibugs>	 (03PS18) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565)
[12:57:38] <wikibugs>	 (03PS21) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565)
[12:57:40] <wikibugs>	 (03PS21) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565)
[12:57:42] <wikibugs>	 (03PS7) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565)
[12:58:01] <wikibugs>	 (03PS1) 10Btullis: airflow: change max_active_runs_per_dag back to 1 [puppet] - 10https://gerrit.wikimedia.org/r/976700 (https://phabricator.wikimedia.org/T351388)
[12:58:33] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw2420.codfw.wmnet
[12:59:05] <claime>	 !log Raising mw-web and mw-api-ext replicas for traffic bump - T348122
[12:59:08] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1001.eqiad.wmnet
[12:59:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:10] <stashbot>	 T348122: Move 25% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T348122
[12:59:17] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[12:59:31] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1011.eqiad.wmnet with reason: host reimage
[12:59:39] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:59:41] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/644/console" [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[13:00:28] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/645/con" [puppet] - 10https://gerrit.wikimedia.org/r/976700 (https://phabricator.wikimedia.org/T351388) (owner: 10Btullis)
[13:00:29] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[13:00:36] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[13:00:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host logstash2001.codfw.wmnet
[13:01:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[13:01:23] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[13:01:31] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[13:01:33] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:01:42] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[13:01:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host logstash2023.codfw.wmnet
[13:01:48] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[13:01:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[13:02:12] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2011.codfw.wmnet with reason: host reimage
[13:02:30] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1011.eqiad.wmnet with reason: host reimage
[13:02:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch logstash2023 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976701 (https://phabricator.wikimedia.org/T349619)
[13:05:04] <wikibugs>	 (03PS45) 10Jbond: prometheus: Add a default rsyslog destination for all sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse)
[13:05:06] <wikibugs>	 (03PS12) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565)
[13:05:08] <wikibugs>	 (03PS23) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565)
[13:05:10] <wikibugs>	 (03PS19) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565)
[13:05:12] <wikibugs>	 (03PS22) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565)
[13:05:13] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2011.codfw.wmnet with reason: host reimage
[13:05:14] <wikibugs>	 (03PS22) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565)
[13:05:16] <wikibugs>	 (03PS8) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565)
[13:07:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch logstash2023 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976701 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[13:12:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host logstash2023.codfw.wmnet
[13:13:30] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2017 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:16:42] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2017 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:17:01] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] puppetserver - wmcs: add post-merge hook to WMCS puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/976690 (owner: 10Jbond)
[13:17:25] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1011.eqiad.wmnet with OS bullseye
[13:17:35] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1011.eqiad.wmnet with OS bullseye completed: - ms-fe1011 (**PASS**...
[13:19:09] <wikibugs>	 (03PS1) 10Brouberol: Define the spark-history/spark-history-test k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713)
[13:21:23] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2011.codfw.wmnet with OS bullseye
[13:21:33] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2011.codfw.wmnet with OS bullseye completed: - ms-fe2011 (**PASS**...
[13:22:51] <Emperor>	 !log repool ms-fe1011 with new envoy TLS setup T317616
[13:22:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:55] <stashbot>	 T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[13:23:44] <Emperor>	 !log repool ms-fe2011 with new envoy TLS setup T317616
[13:23:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:01] <wikibugs>	 (03PS2) 10Brouberol: Define the spark-history/spark-history-test k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713)
[13:24:20] <wikibugs>	 (03PS1) 10JMeybohm: sre.hosts.reimage: Allow to skip puppet migration [cookbooks] - 10https://gerrit.wikimedia.org/r/976732
[13:24:36] <wikibugs>	 (03PS2) 10JMeybohm: sre.hosts.reimage: Allow to skip puppet migration [cookbooks] - 10https://gerrit.wikimedia.org/r/976732 (https://phabricator.wikimedia.org/T351074)
[13:25:08] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1129 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:25:28] <wikibugs>	 (03PS3) 10Brouberol: Define the spark-history/spark-history-test k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713)
[13:27:08] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976674 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[13:27:42] <Emperor>	 !log depool ms-fe1010 to reimage with new envoy TLS setup T317616
[13:27:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:53] <Emperor>	 !log depool ms-fe2010 to reimage with new envoy TLS setup T317616
[13:27:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:05] <stashbot>	 T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[13:28:38] <icinga-wm>	 PROBLEM - Check systemd state on ganeti1032 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:29:21] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1010.eqiad.wmnet with OS bullseye
[13:29:33] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1010.eqiad.wmnet with OS bullseye
[13:29:36] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2010.codfw.wmnet with OS bullseye
[13:29:47] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2010.codfw.wmnet with OS bullseye
[13:30:40] <wikibugs>	 (03PS1) 10Brouberol: Setup kubeconfigs for spark-history/spark-history-test on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976733 (https://phabricator.wikimedia.org/T351711)
[13:37:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T348183)', diff saved to https://phabricator.wikimedia.org/P53722 and previous config saved to /var/cache/conftool/dbconfig/20231122-133741-arnaudb.json
[13:37:47] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[13:38:46] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2006 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:41:22] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1010.eqiad.wmnet with reason: host reimage
[13:42:47] <wikibugs>	 (03PS1) 10Majavah: interface: new define for managing routing table names [puppet] - 10https://gerrit.wikimedia.org/r/976734
[13:42:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host logstash2023.codfw.wmnet
[13:43:36] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2010.codfw.wmnet with reason: host reimage
[13:44:06] <wikibugs>	 (03PS1) 10Majavah: P:etcd: generate wiki replica pool accounts [puppet] - 10https://gerrit.wikimedia.org/r/976735 (https://phabricator.wikimedia.org/T346947)
[13:44:06] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1010.eqiad.wmnet with reason: host reimage
[13:44:09] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/646/con" [puppet] - 10https://gerrit.wikimedia.org/r/976734 (owner: 10Majavah)
[13:45:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] interface: new define for managing routing table names [puppet] - 10https://gerrit.wikimedia.org/r/976734 (owner: 10Majavah)
[13:45:25] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/647/con" [puppet] - 10https://gerrit.wikimedia.org/r/976735 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah)
[13:46:06] <wikibugs>	 (03PS5) 10Volans: puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond)
[13:46:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/976686 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis)
[13:47:05] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2010.codfw.wmnet with reason: host reimage
[13:47:09] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Add another public endpoint to our matomo installation [puppet] - 10https://gerrit.wikimedia.org/r/976686 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis)
[13:47:12] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver - wmcs: add post-merge hook to WMCS puppetserver [puppet] - 10https://gerrit.wikimedia.org/r/976690 (owner: 10Jbond)
[13:47:14] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw2421.codfw.wmnet
[13:47:15] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/648/con" [puppet] - 10https://gerrit.wikimedia.org/r/976735 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah)
[13:47:19] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw2431.codfw.wmnet
[13:47:21] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw2425.codfw.wmnet
[13:47:24] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw1473.eqiad.wmnet
[13:47:26] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw1472.eqiad.wmnet
[13:47:53] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw1474.eqiad.wmnet
[13:47:56] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host mw1475.eqiad.wmnet
[13:48:09] <jbond>	 btullis: happy for me to merge your CR?
[13:48:23] <btullis>	 Yes please.
[13:48:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2023.codfw.wmnet
[13:49:08] <wikibugs>	 (03PS2) 10Majavah: interface: new define for managing routing table names [puppet] - 10https://gerrit.wikimedia.org/r/976734
[13:50:28] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/649/con" [puppet] - 10https://gerrit.wikimedia.org/r/976734 (owner: 10Majavah)
[13:51:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] interface: new define for managing routing table names [puppet] - 10https://gerrit.wikimedia.org/r/976734 (owner: 10Majavah)
[13:52:06] <wikibugs>	 (03CR) 10Jbond: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[13:52:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host logstash2001.codfw.wmnet
[13:52:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P53723 and previous config saved to /var/cache/conftool/dbconfig/20231122-135248-arnaudb.json
[13:53:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond)
[13:54:11] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/650/con" [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[13:56:34] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/651/con" [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[13:56:49] <wikibugs>	 (03CR) 10BBlack: "On the topic of ferm::service changes: IMHO, this isn't the place to do those refactors/upgrades of the existing ferm puppetization.  That" [puppet] - 10https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[13:57:55] <wikibugs>	 (03CR) 10Jbond: "I think we probably want to do this fleet wide, or consider if we want to for some time have one central log with the new certs and one " [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[13:57:58] <wikibugs>	 (03PS1) 10Majavah: cloudlb: explicitely bind openstack mysql to ip [puppet] - 10https://gerrit.wikimedia.org/r/976736
[13:58:58] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1010.eqiad.wmnet with OS bullseye
[13:59:07] <wikibugs>	 (03PS1) 10Vgutierrez: lvs: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069)
[13:59:08] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1010.eqiad.wmnet with OS bullseye completed: - ms-fe1010 (**PASS**...
[13:59:17] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/976736 (owner: 10Majavah)
[14:00:00] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: don't use srv records [puppet] - 10https://gerrit.wikimedia.org/r/976738
[14:00:02] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: add pontoon log bullseye [puppet] - 10https://gerrit.wikimedia.org/r/976739
[14:00:05] <wikibugs>	 (03PS1) 10Filippo Giunchedi: varnishkafka: move to rsyslog::conf [puppet] - 10https://gerrit.wikimedia.org/r/976740 (https://phabricator.wikimedia.org/T351799)
[14:00:06] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1400).
[14:00:06] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:00:07] <wikibugs>	 (03PS1) 10Filippo Giunchedi: rsyslog: support alternative base in ::conf [puppet] - 10https://gerrit.wikimedia.org/r/976741 (https://phabricator.wikimedia.org/T351799)
[14:00:09] <wikibugs>	 (03PS1) 10Filippo Giunchedi: rsyslog_exporter: move to a define [puppet] - 10https://gerrit.wikimedia.org/r/976742 (https://phabricator.wikimedia.org/T351799)
[14:00:11] <wikibugs>	 (03PS1) 10Filippo Giunchedi: WIP separate receiver rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/976743 (https://phabricator.wikimedia.org/T351799)
[14:00:12] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: fetch from rsyslog-receiver exporter [puppet] - 10https://gerrit.wikimedia.org/r/976744 (https://phabricator.wikimedia.org/T351799)
[14:00:28] <godog>	 standing by for jenkins -1s
[14:00:54] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "I'm not sure it works as-is, that's also inherited by PuppetMaster so it should work there too or we should override it and raise" [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond)
[14:01:24] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713) (owner: 10Brouberol)
[14:01:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2001.codfw.wmnet
[14:02:59] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2010.codfw.wmnet with OS bullseye
[14:03:13] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2010.codfw.wmnet with OS bullseye completed: - ms-fe2010 (**PASS**...
[14:06:22] <wikibugs>	 (03PS1) 10Jbond: puppet:agent: change error to warning [puppet] - 10https://gerrit.wikimedia.org/r/976745
[14:06:59] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[14:07:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P53724 and previous config saved to /var/cache/conftool/dbconfig/20231122-140754-arnaudb.json
[14:08:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: "See inline, idea LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/976575 (https://phabricator.wikimedia.org/T351624) (owner: 10Jbond)
[14:08:45] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good, one typo inline" [puppet] - 10https://gerrit.wikimedia.org/r/976745 (owner: 10Jbond)
[14:09:27] <wikibugs>	 (03CR) 10Btullis: "You could add a PCC run for `Hosts: P:kubernetes::deployment_server` or similar." [puppet] - 10https://gerrit.wikimedia.org/r/976733 (https://phabricator.wikimedia.org/T351711) (owner: 10Brouberol)
[14:09:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[14:09:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:10:11] <wikibugs>	 (03PS2) 10Brouberol: Setup kubeconfigs for spark-history/spark-history-test on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/976733 (https://phabricator.wikimedia.org/T351711)
[14:10:19] <wikibugs>	 (03PS2) 10Jbond: puppet:agent: change error to warning [puppet] - 10https://gerrit.wikimedia.org/r/976745
[14:10:30] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976733 (https://phabricator.wikimedia.org/T351711) (owner: 10Brouberol)
[14:11:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[14:11:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[14:11:57] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: bump envoy image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/976206 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan)
[14:12:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond)
[14:12:20] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me in principle, but I haven't ever touched this code before." [puppet] - 10https://gerrit.wikimedia.org/r/976735 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah)
[14:12:22] <wikibugs>	 (03CR) 10Volans: [C: 04-1] puppet: add hiera_lookup function (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond)
[14:12:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host sessionstore2001.codfw.wmnet
[14:12:53] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: bump envoy image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/976206 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan)
[14:14:16] <Emperor>	 !log repool ms-fe1010 with new envoy TLS setup T317616
[14:14:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:23] <stashbot>	 T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[14:14:35] <wikibugs>	 (03PS3) 10Jbond: puppet:agent: change error to warning [puppet] - 10https://gerrit.wikimedia.org/r/976745
[14:14:52] <Emperor>	 !log repool ms-fe2010 with new envoy TLS setup T317616
[14:14:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:19] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] hiera: move two more swift frontends to envoy [puppet] - 10https://gerrit.wikimedia.org/r/976675 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[14:18:46] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch sessionstore2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/976747 (https://phabricator.wikimedia.org/T349619)
[14:19:03] <Emperor>	 !log depool ms-fe1009 to reimage with new envoy TLS setup T317616
[14:19:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:12] <Emperor>	 !log depool ms-fe2009 to reimage with new envoy TLS setup T317616
[14:19:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:32] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2001.codfw.wmnet
[14:19:32] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs2001.codfw.wmnet
[14:19:51] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2001.codfw.wmnet with OS bullseye
[14:20:10] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1087 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:20:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1087 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:20:39] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1009.eqiad.wmnet with OS bullseye
[14:20:46] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch sessionstore2001 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976747 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[14:20:49] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1009.eqiad.wmnet with OS bullseye
[14:21:03] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2009.codfw.wmnet with OS bullseye
[14:21:13] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2009.codfw.wmnet with OS bullseye
[14:21:14] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1106 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:21:34] <wikibugs>	 (CR) Jbond: [C: +2] puppet:agent: change error to warning [puppet] - https://gerrit.wikimedia.org/r/976745 (owner: Jbond)
[14:21:38] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:21:38] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:21:42] <wikibugs>	 (CR) Jbond: [C: +2] puppet:agent: change error to warning (1 comment) [puppet] - https://gerrit.wikimedia.org/r/976745 (owner: Jbond)
[14:21:54] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1141 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:21:56] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1141 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:28] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1121 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:51] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] pontoon: don't use srv records [puppet] - https://gerrit.wikimedia.org/r/976738 (owner: Filippo Giunchedi)
[14:22:54] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:58] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] pontoon: add pontoon log bullseye [puppet] - https://gerrit.wikimedia.org/r/976739 (owner: Filippo Giunchedi)
[14:23:01] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T348183)', diff saved to https://phabricator.wikimedia.org/P53725 and previous config saved to /var/cache/conftool/dbconfig/20231122-142301-arnaudb.json
[14:23:03] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[14:23:06] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[14:23:07] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[14:23:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:23:12] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T348183)', diff saved to https://phabricator.wikimedia.org/P53726 and previous config saved to /var/cache/conftool/dbconfig/20231122-142312-arnaudb.json
[14:23:15] <godog>	 jbond: I'll merge your patch too
[14:24:16] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:24:34] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:24:42] <jbond>	 godog: please
[14:24:45] <jbond>	 thanks
[14:24:59] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:25:12] <wikibugs>	 ops-eqiad, DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (MatthewVernon) I hit this problem when re-imaging `ms-fe*` nodes (for T317616). Most of them PXE booted fine, but two didn't - ms-fe2014.codfw.wmnet needed one further reboot (which I...
[14:25:30] <wikibugs>	 SRE-swift-storage, ops-codfw, ops-eqiad, DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (MatthewVernon)
[14:25:33] <wikibugs>	 (PS6) Jbond: puppet: add hiera_lookup function [software/spicerack] - https://gerrit.wikimedia.org/r/972459
[14:25:58] <wikibugs>	 (CR) Jbond: "thanks updated" [software/spicerack] - https://gerrit.wikimedia.org/r/972459 (owner: Jbond)
[14:26:00] <icinga-wm>	 RECOVERY - Check systemd state on ganeti1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:26:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host sessionstore2001.codfw.wmnet
[14:26:48] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/653/con" [puppet] - https://gerrit.wikimedia.org/r/976740 (https://phabricator.wikimedia.org/T351799) (owner: Filippo Giunchedi)
[14:27:22] <wikibugs>	 (Abandoned) Jgiannelos: changeprop: Disable restbase pregeneration on PCS [deployment-charts] - https://gerrit.wikimedia.org/r/840097 (https://phabricator.wikimedia.org/T319365) (owner: Jgiannelos)
[14:27:44] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "See PCC, this is a noop" [puppet] - https://gerrit.wikimedia.org/r/976740 (https://phabricator.wikimedia.org/T351799) (owner: Filippo Giunchedi)
[14:28:01] <wikibugs>	 (PS7) Jbond: puppet: add hiera_lookup function [software/spicerack] - https://gerrit.wikimedia.org/r/972459
[14:29:39] <wikibugs>	 (PS8) Volans: puppet: add hiera_lookup function [software/spicerack] - https://gerrit.wikimedia.org/r/972459 (owner: Jbond)
[14:29:51] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: update docker images to latest versions [deployment-charts] - https://gerrit.wikimedia.org/r/976748 (https://phabricator.wikimedia.org/T347551)
[14:29:55] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1087 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:30:07] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:30:31] <wikibugs>	 (CR) Jbond: puppet: add hiera_lookup function (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/972459 (owner: Jbond)
[14:30:59] <urandom>	 !log restarting Cassandra, sessionstore2001 (post-Puppet 7 migration)
[14:31:02] <wikibugs>	 (PS2) Filippo Giunchedi: varnishkafka: move to rsyslog::conf [puppet] - https://gerrit.wikimedia.org/r/976740 (https://phabricator.wikimedia.org/T351799)
[14:31:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:04] <wikibugs>	 (PS2) Filippo Giunchedi: rsyslog: support alternative base in ::conf [puppet] - https://gerrit.wikimedia.org/r/976741 (https://phabricator.wikimedia.org/T351799)
[14:31:06] <wikibugs>	 (PS2) Filippo Giunchedi: rsyslog_exporter: move to a define [puppet] - https://gerrit.wikimedia.org/r/976742 (https://phabricator.wikimedia.org/T351799)
[14:31:08] <wikibugs>	 (PS2) Filippo Giunchedi: rsyslog: ship a separate 'receiver' instance [puppet] - https://gerrit.wikimedia.org/r/976743 (https://phabricator.wikimedia.org/T351799)
[14:31:10] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: fetch from rsyslog-receiver exporter [puppet] - https://gerrit.wikimedia.org/r/976744 (https://phabricator.wikimedia.org/T351799)
[14:31:43] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[14:32:01] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1119 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:21] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1112 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:28] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1009.eqiad.wmnet with reason: host reimage
[14:32:29] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:32:37] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:32:37] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:40] <claime>	 jouncebot: nowandnext
[14:32:40] <jouncebot>	 For the next 0 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1400)
[14:32:40] <jouncebot>	 In 0 hour(s) and 27 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1500)
[14:33:05] <wikibugs>	 (Abandoned) Ilias Sarantopoulos: ml-services: rollback xgboost/catboost models to kserve 0.10 [deployment-charts] - https://gerrit.wikimedia.org/r/975205 (https://phabricator.wikimedia.org/T347551) (owner: Ilias Sarantopoulos)
[14:33:46] <wikibugs>	 (CR) Clément Goubert: [C: +2] trafficserver: move 25% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/964449 (https://phabricator.wikimedia.org/T348122) (owner: Clément Goubert)
[14:34:07] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:34:10] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2001.codfw.wmnet with reason: host reimage
[14:34:40] <fabfur>	 !log start re-provisioning and re-imaging cp1113 to fix wrong subnet (T342159)
[14:34:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:44] <stashbot>	 T342159: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159
[14:35:00] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2009.codfw.wmnet with reason: host reimage
[14:35:13] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-fe2009.codfw.wmnet with reason: host reimage
[14:35:32] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1009.eqiad.wmnet with reason: host reimage
[14:36:18] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (MatthewVernon) Stalled→In progress
[14:37:20] <wikibugs>	 (CR) CI reject: [V: -1] puppet: add hiera_lookup function [software/spicerack] - https://gerrit.wikimedia.org/r/972459 (owner: Jbond)
[14:38:03] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2001.codfw.wmnet with reason: host reimage
[14:38:19] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:38:21] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1141 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:38:23] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:31] <wikibugs>	 (PS9) Volans: puppet: add hiera_lookup function [software/spicerack] - https://gerrit.wikimedia.org/r/972459 (owner: Jbond)
[14:39:59] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:41:07] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:41:24] <wikibugs>	 (Abandoned) Jgiannelos: tegola: Enable structured logging [deployment-charts] - https://gerrit.wikimedia.org/r/963289 (https://phabricator.wikimedia.org/T347717) (owner: Jgiannelos)
[14:41:55] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[14:42:09] <icinga-wm>	 PROBLEM - MD RAID on ms-fe2009 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.139: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:43:23] <icinga-wm>	 PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:43:53] <wikibugs>	 (PS1) Ayounsi: Expose Netbox's BGP servers to Homer [software/homer/deploy] - https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649)
[14:44:15] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:44:45] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:44:51] <wikibugs>	 (Abandoned) Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[14:47:12] <wikibugs>	 (CR) Jbond: [C: +2] puppet: add hiera_lookup function [software/spicerack] - https://gerrit.wikimedia.org/r/972459 (owner: Jbond)
[14:47:21] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: fetch-rings-codfw.service,fetch-rings-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:47:35] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/972459 (owner: Jbond)
[14:48:13] <icinga-wm>	 PROBLEM - Host ms-fe2009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:48:37] <wikibugs>	 (PS1) Btullis: Fix Matomo TagManager functionality [puppet] - https://gerrit.wikimedia.org/r/976750 (https://phabricator.wikimedia.org/T349910)
[14:49:05] <wikibugs>	 (CR) CI reject: [V: -1] Fix Matomo TagManager functionality [puppet] - https://gerrit.wikimedia.org/r/976750 (https://phabricator.wikimedia.org/T349910) (owner: Btullis)
[14:49:45] <icinga-wm>	 RECOVERY - Host ms-fe2009 is UP: PING OK - Packet loss = 0%, RTA = 31.86 ms
[14:49:59] <jinxer-wm>	 (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:50:02] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/654/con" [puppet] - https://gerrit.wikimedia.org/r/976750 (https://phabricator.wikimedia.org/T349910) (owner: Btullis)
[14:50:45] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts cp1113.eqiad.wmnet
[14:51:50] <wikibugs>	 SRE-swift-storage, ops-codfw, ops-eqiad, DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (ayounsi) Can you try this? T348119#9224341  Fun fact, I found that task on Google after starting to look for that specific Broadcom PXE string.
[14:52:19] <icinga-wm>	 RECOVERY - MD RAID on ms-fe2009 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:52:49] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:53:07] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:53:13] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:53:21] <icinga-wm>	 PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: fetch-rings-codfw.service,fetch-rings-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:53:23] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:53:45] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:54:09] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:54:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1119 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:31] <wikibugs>	 (Merged) jenkins-bot: puppet: add hiera_lookup function [software/spicerack] - https://gerrit.wikimedia.org/r/972459 (owner: Jbond)
[14:54:47] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:54:47] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1009.eqiad.wmnet with OS bullseye
[14:54:58] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1009.eqiad.wmnet with OS bullseye completed: - ms-fe1009 (**PASS**...
[14:55:14] <wikibugs>	 (CR) Jbond: "@andrea, you may have noticed that i have based a change set of min on top of yours.  I plan to merge that change set on Tuesday with fili" [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[14:56:04] <wikibugs>	 (PS2) Btullis: Fix Matomo TagManager functionality [puppet] - https://gerrit.wikimedia.org/r/976750 (https://phabricator.wikimedia.org/T349910)
[14:56:11] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:38] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2001.codfw.wmnet with OS bullseye
[14:56:43] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:57:07] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2009.codfw.wmnet with OS bullseye
[14:57:17] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2009.codfw.wmnet with OS bullseye completed: - ms-fe2009 (**WARN**...
[14:57:44] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2002.codfw.wmnet with OS bullseye
[14:58:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:58:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:58:48] <wikibugs>	 (PS1) Volans: CHANGELOG: add changelogs for release v8.2.0 [software/spicerack] - https://gerrit.wikimedia.org/r/976752
[14:59:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host wdqs2008.codfw.wmnet
[14:59:43] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1106 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:59:43] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:59:49] <Emperor>	 !log repool ms-fe2009 with new envoy TLS setup T317616
[14:59:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:53] <stashbot>	 T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[15:00:02] <Emperor>	 !log repool ms-fe1009 with new envoy TLS setup T317616
[15:00:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:07] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1500)
[15:00:07] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1147 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:17] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1122 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:00:21] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1122 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:37] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1151 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:40] <jayme>	 !log uncordoned and repooled kubernetes1013
[15:00:41] <wikibugs>	 (PS1) Muehlenhoff: Switch wdqs2008 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976753 (https://phabricator.wikimedia.org/T349619)
[15:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:05] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1107 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:34] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.dns.netbox
[15:01:42] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/976740 (https://phabricator.wikimedia.org/T351799) (owner: Filippo Giunchedi)
[15:01:43] <wikibugs>	 SRE-swift-storage, ops-codfw, ops-eqiad, DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (ssingh) @ayounsi: Just as another data point, I did check this (twice for many cp hosts) and all had the correct boot order. Someone should confir...
[15:01:45] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:02:16] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch wdqs2008 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976753 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[15:02:46] <Emperor>	 !log depool moss-fe2001 to reimage with new envoy TLS setup T317616
[15:02:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:52] <wikibugs>	 (CR) JHathaway: [V: +2] rsync: ensure daemon is started after config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: JHathaway)
[15:02:55] <wikibugs>	 (CR) JHathaway: [V: +2 C: +2] rsync: ensure daemon is started after config [puppet] - https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: JHathaway)
[15:02:55] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1143 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:03] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1143 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:03:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:03:11] <Emperor>	 !log depool moss-fe1001 to reimage with new envoy TLS setup T317616
[15:03:13] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:03:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:20] <wikibugs>	 (CR) MVernon: [C: +2] hiera: move two more swift frontends to envoy [puppet] - https://gerrit.wikimedia.org/r/976676 (https://phabricator.wikimedia.org/T317616) (owner: MVernon)
[15:03:57] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1150 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:09] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp1113.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001"
[15:04:35] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-fe1001.eqiad.wmnet with OS bullseye
[15:04:42] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM (comments are just fyi's)" [puppet] - https://gerrit.wikimedia.org/r/976741 (https://phabricator.wikimedia.org/T351799) (owner: Filippo Giunchedi)
[15:04:45] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-fe1001.eqiad.wmnet with OS bullseye
[15:04:51] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2001.codfw.wmnet with OS bullseye
[15:05:02] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye
[15:05:13] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:13] <icinga-wm>	 RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:29] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1122 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:05:33] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:35] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:05:45] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:53] <wikibugs>	 (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v8.2.0 [software/spicerack] - https://gerrit.wikimedia.org/r/976752 (owner: Volans)
[15:06:29] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[15:06:34] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/976742 (https://phabricator.wikimedia.org/T351799) (owner: Filippo Giunchedi)
[15:06:54] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp1113.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001"
[15:06:55] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:06:55] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cp1113.eqiad.wmnet
[15:06:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host wdqs2008.codfw.wmnet
[15:07:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `cp1113.eqiad.wmnet` - cp1113.eqiad.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanag...
[15:08:05] <wikibugs>	 (PS2) Dr0ptp4kt: wikireplicas: add user_is_temp column to user view [puppet] - https://gerrit.wikimedia.org/r/958543 (https://phabricator.wikimedia.org/T346679) (owner: Milimetric)
[15:08:35] <wikibugs>	 (Abandoned) JMeybohm: sre.hosts.reimage: Allow to skip puppet migration [cookbooks] - https://gerrit.wikimedia.org/r/976732 (https://phabricator.wikimedia.org/T351074) (owner: JMeybohm)
[15:08:42] <wikibugs>	 (CR) Jbond: [C: +1] "puppet wise lgtm ill leave it for someone else to review the rsyslog stuff" [puppet] - https://gerrit.wikimedia.org/r/976743 (https://phabricator.wikimedia.org/T351799) (owner: Filippo Giunchedi)
[15:08:51] <wikibugs>	 SRE, ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[15:08:53] <wikibugs>	 SRE, Cloud-VPS, observability, Patch-For-Review, SRE Observability (FY2023/2024-Q2): ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (lmata)
[15:09:37] <wikibugs>	 (CR) Btullis: [C: +1] wikireplicas: add user_is_temp column to user view [puppet] - https://gerrit.wikimedia.org/r/958543 (https://phabricator.wikimedia.org/T346679) (owner: Milimetric)
[15:09:37] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1107 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:10:24] <wikibugs>	 (CR) Majavah: [C: +2] wikireplicas: add user_is_temp column to user view [puppet] - https://gerrit.wikimedia.org/r/958543 (https://phabricator.wikimedia.org/T346679) (owner: Milimetric)
[15:11:55] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2002.codfw.wmnet with reason: host reimage
[15:12:38] <wikibugs>	 (PS1) Hnowlan: api-gateway: use enovy.yaml in place of config.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/976757 (https://phabricator.wikimedia.org/T300033)
[15:12:41] <wikibugs>	 (CR) Kamila Součková: [C: +2] mobileapps: 20% to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/976219 (https://phabricator.wikimedia.org/T350846) (owner: Giuseppe Lavagetto)
[15:13:18] <wikibugs>	 (Merged) jenkins-bot: CHANGELOG: add changelogs for release v8.2.0 [software/spicerack] - https://gerrit.wikimedia.org/r/976752 (owner: Volans)
[15:14:29] <wikibugs>	 (PS1) Volans: Upstream release v8.2.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/976758
[15:14:38] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2002.codfw.wmnet with reason: host reimage
[15:15:11] <wikibugs>	 (CR) JMeybohm: [C: -1] api-gateway: use enovy.yaml in place of config.yaml (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/976757 (https://phabricator.wikimedia.org/T300033) (owner: Hnowlan)
[15:15:21] <moritzm>	 !log installing python3.7 security updates
[15:15:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:27] <wikibugs>	 (PS2) Kamila Součková: mobileapps: 20% to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/976219 (https://phabricator.wikimedia.org/T350846) (owner: Giuseppe Lavagetto)
[15:16:13] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:29] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:17:25] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: host reimage
[15:17:31] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:18:57] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:19:27] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage
[15:19:35] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:20:19] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: host reimage
[15:22:06] <wikibugs>	 (CR) Joal: [C: +1] "Thanks Ben :)" [puppet] - https://gerrit.wikimedia.org/r/976700 (https://phabricator.wikimedia.org/T351388) (owner: Btullis)
[15:22:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:22:14] <wikibugs>	 (CR) Hashar: "> $ sudo journalctl -u gerrit|grep systemd.*exited" [puppet] - https://gerrit.wikimedia.org/r/976679 (owner: Hashar)
[15:22:51] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1120 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:22:53] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1147 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:23:02] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage
[15:23:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:23:24] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:23:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:23:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1143 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:23:30] <wikibugs>	 SRE-swift-storage, ops-codfw, ops-eqiad, DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (MatthewVernon) I can try if/when I get another one that fails (I'd be surprised if that were the solution, given "enough reboots" seems to have wo...
[15:24:30] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (MatthewVernon)
[15:24:51] <taavi>	 Emperor: ping, puppet-merge is stuck on your patch, see -sre
[15:25:54] <wikibugs>	 (PS1) Majavah: Revert "hiera: move two more swift frontends to envoy" [puppet] - https://gerrit.wikimedia.org/r/976775
[15:26:40] <wikibugs>	 (CR) JHathaway: [C: +1] Revert "hiera: move two more swift frontends to envoy" [puppet] - https://gerrit.wikimedia.org/r/976775 (owner: Majavah)
[15:26:59] <wikibugs>	 (CR) Volans: [C: +2] Upstream release v8.2.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/976758 (owner: Volans)
[15:27:57] <wikibugs>	 (Abandoned) Majavah: Revert "hiera: move two more swift frontends to envoy" [puppet] - https://gerrit.wikimedia.org/r/976775 (owner: Majavah)
[15:28:16] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe2001.codfw.wmnet with OS bullseye
[15:28:26] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye executed with errors: - moss-f...
[15:28:28] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2001.codfw.wmnet with OS bullseye
[15:28:29] <logmsgbot>	 !log mvernon@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host moss-fe1001.eqiad.wmnet with OS bullseye
[15:28:39] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-fe1001.eqiad.wmnet with OS bullseye
[15:28:40] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye
[15:28:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[15:28:48] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-fe1001.eqiad.wmnet with OS bullseye executed with errors: - moss-f...
[15:28:59] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-fe1001.eqiad.wmnet with OS bullseye
[15:29:43] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:03] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:30:06] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: backup::databases
[15:30:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  an-worker1175 - jclark@cumin1001"
[15:31:21] <wikibugs>	 (PS1) Jbond: backup::databases: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976761 (https://phabricator.wikimedia.org/T349619)
[15:31:49] <wikibugs>	 (CR) Jbond: [C: +2] backup::databases: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976761 (https://phabricator.wikimedia.org/T349619) (owner: Jbond)
[15:31:49] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  an-worker1175 - jclark@cumin1001"
[15:31:49] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:32:21] <wikibugs>	 (PS10) Vgutierrez: profile: Provide a lvs::realserver::ipip profile [puppet] - https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069)
[15:33:03] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:33:03] <wikibugs>	 (CR) Vgutierrez: "adding Filippo to get his take on the prometheus::ops stuff :)" [puppet] - https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[15:33:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:33:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host testreduce1002.eqiad.wmnet
[15:34:15] <wikibugs>	 (CR) BBlack: [C: +1] "LGTM fundamentally, but it's hard to know the outcome in these cases until we try on a real host!" [puppet] - https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[15:34:20] <Amir1>	 going to make an alter table in s8 in cloud replicas
[15:34:24] <wikibugs>	 (CR) BBlack: [C: +1] "LGTM fundamentally, but it's hard to know the outcome in these cases until we try on a real host!" [puppet] - https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[15:35:01] <icinga-wm>	 PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: stunnel4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:35:26] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: backup::databases
[15:35:54] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[15:36:01] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2002.codfw.wmnet with OS bullseye
[15:36:49] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: backup::es
[15:37:18] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] airflow: change max_active_runs_per_dag back to 1 [puppet] - https://gerrit.wikimedia.org/r/976700 (https://phabricator.wikimedia.org/T351388) (owner: Btullis)
[15:38:17] <wikibugs>	 (PS1) Jbond: backup::es: migrat to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976762 (https://phabricator.wikimedia.org/T349619)
[15:38:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1175.mgmt.eqiad.wmnet with reboot policy FORCED
[15:38:45] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:38:46] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[15:38:56] <wikibugs>	 (Merged) jenkins-bot: Upstream release v8.2.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/976758 (owner: Volans)
[15:39:42] <wikibugs>	 (PS1) Majavah: hieradata: depool web wiki replicas [puppet] - https://gerrit.wikimedia.org/r/976763
[15:39:49] <wikibugs>	 (CR) Jbond: [C: +2] backup::es: migrat to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976762 (https://phabricator.wikimedia.org/T349619) (owner: Jbond)
[15:39:54] <wikibugs>	 (CR) Brouberol: Define the spark-history/spark-history-test k8s namespaces (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/976731 (https://phabricator.wikimedia.org/T351713) (owner: Brouberol)
[15:40:01] <wikibugs>	 (PS1) Muehlenhoff: Switch testreduce1002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976764 (https://phabricator.wikimedia.org/T349619)
[15:40:15] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:40:28] <wikibugs>	 (CR) Btullis: [C: +1] hieradata: depool web wiki replicas [puppet] - https://gerrit.wikimedia.org/r/976763 (owner: Majavah)
[15:40:33] <wikibugs>	 (CR) Majavah: [C: +2] hieradata: depool web wiki replicas [puppet] - https://gerrit.wikimedia.org/r/976763 (owner: Majavah)
[15:40:56] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: host reimage
[15:41:17] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch testreduce1002 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/976764 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[15:41:59] <icinga-wm>	 RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:42:30] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage
[15:42:49] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1120 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:42:50] <volans>	 !log uploaded spicerack_8.2.0 to apt.wikimedia.org bullseye-wikimedia
[15:42:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:08] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[15:43:35] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: backup::es
[15:43:40] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1001.eqiad.wmnet with reason: host reimage
[15:43:57] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1120 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:44:03] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1120 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:44:11] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[15:44:49] <wikibugs>	 (PS1) Majavah: Revert "hieradata: depool web wiki replicas" [puppet] - https://gerrit.wikimedia.org/r/976777
[15:45:03] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2002.codfw.wmnet
[15:45:03] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: backup::production
[15:45:04] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs2002.codfw.wmnet
[15:45:27] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2003.codfw.wmnet with OS bullseye
[15:45:48] <wikibugs>	 (CR) Majavah: [C: +2] Revert "hieradata: depool web wiki replicas" [puppet] - https://gerrit.wikimedia.org/r/976777 (owner: Majavah)
[15:45:52] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[15:46:11] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[15:46:12] <wikibugs>	 (CR) Hashar: [C: -1] "`/robots.txt` is indeed shared and it is more or less obsolete or at least a remnant of the past." [mediawiki-config] - https://gerrit.wikimedia.org/r/971281 (https://phabricator.wikimedia.org/T254808) (owner: Ejegg)
[15:46:37] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage
[15:46:38] <wikibugs>	 (PS1) Majavah: hieradata: depool analytics wiki replicas [puppet] - https://gerrit.wikimedia.org/r/976786
[15:46:48] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[15:46:53] <wikibugs>	 (PS1) Ladsgroup: Add virtual domain for botpasswords [mediawiki-config] - https://gerrit.wikimedia.org/r/976787 (https://phabricator.wikimedia.org/T351559)
[15:47:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host testreduce1002.eqiad.wmnet
[15:47:39] <wikibugs>	 (CR) Btullis: [C: +1] hieradata: depool analytics wiki replicas [puppet] - https://gerrit.wikimedia.org/r/976786 (owner: Majavah)
[15:47:51] <wikibugs>	 (CR) Majavah: [C: +2] hieradata: depool analytics wiki replicas [puppet] - https://gerrit.wikimedia.org/r/976786 (owner: Majavah)
[15:47:54] <wikibugs>	 (PS1) Jbond: backup::production: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976788 (https://phabricator.wikimedia.org/T349619)
[15:48:30] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.dns.netbox
[15:48:35] <wikibugs>	 (CR) Jbond: [C: +2] backup::production: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976788 (https://phabricator.wikimedia.org/T349619) (owner: Jbond)
[15:49:25] <wikibugs>	 (PS2) Hnowlan: api-gateway: use enovy.yaml in place of config.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/976757 (https://phabricator.wikimedia.org/T300033)
[15:49:33] <jbond>	 taavi: feel free to merge mine if prompted
[15:49:40] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[15:49:45] <taavi>	 jbond: I already merged mine, try again?
[15:49:57] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1151 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:50:52] <wikibugs>	 (CR) Jelto: [C: +2] gerrit: accept SIGINT as a valid exit code [puppet] - https://gerrit.wikimedia.org/r/976679 (owner: Hashar)
[15:51:50] <jbond>	 taavi: ck cheers
[15:52:05] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding cp1113 back with correct VLAN - fabfur@cumin1001"
[15:52:35] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1143 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:52:45] <wikibugs>	 (PS1) Majavah: Revert "hieradata: depool analytics wiki replicas" [puppet] - https://gerrit.wikimedia.org/r/976778
[15:52:59] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding cp1113 back with correct VLAN - fabfur@cumin1001"
[15:52:59] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:53:34] <wikibugs>	 (CR) Majavah: [C: +2] Revert "hieradata: depool analytics wiki replicas" [puppet] - https://gerrit.wikimedia.org/r/976778 (owner: Majavah)
[15:53:43] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1113
[15:54:29] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe2001.codfw.wmnet with OS bullseye
[15:54:40] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye executed with errors: - moss-f...
[15:55:18] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1113
[15:55:29] <moritzm>	 !log installing dpkg bugfix updates on bullseye
[15:55:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:51] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2001.codfw.wmnet with OS bullseye
[15:56:01] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye
[15:56:43] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:06] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: backup::production
[15:57:55] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1104 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:58:17] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe1001.eqiad.wmnet with OS bullseye
[15:58:27] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:58:27] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-fe1001.eqiad.wmnet with OS bullseye completed: - moss-fe1001 (**WA...
[15:58:31] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1010 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:58:32] <wikibugs>	 (CR) Jbond: [C: +1] P:dns::auth::update: add support for setting ferm rules via confd (4 comments) [puppet] - https://gerrit.wikimedia.org/r/975843 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[15:59:01] <wikibugs>	 (CR) Jbond: [C: +1] P:dns::auth::update: add support for authdns-update hosts via confd [puppet] - https://gerrit.wikimedia.org/r/976254 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[15:59:13] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:59:37] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: dbbackups::content
[16:00:14] <James_F>	 jouncebot: nowandnext
[16:00:14] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 59 minute(s)
[16:00:14] <jouncebot>	 In 1 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1800)
[16:00:21] <wikibugs>	 (PS2) Jforrester: wikifunctions: Roll back Python evaluator to working version [deployment-charts] - https://gerrit.wikimedia.org/r/975882
[16:00:26] <wikibugs>	 (CR) Jforrester: [C: +2] wikifunctions: Roll back Python evaluator to working version [deployment-charts] - https://gerrit.wikimedia.org/r/975882 (owner: Jforrester)
[16:00:42] <wikibugs>	 (CR) JMeybohm: Expose Netbox's BGP servers to Homer (1 comment) [software/homer/deploy] - https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[16:01:19] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Roll back Python evaluator to working version [deployment-charts] - https://gerrit.wikimedia.org/r/975882 (owner: Jforrester)
[16:01:23] <wikibugs>	 (PS1) Jbond: dbbackups::content: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976790 (https://phabricator.wikimedia.org/T349619)
[16:01:48] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs2003.codfw.wmnet with OS bullseye
[16:01:53] <wikibugs>	 (CR) Jbond: [C: +2] dbbackups::content: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976790 (https://phabricator.wikimedia.org/T349619) (owner: Jbond)
[16:02:09] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2003.codfw.wmnet with OS bullseye
[16:02:11] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Prometheus bits LGTM!" [puppet] - https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[16:02:16] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:02:20] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:05:20] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[16:05:30] <sukhe>	 !log disable Puppet on A:lvs to merge CR 976312
[16:05:31] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dbbackups::content
[16:05:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:38] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] mw-jobrunner: add vhost for jobrunner.discovery.wmnet [deployment-charts] - https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[16:05:52] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: dbbackups::metadata
[16:06:00] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] pybal: do not install from component [puppet] - https://gerrit.wikimedia.org/r/976312 (https://phabricator.wikimedia.org/T348837) (owner: Ssingh)
[16:06:33] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[16:07:24] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[16:07:39] <wikibugs>	 (PS1) Jbond: dbbackups::metadata: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976791 (https://phabricator.wikimedia.org/T349619)
[16:08:02] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[16:08:11] <wikibugs>	 (CR) Jbond: [C: +2] dbbackups::metadata: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976791 (https://phabricator.wikimedia.org/T349619) (owner: Jbond)
[16:08:54] <sukhe>	 !log enable Puppet on A:lvs to merge CR 976312 and run agent
[16:08:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:00] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[16:09:40] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1104 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:09:56] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage
[16:09:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:10:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:10:14] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:10:21] <wikibugs>	 (PS1) Jforrester: Revert "wikifunctions: Bump evaluators to 2023-11-20-171133" [deployment-charts] - https://gerrit.wikimedia.org/r/976779
[16:10:31] <wikibugs>	 (CR) CI reject: [V: -1] Revert "wikifunctions: Bump evaluators to 2023-11-20-171133" [deployment-charts] - https://gerrit.wikimedia.org/r/976779 (owner: Jforrester)
[16:10:42] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:11:02] <wikibugs>	 (PS2) Jforrester: Revert "wikifunctions: Bump evaluators to 2023-11-20-171133" [deployment-charts] - https://gerrit.wikimedia.org/r/976779
[16:11:10] <wikibugs>	 (CR) Jforrester: [C: +2] Revert "wikifunctions: Bump evaluators to 2023-11-20-171133" [deployment-charts] - https://gerrit.wikimedia.org/r/976779 (owner: Jforrester)
[16:11:22] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1113.eqiad.wmnet with OS bullseye
[16:11:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (VRiley-WMF) @Volans Thank you for the information! I have ran through these again and with the help @RobH these should be corrected. Also, virtualization has been...
[16:12:09] <wikibugs>	 (Merged) jenkins-bot: Revert "wikifunctions: Bump evaluators to 2023-11-20-171133" [deployment-charts] - https://gerrit.wikimedia.org/r/976779 (owner: Jforrester)
[16:12:51] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dbbackups::metadata
[16:12:57] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage
[16:13:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (VRiley-WMF)
[16:13:13] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:13:18] <wikibugs>	 (CR) Hnowlan: [C: +2] mw-jobrunner: add vhost for jobrunner.discovery.wmnet [deployment-charts] - https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[16:13:32] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2028 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:13:35] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:13:45] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[16:14:01] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: mediabackup::storage
[16:14:02] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (MatthewVernon)
[16:14:25] <Emperor>	 !log repool moss-fe1001 with new envoy TLS setup T317616
[16:14:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:30] <stashbot>	 T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[16:14:46] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[16:14:48] <wikibugs>	 (Merged) jenkins-bot: mw-jobrunner: add vhost for jobrunner.discovery.wmnet [deployment-charts] - https://gerrit.wikimedia.org/r/976692 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[16:15:27] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync
[16:15:27] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync
[16:15:28] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[16:15:35] <Dreamy_Jazz>	 Betawikis seem to be broken - Cannot log into an account.
[16:15:46] <wikibugs>	 (PS1) Jbond: dbbackups::metadata: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976793 (https://phabricator.wikimedia.org/T349619)
[16:15:51] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2003.codfw.wmnet with reason: host reimage
[16:15:52] <Dreamy_Jazz>	 Error is Error 1146: Table 'wikishared.loginnotify_seen_net' doesn't exist
[16:15:52] <Dreamy_Jazz>	 Function: LoginNotify\LoginNotify::userIsInCurrentSeenBucket
[16:15:53] <Dreamy_Jazz>	 Query: SELECT 1 FROM `loginnotify_seen_net` WHERE lsn_user = 184252 AND lsn_subnet = -6951683680560312271 AND lsn_time_bucket = 2460 LIMIT 1 
[16:16:05] <Emperor>	 !log depool ms-fe1014 to reimage with new envoy TLS setup T317616
[16:16:06] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[16:16:07] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync
[16:16:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:09] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync
[16:16:09] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[16:16:20] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[16:16:20] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[16:16:28] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[16:16:29] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:16:55] <wikibugs>	 (CR) MVernon: [C: +2] hiera: move final swift frontend to envoy [puppet] - https://gerrit.wikimedia.org/r/976677 (https://phabricator.wikimedia.org/T317616) (owner: MVernon)
[16:18:04] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1014.eqiad.wmnet with OS bullseye
[16:18:14] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1014.eqiad.wmnet with OS bullseye
[16:18:23] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:18:37] <wikibugs>	 (CR) Majavah: [V: +1 C: +2] cloudlb: explicitely bind openstack mysql to ip [puppet] - https://gerrit.wikimedia.org/r/976736 (owner: Majavah)
[16:18:52] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2003.codfw.wmnet with reason: host reimage
[16:19:09] <wikibugs>	 (CR) Jbond: [C: +2] dbbackups::metadata: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976793 (https://phabricator.wikimedia.org/T349619) (owner: Jbond)
[16:20:42] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe2001.codfw.wmnet with OS bullseye
[16:20:51] <wikibugs>	 SRE, SRE-swift-storage, Traffic, Patch-For-Review: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye executed with errors: - moss-f...
[16:21:18] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:21:36] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:47] <wikibugs>	 (CR) JMeybohm: [C: +1] api-gateway: use enovy.yaml in place of config.yaml (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/976757 (https://phabricator.wikimedia.org/T300033) (owner: Hnowlan)
[16:23:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on kubernetes1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:23:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on aux-k8s-worker1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:24:15] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mediabackup::storage
[16:24:47] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: mediabackup::worker
[16:25:13] <wikibugs>	 (PS1) Hnowlan: jobrunner: remove php version related checks from httpbb [puppet] - https://gerrit.wikimedia.org/r/976794 (https://phabricator.wikimedia.org/T349796)
[16:25:45] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (jbond)
[16:25:54] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1113.eqiad.wmnet with reason: host reimage
[16:25:55] <wikibugs>	 (CR) Hnowlan: [C: +2] api-gateway: use enovy.yaml in place of config.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/976757 (https://phabricator.wikimedia.org/T300033) (owner: Hnowlan)
[16:26:29] <jinxer-wm>	 (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1473:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:26:42] <wikibugs>	 (PS1) Jbond: mediabackup::worker: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976795 (https://phabricator.wikimedia.org/T349619)
[16:26:51] <wikibugs>	 (Merged) jenkins-bot: api-gateway: use enovy.yaml in place of config.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/976757 (https://phabricator.wikimedia.org/T300033) (owner: Hnowlan)
[16:27:13] <wikibugs>	 (CR) Jbond: [C: +2] mediabackup::worker: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976795 (https://phabricator.wikimedia.org/T349619) (owner: Jbond)
[16:28:51] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1113.eqiad.wmnet with reason: host reimage
[16:29:57] <wikibugs>	 (CR) Dreamy Jazz: "This caused https://phabricator.wikimedia.org/T351828" [mediawiki-config] - https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[16:29:59] <jinxer-wm>	 (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:30:50] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2001.codfw.wmnet with OS bullseye
[16:30:56] <wikibugs>	 SRE, SRE-swift-storage, Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye
[16:31:29] <jinxer-wm>	 (PuppetZeroResources) firing: (6) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:31:32] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mediabackup::worker
[16:31:33] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] jobrunner: remove php version related checks from httpbb [puppet] - https://gerrit.wikimedia.org/r/976794 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[16:31:47] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[16:31:53] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage
[16:32:24] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:33:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on kubernetes1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:34:27] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2001.codfw.wmnet with reason: host reimage
[16:34:33] <wikibugs>	 SRE, SRE-swift-storage, Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (MatthewVernon) (perhaps the moss-fe2001 puppet failures are due to T350809 )
[16:34:59] <jinxer-wm>	 (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:35:11] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1014.eqiad.wmnet with reason: host reimage
[16:36:29] <jinxer-wm>	 (PuppetZeroResources) firing: (8) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:38:01] <wikibugs>	 (CR) Andrea Denisse: [C: +2] prometheus: Add a default rsyslog destination for all sites [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[16:38:18] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1014.eqiad.wmnet with reason: host reimage
[16:38:59] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on aux-k8s-worker1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:39:16] <wikibugs>	 (CR) Andrea Denisse: [C: +2] prometheus: Add a default rsyslog destination for all sites (1 comment) [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[16:40:20] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2003.codfw.wmnet with OS bullseye
[16:41:59] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[16:42:09] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2003.codfw.wmnet
[16:42:09] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs2003.codfw.wmnet
[16:42:28] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[16:42:42] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[16:43:11] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2004.codfw.wmnet with OS bullseye
[16:44:46] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe2001.codfw.wmnet with OS bullseye
[16:44:52] <wikibugs>	 SRE, SRE-swift-storage, Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2001.codfw.wmnet with OS bullseye completed: - moss-fe2001 (**PASS**)   - Downtimed on...
[16:45:59] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001"
[16:46:53] <Emperor>	 !log repool moss-fe2001 with new envoy TLS setup T317616
[16:46:56] <wikibugs>	 (PS1) JHathaway: puppetserver: fix java_start_mem in template [puppet] - https://gerrit.wikimedia.org/r/976799
[16:46:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:57] <stashbot>	 T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[16:47:23] <wikibugs>	 (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/976799 (owner: JHathaway)
[16:47:29] <wikibugs>	 SRE, SRE-swift-storage, Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (MatthewVernon)
[16:47:41] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1001"
[16:47:43] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1113.eqiad.wmnet with OS bullseye
[16:47:52] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1113.eqiad.wmnet with OS bullseye completed: - cp1113 (**PASS**)   - Remo...
[16:48:15] <wikibugs>	 (CR) Pppery: "No idea. I just followed the convention of the existing files." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694) (owner: Pppery)
[16:48:51] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[16:49:54] <wikibugs>	 (PS3) Majavah: interface: new define for managing routing table names [puppet] - https://gerrit.wikimedia.org/r/976734
[16:49:56] <wikibugs>	 (PS1) Majavah: interface::route: add support for cloud-private bgp routes [puppet] - https://gerrit.wikimedia.org/r/976800
[16:50:07] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/976799 (owner: JHathaway)
[16:50:31] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/976799 (owner: JHathaway)
[16:51:03] <wikibugs>	 (CR) CI reject: [V: -1] interface::route: add support for cloud-private bgp routes [puppet] - https://gerrit.wikimedia.org/r/976800 (owner: Majavah)
[16:51:46] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/655/con" [puppet] - https://gerrit.wikimedia.org/r/976734 (owner: Majavah)
[16:52:42] <wikibugs>	 (PS2) Majavah: interface::route: add support for cloud-private bgp routes [puppet] - https://gerrit.wikimedia.org/r/976800
[16:53:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:55:09] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1014.eqiad.wmnet with OS bullseye
[16:55:16] <wikibugs>	 SRE, SRE-swift-storage, Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1014.eqiad.wmnet with OS bullseye completed: - ms-fe1014 (**PASS**)   - Downtimed on Ici...
[16:55:53] <volans>	 !log installed spicerack v8.2.0 to the cumin hosts
[16:55:54] <wikibugs>	 (CR) Hnowlan: [C: +2] jobrunner: remove php version related checks from httpbb [puppet] - https://gerrit.wikimedia.org/r/976794 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[16:55:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:22] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:56:36] <Emperor>	 !log repool ms-fe1014 with new envoy TLS setup T317616
[16:56:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:41] <stashbot>	 T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[16:56:44] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 13 CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - https://gerrit.wikimedia.org/r/976800 (owner: Majavah)
[16:56:48] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1128 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:20] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2004.codfw.wmnet with reason: host reimage
[16:57:26] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1113.eqiad.wmnet
[16:57:27] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1113.eqiad.wmnet
[16:57:48] <wikibugs>	 SRE, SRE-swift-storage, Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (MatthewVernon)
[16:59:03] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[16:59:31] <wikibugs>	 (PS1) Hnowlan: api-gateway: correct config mount path [deployment-charts] - https://gerrit.wikimedia.org/r/976801
[17:01:04] <fabfur>	 !log swapped cp1113 <-> cp1088 (T349244)
[17:01:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:16] <stashbot>	 T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244
[17:02:04] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2004.codfw.wmnet with reason: host reimage
[17:02:07] <wikibugs>	 (CR) Hnowlan: [C: +2] api-gateway: correct config mount path [deployment-charts] - https://gerrit.wikimedia.org/r/976801 (owner: Hnowlan)
[17:03:01] <wikibugs>	 (Merged) jenkins-bot: api-gateway: correct config mount path [deployment-charts] - https://gerrit.wikimedia.org/r/976801 (owner: Hnowlan)
[17:06:53] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[17:07:09] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[17:10:58] <wikibugs>	 (CR) Pppery: "Noted. I'll update this patch (and the related one elsewhere in the tree that updates the files actually read by Phabricator) then" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363) (owner: Pppery)
[17:11:59] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:14:40] <wikibugs>	 (CR) Btullis: [C: +1] "Looks good to me, thanks." [puppet] - https://gerrit.wikimedia.org/r/976385 (https://phabricator.wikimedia.org/T332589) (owner: Stevemunene)
[17:21:19] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[17:21:46] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[17:23:34] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[17:23:35] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2004.codfw.wmnet with OS bullseye
[17:23:55] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[17:24:45] <wikibugs>	 (PS1) Samtar: Revert "InitialiseSettings-labs: Enable AbuseFilterBlockedExternalDomainsNotifications on enwiki.beta" [mediawiki-config] - https://gerrit.wikimedia.org/r/976814
[17:24:51] <wikibugs>	 (PS2) Samtar: Revert "InitialiseSettings-labs: Enable AbuseFilterBlockedExternalDomainsNotifications on enwiki.beta" [mediawiki-config] - https://gerrit.wikimedia.org/r/976814
[17:25:00] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[17:25:10] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[17:25:15] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:25:17] <TheresNoTime>	 jouncebot: nowandnext
[17:25:17] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 34 minute(s)
[17:25:17] <jouncebot>	 In 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1800)
[17:26:05] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/976814 (owner: Samtar)
[17:26:23] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2004.codfw.wmnet
[17:26:23] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs2004.codfw.wmnet
[17:26:50] <wikibugs>	 (Merged) jenkins-bot: Revert "InitialiseSettings-labs: Enable AbuseFilterBlockedExternalDomainsNotifications on enwiki.beta" [mediawiki-config] - https://gerrit.wikimedia.org/r/976814 (owner: Samtar)
[17:27:08] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[17:27:21] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[17:27:36] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2005.codfw.wmnet with OS bullseye
[17:27:44] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1175.mgmt.eqiad.wmnet with reboot policy FORCED
[17:28:02] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[17:28:15] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[17:29:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1175']
[17:30:37] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:36:23] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1175']
[17:37:24] <wikibugs>	 (CR) Dreamy Jazz: Reapply "Enable LoginNotify seen subnets table"" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/976177 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[17:39:37] <icinga-wm>	 PROBLEM - Disk space on build2001 is CRITICAL: DISK CRITICAL - free space: / 10021 MB (4% inode=66%): /tmp 10021 MB (4% inode=66%): /var/tmp 10021 MB (4% inode=66%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops
[17:42:50] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2005.codfw.wmnet with reason: host reimage
[17:44:55] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[17:45:29] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2005.codfw.wmnet with reason: host reimage
[17:45:55] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Move 25% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T348122 (Clement_Goubert) In progress→Resolved a:Clement_Goubert
[17:51:03] <wikibugs>	 (PS1) Sergio Gimeno: GrowthExperiments: enable AddLink frontend for 16,17th rounds of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/976804 (https://phabricator.wikimedia.org/T308142)
[17:55:27] <wikibugs>	 (PS1) Fabfur: conftool-data: (temporary) remove cp1113 [puppet] - https://gerrit.wikimedia.org/r/976805 (https://phabricator.wikimedia.org/T349244)
[17:58:41] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:59:30] <wikibugs>	 (PS2) Fabfur: conftool-data: (temporary) remove cp1113 [puppet] - https://gerrit.wikimedia.org/r/976805 (https://phabricator.wikimedia.org/T349244)
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1800)
[18:02:29] <wikibugs>	 (CR) Ssingh: [C: +1] conftool-data: (temporary) remove cp1113 [puppet] - https://gerrit.wikimedia.org/r/976805 (https://phabricator.wikimedia.org/T349244) (owner: Fabfur)
[18:03:16] <wikibugs>	 (CR) Fabfur: [C: +2] conftool-data: (temporary) remove cp1113 [puppet] - https://gerrit.wikimedia.org/r/976805 (https://phabricator.wikimedia.org/T349244) (owner: Fabfur)
[18:03:45] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2005.codfw.wmnet with OS bullseye
[18:04:51] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:06:35] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:06:51] <icinga-wm>	 PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:06:59] <wikibugs>	 (PS1) Fabfur: conftool-data: re-added cp1113 [puppet] - https://gerrit.wikimedia.org/r/976826 (https://phabricator.wikimedia.org/T349244)
[18:07:15] <wikibugs>	 (CR) Ssingh: [C: +1] conftool-data: re-added cp1113 [puppet] - https://gerrit.wikimedia.org/r/976826 (https://phabricator.wikimedia.org/T349244) (owner: Fabfur)
[18:12:39] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:15:01] <wikibugs>	 (CR) Fabfur: [C: +2] conftool-data: re-added cp1113 [puppet] - https://gerrit.wikimedia.org/r/976826 (https://phabricator.wikimedia.org/T349244) (owner: Fabfur)
[18:16:15] <icinga-wm>	 PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 337 probes of 735 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:16:34] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1160.eqiad.wmnet with OS bullseye
[18:16:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye
[18:17:17] <icinga-wm>	 PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 53 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:18:17] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2017 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:19:46] <wikibugs>	 (CR) Dzahn: "could this have caused https://phabricator.wikimedia.org/T351832 ?" [puppet] - https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: JHathaway)
[18:20:21] <icinga-wm>	 PROBLEM - Disk space on build2001 is CRITICAL: DISK CRITICAL - free space: / 10800 MB (4% inode=65%): /tmp 10800 MB (4% inode=65%): /var/tmp 10800 MB (4% inode=65%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops
[18:21:53] <icinga-wm>	 RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 80 probes of 735 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:22:37] <icinga-wm>	 RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 19 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:25:55] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:26:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (RobH) attempting to reimage an-worker1160 it sticks at requesting a lease for boot, host shows the MAC of the eth0 attempting to request a dhcp lease for boot.  on insta...
[18:29:37] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:29:45] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1152 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:32:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1152 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:32:29] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1152 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:36:44] <wikibugs>	 (PS1) Dzahn: doc: move rsync auth secrets to new location to unbreak puppet [puppet] - https://gerrit.wikimedia.org/r/976830 (https://phabricator.wikimedia.org/T351832)
[18:37:36] <wikibugs>	 (CR) JHathaway: [C: +1] "thanks, looks good" [puppet] - https://gerrit.wikimedia.org/r/976830 (https://phabricator.wikimedia.org/T351832) (owner: Dzahn)
[18:38:40] <wikibugs>	 (CR) Dzahn: [C: +2] doc: move rsync auth secrets to new location to unbreak puppet [puppet] - https://gerrit.wikimedia.org/r/976830 (https://phabricator.wikimedia.org/T351832) (owner: Dzahn)
[18:40:43] <icinga-wm>	 RECOVERY - Disk space on build2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops
[18:48:31] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:48:39] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:49:35] <wikibugs>	 (CR) Dzahn: "fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/976830" [puppet] - https://gerrit.wikimedia.org/r/976284 (https://phabricator.wikimedia.org/T345830) (owner: JHathaway)
[18:50:59] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[18:55:36] <wikibugs>	 (PS1) Majavah: rsync: do not included config for absented modules [puppet] - https://gerrit.wikimedia.org/r/976835
[18:56:32] <wikibugs>	 (PS27) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[18:59:46] <wikibugs>	 (CR) Muehlenhoff: Initial checkin of community_civicrm module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[19:00:06] <jouncebot>	 Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T1900)
[19:03:23] <wikibugs>	 (PS1) DDesouza: Update Annual Plan Core Metrics survey [mediawiki-config] - https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353)
[19:04:02] <wikibugs>	 (PS2) DDesouza: Undeploy Reader Demographics 2 survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/976325 (https://phabricator.wikimedia.org/T344393)
[19:04:19] <icinga-wm>	 PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 41 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:04:29] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 141 probes of 737 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:05:30] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2005.codfw.wmnet
[19:05:31] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs2005.codfw.wmnet
[19:06:35] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2006.codfw.wmnet with OS bullseye
[19:09:59] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 63 probes of 737 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:14:39] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:15:03] <icinga-wm>	 RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 24 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:18:57] <wikibugs>	 (PS2) DDesouza: Update Annual Plan Core Metrics survey [mediawiki-config] - https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353)
[19:22:57] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2006.codfw.wmnet with reason: host reimage
[19:23:11] <icinga-wm>	 PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 43 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:24:50] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:25:23] <wikibugs>	 (PS1) DDesouza: [beta] Deploy Annual Plan Core Metrics survey [mediawiki-config] - https://gerrit.wikimedia.org/r/976842 (https://phabricator.wikimedia.org/T351353)
[19:25:37] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2006.codfw.wmnet with reason: host reimage
[19:26:17] <wikibugs>	 SRE, Data-Engineering, Event-Platform: Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (brouberol) I wonder if something as simple as round robin DNS implemented with multiple A records with the same subdomain would suffice  to substantially improve the situation.  In...
[19:28:33] <icinga-wm>	 RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 21 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:30:46] <wikibugs>	 (PS1) DDesouza: [beta] Clean residuals from Research Incentive survey [mediawiki-config] - https://gerrit.wikimedia.org/r/976843 (https://phabricator.wikimedia.org/T316464)
[19:33:51] <wikibugs>	 (PS1) DDesouza: Deploy Annual Plan Core Metrics survey [mediawiki-config] - https://gerrit.wikimedia.org/r/976844 (https://phabricator.wikimedia.org/T351353)
[19:36:41] <icinga-wm>	 PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 49 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:36:51] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1160.eqiad.wmnet with OS bullseye
[19:36:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye executed with errors: - an-worker1...
[19:42:03] <icinga-wm>	 RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 19 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[19:42:12] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/976835 (owner: Majavah)
[19:44:28] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T348183)', diff saved to https://phabricator.wikimedia.org/P53732 and previous config saved to /var/cache/conftool/dbconfig/20231122-194428-arnaudb.json
[19:44:36] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[19:47:45] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:48:08] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2006.codfw.wmnet with OS bullseye
[19:55:27] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2006.codfw.wmnet
[19:55:27] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs2006.codfw.wmnet
[19:56:32] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2007.codfw.wmnet with OS bullseye
[19:59:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P53733 and previous config saved to /var/cache/conftool/dbconfig/20231122-195934-arnaudb.json
[20:09:55] <wikibugs>	 SRE-swift-storage, ops-codfw, ops-eqiad, DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (wiki_willy) a:Jclark-ctr
[20:10:46] <wikibugs>	 (PS5) Jforrester: wikifunctions: Switch orchestrator to 2023-11-06-172159 [deployment-charts] - https://gerrit.wikimedia.org/r/971999 (https://phabricator.wikimedia.org/T297509)
[20:10:48] <wikibugs>	 (PS1) Jforrester: wikifunctions: Switch JavaScript evaluator to 2023-11-22-195017 [deployment-charts] - https://gerrit.wikimedia.org/r/976846 (https://phabricator.wikimedia.org/T349385)
[20:10:55] <wikibugs>	 (PS1) Jforrester: wikifunctions: Switch Python evaluator to 2023-11-22-195017 [deployment-charts] - https://gerrit.wikimedia.org/r/976847 (https://phabricator.wikimedia.org/T281500)
[20:11:08] <wikibugs>	 (PS1) Jforrester: wikifunctions: Switch orchestrator to 2023-11-17-200241 [deployment-charts] - https://gerrit.wikimedia.org/r/976848 (https://phabricator.wikimedia.org/T297509)
[20:11:33] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2007.codfw.wmnet with reason: host reimage
[20:13:03] <icinga-wm>	 PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 62 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:14:14] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2007.codfw.wmnet with reason: host reimage
[20:14:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P53734 and previous config saved to /var/cache/conftool/dbconfig/20231122-201441-arnaudb.json
[20:15:21] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:18:23] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:19:43] <icinga-wm>	 PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:27:11] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:27:13] <wikibugs>	 (PS2) DDesouza: [beta] Deploy Annual Plan Core Metrics survey [mediawiki-config] - https://gerrit.wikimedia.org/r/976842 (https://phabricator.wikimedia.org/T351353)
[20:28:27] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.313 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:29:13] <icinga-wm>	 RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 26 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:29:29] <wikibugs>	 (PS3) DDesouza: Update Annual Plan Core Metrics survey [mediawiki-config] - https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353)
[20:29:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T348183)', diff saved to https://phabricator.wikimedia.org/P53735 and previous config saved to /var/cache/conftool/dbconfig/20231122-202947-arnaudb.json
[20:29:52] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[20:30:05] <wikibugs>	 (PS2) DDesouza: Deploy Annual Plan Core Metrics survey [mediawiki-config] - https://gerrit.wikimedia.org/r/976844 (https://phabricator.wikimedia.org/T351353)
[20:33:51] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1156 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:33:54] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2007.codfw.wmnet with OS bullseye
[20:34:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:34:59] <jinxer-wm>	 (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[20:35:10] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2007.codfw.wmnet
[20:35:11] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs2007.codfw.wmnet
[20:35:48] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2008.codfw.wmnet with OS bullseye
[20:36:29] <jinxer-wm>	 (PuppetZeroResources) firing: (8) Puppet has failed generate resources on mw1472:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[20:37:07] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:37:45] <wikibugs>	 (PS3) DDesouza: [beta] Deploy Annual Plan Core Metrics survey [mediawiki-config] - https://gerrit.wikimedia.org/r/976842 (https://phabricator.wikimedia.org/T351353)
[20:38:01] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:40:14] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (Dzahn)
[20:41:05] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (Dzahn)
[20:41:14] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (Dzahn)
[20:41:53] <wikibugs>	 (CR) JHathaway: [C: +2] puppetserver: fix java_start_mem in template [puppet] - https://gerrit.wikimedia.org/r/976799 (owner: JHathaway)
[20:43:37] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (Dzahn) ` cookbook [GLOBAL_ARGS] sre.ganeti.makevm: error: argument --memory: Memory must be at least 1.5G  `  Oh really? Well then 1.5G. But we used to have VMs with 256MB, didn't we
[20:43:54] <wikibugs>	 (CR) Krinkle: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: D3r1ck01)
[20:44:56] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (Dzahn) ` sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 1.5G ... .. error: argument --memory: invalid validate_memory value: '1.5G' `   ` sudo cookbook sre.ganeti.makevm --vc...
[20:45:30] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1175.eqiad.wmnet with OS bullseye
[20:45:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1175.eqiad.wmnet with OS bullseye
[20:47:54] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host planet1003.eqiad.wmnet
[20:47:55] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[20:50:04] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM planet1003.eqiad.wmnet - dzahn@cumin1001"
[20:50:51] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM planet1003.eqiad.wmnet - dzahn@cumin1001"
[20:50:51] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:50:52] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache planet1003.eqiad.wmnet on all recursors
[20:50:55] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) planet1003.eqiad.wmnet on all recursors
[20:51:29] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM planet1003.eqiad.wmnet - dzahn@cumin1001"
[20:52:17] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM planet1003.eqiad.wmnet - dzahn@cumin1001"
[20:53:46] <wikibugs>	 (PS1) Dzahn: site: add planet[12]003 to role [puppet] - https://gerrit.wikimedia.org/r/976854 (https://phabricator.wikimedia.org/T348392)
[20:53:48] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [alerts] - https://gerrit.wikimedia.org/r/975832 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[20:53:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:58:09] <wikibugs>	 (PS1) JHathaway: g10k: spelling [puppet] - https://gerrit.wikimedia.org/r/976856
[20:58:11] <wikibugs>	 (PS1) JHathaway: puppetserver: use a symlink to swap in new code [puppet] - https://gerrit.wikimedia.org/r/976857 (https://phabricator.wikimedia.org/T350809)
[20:58:35] <wikibugs>	 (PS1) Dzahn: site: add planet[12]003 with insetup role [puppet] - https://gerrit.wikimedia.org/r/976858 (https://phabricator.wikimedia.org/T351849)
[20:58:57] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/976857 (https://phabricator.wikimedia.org/T350809) (owner: JHathaway)
[20:59:03] <wikibugs>	 (CR) Dzahn: [C: +2] site: add planet[12]003 with insetup role [puppet] - https://gerrit.wikimedia.org/r/976858 (https://phabricator.wikimedia.org/T351849) (owner: Dzahn)
[20:59:05] <wikibugs>	 (CR) JHathaway: [C: +2] g10k: spelling [puppet] - https://gerrit.wikimedia.org/r/976856 (owner: JHathaway)
[21:00:06] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T2100).
[21:00:06] <jouncebot>	 danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:27] <danisztls>	 o/
[21:00:41] <RoanKattouw>	 I can deploy
[21:00:47] <wikibugs>	 (03PS2) 10Dzahn: site: add planet[12]003 to planet role [puppet] - 10https://gerrit.wikimedia.org/r/976854 (https://phabricator.wikimedia.org/T348392)
[21:00:53] <wikibugs>	 (03PS3) 10Dzahn: site: add planet[12]003 to planet role [puppet] - 10https://gerrit.wikimedia.org/r/976854 (https://phabricator.wikimedia.org/T348392)
[21:01:18] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] site: add planet[12]003 to planet role [puppet] - 10https://gerrit.wikimedia.org/r/976854 (https://phabricator.wikimedia.org/T348392) (owner: 10Dzahn)
[21:01:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976325 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza)
[21:02:49] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy Reader Demographics 2 survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976325 (https://phabricator.wikimedia.org/T344393) (owner: 10DDesouza)
[21:02:56] <wikibugs>	 (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/976857 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway)
[21:03:06] <logmsgbot>	 !log catrope@deploy2002 Started scap: Backport for [[gerrit:976325|Undeploy Reader Demographics 2 survey on enwiki (T344393)]]
[21:03:13] <stashbot>	 T344393: Quicksurvey deployment for readers survey  - https://phabricator.wikimedia.org/T344393
[21:03:15] <wikibugs>	 (03PS28) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[21:04:28] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2008.codfw.wmnet with reason: host reimage
[21:07:21] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host planet1003.eqiad.wmnet with OS bookworm
[21:07:25] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2008.codfw.wmnet with reason: host reimage
[21:07:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10Patch-For-Review: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm
[21:11:43] <logmsgbot>	 !log catrope@deploy2002 catrope and dani: Backport for [[gerrit:976325|Undeploy Reader Demographics 2 survey on enwiki (T344393)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:12:02] <stashbot>	 T344393: Quicksurvey deployment for readers survey  - https://phabricator.wikimedia.org/T344393
[21:12:21] <RoanKattouw>	 danisztls: Your first change (undeploy Reader Demographics 2 on enwiki) is now ready for testing on the test servers, please test and ping me when you've confirmed it works
[21:14:14] <danisztls>	 RoanKattouw: looks good
[21:16:44] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on planet1003.eqiad.wmnet with reason: host reimage
[21:18:47] <logmsgbot>	 !log catrope@deploy2002 catrope and dani: Continuing with sync
[21:19:05] <wikibugs>	 (03PS1) 10JHathaway: apt-staging: unbreak rsync puppetry [puppet] - 10https://gerrit.wikimedia.org/r/976863 (https://phabricator.wikimedia.org/T345830)
[21:19:42] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/976863 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway)
[21:19:44] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on planet1003.eqiad.wmnet with reason: host reimage
[21:22:15] <wikibugs>	 (03PS1) 10Gergő Tisza: CentralAuth: Fix wikisource.org cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976864 (https://phabricator.wikimedia.org/T351685)
[21:22:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] CentralAuth: Fix wikisource.org cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976864 (https://phabricator.wikimedia.org/T351685) (owner: 10Gergő Tisza)
[21:24:44] <wikibugs>	 (03PS29) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[21:24:48] <wikibugs>	 (03PS2) 10Gergő Tisza: CentralAuth: Fix wikisource.org cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976864 (https://phabricator.wikimedia.org/T351685)
[21:24:50] <logmsgbot>	 !log catrope@deploy2002 Finished scap: Backport for [[gerrit:976325|Undeploy Reader Demographics 2 survey on enwiki (T344393)]] (duration: 21m 43s)
[21:24:54] <stashbot>	 T344393: Quicksurvey deployment for readers survey  - https://phabricator.wikimedia.org/T344393
[21:25:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza)
[21:26:21] <RoanKattouw>	 Alright the Reader Demographics change is deployed, the Core Metrics one is next
[21:26:57] <wikibugs>	 (03PS4) 10Catrope: Update Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza)
[21:27:01] <danisztls>	 RoanKattouw: thanks!
[21:27:02] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by catrope@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza)
[21:27:14] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Update Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza)
[21:27:53] <wikibugs>	 (03Merged) 10jenkins-bot: Update Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976839 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza)
[21:28:10] <logmsgbot>	 !log catrope@deploy2002 Started scap: Backport for [[gerrit:976839|Update Annual Plan Core Metrics survey (T351353)]]
[21:28:15] <stashbot>	 T351353: Deploy survey (Community Feedback on Core Metrics Reports) on Meta-Wiki - https://phabricator.wikimedia.org/T351353
[21:28:53] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2008.codfw.wmnet with OS bullseye
[21:29:28] <logmsgbot>	 !log catrope@deploy2002 catrope and dani: Backport for [[gerrit:976839|Update Annual Plan Core Metrics survey (T351353)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:29:41] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] apt-staging: unbreak rsync puppetry [puppet] - 10https://gerrit.wikimedia.org/r/976863 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway)
[21:29:45] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for aqs2008.codfw.wmnet
[21:29:45] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs2008.codfw.wmnet
[21:29:54] <RoanKattouw>	 danisztls: The Core Metrics patch is on the test servers, please test
[21:30:04] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] apt-staging: unbreak rsync puppetry [puppet] - 10https://gerrit.wikimedia.org/r/976863 (https://phabricator.wikimedia.org/T345830) (owner: 10JHathaway)
[21:30:06] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2009.codfw.wmnet with OS bullseye
[21:30:39] <danisztls>	 RoanKattouw: looks good
[21:30:55] <logmsgbot>	 !log catrope@deploy2002 catrope and dani: Continuing with sync
[21:31:16] <wikibugs>	 (03CR) 10Krinkle: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[21:34:22] <wikibugs>	 (03CR) 10Dwisehaupt: "Thanks for the suggestions. Updates made and changeset rebased to pull in the latest repo updates." [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[21:37:15] <logmsgbot>	 !log catrope@deploy2002 Finished scap: Backport for [[gerrit:976839|Update Annual Plan Core Metrics survey (T351353)]] (duration: 09m 04s)
[21:37:20] <stashbot>	 T351353: Deploy survey (Community Feedback on Core Metrics Reports) on Meta-Wiki - https://phabricator.wikimedia.org/T351353
[21:37:43] <wikibugs>	 (03PS4) 10Catrope: [beta] Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976842 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza)
[21:37:48] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] [beta] Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976842 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza)
[21:37:59] <wikibugs>	 (03PS2) 10Catrope: [beta] Clean residuals from Research Incentive survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976843 (https://phabricator.wikimedia.org/T316464) (owner: 10DDesouza)
[21:38:03] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] [beta] Clean residuals from Research Incentive survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976843 (https://phabricator.wikimedia.org/T316464) (owner: 10DDesouza)
[21:39:03] <danisztls>	 RoanKattouw: thanks!
[21:39:08] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] Deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976842 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza)
[21:39:16] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] Clean residuals from Research Incentive survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976843 (https://phabricator.wikimedia.org/T316464) (owner: 10DDesouza)
[21:39:37] <RoanKattouw>	 danisztls: Now that these beta patches are merged, there's no manual deployment process, they're automatically deployed to beta labs but it can take ~15 minutes
[21:42:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "not tested but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/976857 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway)
[21:44:00] <danisztls>	 RoanKattouw: no problem, I will check them later
[21:44:07] <danisztls>	 thanks, again!
[21:44:12] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2009.codfw.wmnet with reason: host reimage
[21:44:21] <RoanKattouw>	 Great! And I think that's everything for today
[21:46:44] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2009.codfw.wmnet with reason: host reimage
[21:56:54] <wikibugs>	 (03PS1) 10Dzahn: hieradata: set planet[12]003 to use puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976867 (https://phabricator.wikimedia.org/T351849)
[22:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231122T2200)
[22:00:05] <wikibugs>	 (03CR) 10Jgreen: [C: 03+1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[22:00:09] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir4001 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikimedia.is has 86391 seconds left https://wikitech.wikimedia.org/wiki/Ncredir
[22:00:18] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] hieradata: set planet[12]003 to use puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976867 (https://phabricator.wikimedia.org/T351849) (owner: 10Dzahn)
[22:00:43] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir4001 is CRITICAL: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has 86356 seconds left https://wikitech.wikimedia.org/wiki/Ncredir
[22:04:16] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/976868
[22:05:32] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2009.codfw.wmnet with OS bullseye
[22:06:27] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2010.codfw.wmnet with OS bullseye
[22:07:13] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1175.eqiad.wmnet with OS bullseye
[22:07:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1175.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[22:08:47] <logmsgbot>	 !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host planet1003.eqiad.wmnet with OS bookworm
[22:08:47] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host planet1003.eqiad.wmnet
[22:08:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests, 10Patch-For-Review: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm executed with errors: -...
[22:08:59] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:09:37] <mutante>	 jhathaway: ^ something went wrong with the patch maybe?
[22:10:18] <jhathaway>	 hmm, strange, a manual puppet run was successful, let me check, thanks mutante 
[22:10:29] <mutante>	 is it 2001 vs 1001?
[22:11:15] <jhathaway>	 there is only 2001, to my knowledge
[22:11:29] <jhathaway>	 also it doesn't show up on the alerts dashboard, hmm
[22:11:48] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host planet1003.eqiad.wmnet with OS bookworm
[22:11:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm
[22:16:02] <moritzm>	 apt-staging is a single host, more of an initial PoC which will either get extended with a second staging host or folded into the main apt servers, TBD
[22:16:10] <moritzm>	 its rsync endpoints are the gitlab runners
[22:17:09] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/976868 (owner: 10Ebernhardson)
[22:17:29] <jhathaway>	 nod, thanks moritzm 
[22:17:56] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/976868 (owner: 10Ebernhardson)
[22:18:27] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[22:18:34] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[22:18:43] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:19:27] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:20:18] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs2010.codfw.wmnet with OS bullseye
[22:20:34] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2010.codfw.wmnet with OS bullseye
[22:21:36] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on planet1003.eqiad.wmnet with reason: host reimage
[22:23:43] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] puppetserver: use a symlink to swap in new code [puppet] - 10https://gerrit.wikimedia.org/r/976857 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway)
[22:24:13] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on planet1003.eqiad.wmnet with reason: host reimage
[22:25:40] <ebernhardson>	 !log start cirrus updater backfilling into relforge
[22:25:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:11] <wikibugs>	 (03PS1) 10JHathaway: puppet-merge: test, no changes [puppet] - 10https://gerrit.wikimedia.org/r/976871
[22:33:09] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] puppet-merge: test, no changes [puppet] - 10https://gerrit.wikimedia.org/r/976871 (owner: 10JHathaway)
[22:34:29] <mutante>	 !log puppetserver1001 - manually signed puppet cert request for planet1003
[22:34:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:35:26] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host planet2003.codfw.wmnet
[22:35:28] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[22:38:59] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM planet2003.codfw.wmnet - dzahn@cumin1001"
[22:40:19] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM planet2003.codfw.wmnet - dzahn@cumin1001"
[22:40:19] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:40:19] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache planet2003.codfw.wmnet on all recursors
[22:40:22] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) planet2003.codfw.wmnet on all recursors
[22:40:49] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM planet2003.codfw.wmnet - dzahn@cumin1001"
[22:41:41] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM planet2003.codfw.wmnet - dzahn@cumin1001"
[22:41:56] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host planet2003.codfw.wmnet with OS bookworm
[22:42:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet2003.codfw.wmnet with OS bookworm
[22:43:01] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs2010.codfw.wmnet with OS bullseye
[22:43:19] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs2010.codfw.wmnet with OS bullseye
[22:49:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[22:52:38] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating ganeti servers in codfw - jhancock@cumin2002"
[22:53:53] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating ganeti servers in codfw - jhancock@cumin2002"
[22:53:54] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:57:19] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2010.codfw.wmnet with reason: host reimage
[22:57:34] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 7h 44m 37s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[22:58:08] <wikibugs>	 (03PS1) 10JHathaway: Revert "puppet-merge: test, no changes" [puppet] - 10https://gerrit.wikimedia.org/r/976818
[23:00:39] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on planet2003.codfw.wmnet with reason: host reimage
[23:01:15] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] Revert "puppet-merge: test, no changes" [puppet] - 10https://gerrit.wikimedia.org/r/976818 (owner: 10JHathaway)
[23:02:34] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 7h 30m 48s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[23:02:43] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2028.mgmt.codfw.wmnet with reboot policy FORCED
[23:02:47] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2010.codfw.wmnet with reason: host reimage
[23:05:22] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on planet2003.codfw.wmnet with reason: host reimage
[23:06:34] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 6h 14m 24s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[23:09:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2029.mgmt.codfw.wmnet with reboot policy FORCED
[23:10:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2030.mgmt.codfw.wmnet with reboot policy FORCED
[23:10:54] <icinga-wm>	 PROBLEM - Check systemd state on logstash2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:34] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchElasticaWrite lag is too high: 6h 40m 25s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchElasticaWrite - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[23:13:05] <wikibugs>	 (03PS4) 10JHathaway: dev env: PS1 function for to show the puppet env [puppet] - 10https://gerrit.wikimedia.org/r/965536 (https://phabricator.wikimedia.org/T337970)
[23:13:20] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965536 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway)
[23:13:46] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2028.mgmt.codfw.wmnet with reboot policy FORCED
[23:15:02] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2057 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:15:40] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:16:02] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2031.mgmt.codfw.wmnet with reboot policy FORCED
[23:18:38] <icinga-wm>	 PROBLEM - Check systemd state on bast2003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:21:12] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:21:32] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2029.mgmt.codfw.wmnet with reboot policy FORCED
[23:21:36] <icinga-wm>	 PROBLEM - Check systemd state on krb2002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:22:39] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2030.mgmt.codfw.wmnet with reboot policy FORCED
[23:22:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2032.mgmt.codfw.wmnet with reboot policy FORCED
[23:22:54] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2062 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:23:02] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2052 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:23:30] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2033.mgmt.codfw.wmnet with reboot policy FORCED
[23:26:07] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2010.codfw.wmnet with OS bullseye
[23:26:10] <icinga-wm>	 PROBLEM - Check systemd state on ganeti-test1002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:26:23] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase2033.mgmt.codfw.wmnet with reboot policy FORCED
[23:26:26] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2032 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:26:28] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:26:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2033.mgmt.codfw.wmnet with reboot policy FORCED
[23:27:50] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1039 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:28:23] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:29:10] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2031.mgmt.codfw.wmnet with reboot policy FORCED
[23:30:12] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1017 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:30:22] <icinga-wm>	 PROBLEM - Check systemd state on kafka-main2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:30:24] <icinga-wm>	 PROBLEM - Check systemd state on ganeti1011 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:31:48] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2037 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:32:21] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2034.mgmt.codfw.wmnet with reboot policy FORCED
[23:32:50] <wikibugs>	 (03PS30) 10Dwisehaupt: Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[23:32:56] <icinga-wm>	 PROBLEM - Check systemd state on kubestage1004 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:18] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1081 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:18] <icinga-wm>	 PROBLEM - Check systemd state on graphite2004 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Initial checkin of community_civicrm module [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[23:33:20] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1135 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:54] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1037 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:34:08] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2032.mgmt.codfw.wmnet with reboot policy FORCED
[23:34:26] <icinga-wm>	 PROBLEM - Check systemd state on kafka-main2005 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:34:26] <icinga-wm>	 PROBLEM - Check systemd state on ganeti4005 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:35:32] <icinga-wm>	 PROBLEM - Check systemd state on backup1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:35:36] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:35:46] <icinga-wm>	 PROBLEM - Check systemd state on dbprov2003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:36:12] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2035.mgmt.codfw.wmnet with reboot policy FORCED
[23:36:14] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve1006 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:36:32] <icinga-wm>	 PROBLEM - Check systemd state on dumpsdata1007 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:36:42] <icinga-wm>	 PROBLEM - Check systemd state on sessionstore2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:36:48] <icinga-wm>	 PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:36:52] <icinga-wm>	 PROBLEM - Check systemd state on ganeti5005 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:36:52] <icinga-wm>	 PROBLEM - Check systemd state on backup2007 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:37:08] <icinga-wm>	 PROBLEM - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:37:26] <icinga-wm>	 PROBLEM - Check systemd state on ml-cache1003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:38:12] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2005 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:38:20] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2012 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:38:22] <icinga-wm>	 PROBLEM - Check systemd state on ganeti-test2002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:38:42] <jinxer-wm>	 (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:39:06] <icinga-wm>	 PROBLEM - Check systemd state on ganeti-test1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:36] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1142 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:39:42] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1012 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:40:56] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:41:04] <icinga-wm>	 PROBLEM - Check systemd state on ganeti5007 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:41:04] <icinga-wm>	 PROBLEM - Check systemd state on puppetdb2003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:41:10] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1109 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:42:14] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1134 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:42:26] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1012 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:43:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) export_smart_data_dump.service Failed on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:43:46] <icinga-wm>	 PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:43:52] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:43:56] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1011 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:44:02] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:44:15] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2034.mgmt.codfw.wmnet with reboot policy FORCED
[23:44:22] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1015 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:44:34] <icinga-wm>	 PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:44:52] <icinga-wm>	 PROBLEM - Check systemd state on ganeti2010 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:45:02] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase2033.mgmt.codfw.wmnet with reboot policy FORCED
[23:45:04] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2066 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:45:04] <icinga-wm>	 PROBLEM - Check systemd state on ganeti1023 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:45:39] <wikibugs>	 (PS31) Dwisehaupt: Initial checkin of community_civicrm module [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486)
[23:45:52] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:46:56] <icinga-wm>	 PROBLEM - Check systemd state on analytics1075 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:47:09] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2035.mgmt.codfw.wmnet with reboot policy FORCED
[23:47:14] <icinga-wm>	 PROBLEM - Check systemd state on ms-backup2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:47:26] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1024 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:47:26] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2035 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:48:02] <icinga-wm>	 PROBLEM - Check systemd state on ganeti1021 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:48:54] <icinga-wm>	 PROBLEM - Check systemd state on backup2006 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:49:26] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve1002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:49:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2028']
[23:50:01] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2028']
[23:50:08] <icinga-wm>	 PROBLEM - Check systemd state on cp4037 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:50:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2028']
[23:50:38] <wikibugs>	 (CR) Dwisehaupt: "Minor changes to not show diff with db password on the grants file. And update grants to the current grants used." [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[23:50:50] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2028']
[23:50:58] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1045 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:50:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2028']
[23:51:04] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2046 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:51:08] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase2028']
[23:51:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2028']
[23:51:50] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase2028']
[23:52:04] <icinga-wm>	 PROBLEM - Check systemd state on an-conf1003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:52:50] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1011 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:53:40] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2029']
[23:53:51] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase2029']
[23:54:00] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2029']
[23:54:26] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2029']
[23:54:39] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2030']
[23:54:43] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2033.mgmt.codfw.wmnet with reboot policy FORCED
[23:55:15] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2030']
[23:55:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2031']
[23:55:59] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2031']
[23:56:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2032']
[23:56:37] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2032']
[23:56:51] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2033']
[23:57:42] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase2033']
[23:58:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2034']
[23:58:37] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2034']
[23:58:54] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2035']
[23:59:09] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['restbase2035']