[00:00:18] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:03:35] <jinxer-wm>	 (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 8h 53m 33s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh
[00:03:56] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:26:09] <wikibugs>	 (PS4) Cwhite: elasticsearch: move to opensearch client [software/spicerack] - https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: David Caro)
[00:27:26] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1008.wikimedia.org with OS bullseye
[00:32:12] <wikibugs>	 (CR) CI reject: [V: -1] elasticsearch: move to opensearch client [software/spicerack] - https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: David Caro)
[00:38:09] <wikibugs>	 (CR) Cwhite: "Tests appear unhappy because aiohttp is missing?" [software/spicerack] - https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: David Caro)
[00:39:00] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974631
[00:39:06] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974631 (owner: TrainBranchBot)
[00:44:39] <wikibugs>	 (CR) Cwhite: elasticsearch: move to opensearch client (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: David Caro)
[00:58:49] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/974631 (owner: TrainBranchBot)
[01:06:10] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:09:27] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:19] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:38:56] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:56:36] <icinga-wm>	 PROBLEM - Host sretest2004 is DOWN: PING CRITICAL - Packet loss = 100%
[03:04:30] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:05:49] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:08:56] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:19:16] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm
[03:40:58] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage
[03:44:01] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2001-dev.codfw.wmnet with reason: host reimage
[03:54:36] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:58:56] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:58:56] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:24:57] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.wikimedia.org with OS bullseye
[04:50:36] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T348183)', diff saved to https://phabricator.wikimedia.org/P53495 and previous config saved to /var/cache/conftool/dbconfig/20231116-045035-arnaudb.json
[04:50:40] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[04:57:52] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm
[05:05:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P53496 and previous config saved to /var/cache/conftool/dbconfig/20231116-050542-arnaudb.json
[05:09:51] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:10:26] <wikibugs>	 (PS1) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[05:12:17] <wikibugs>	 (PS2) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[05:13:56] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:15:21] <wikibugs>	 (PS3) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[05:18:18] <wikibugs>	 (PS4) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[05:20:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P53497 and previous config saved to /var/cache/conftool/dbconfig/20231116-052048-arnaudb.json
[05:21:25] <wikibugs>	 (PS5) Pppery: Clean up a bunch of things [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694)
[05:26:55] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:29:29] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T348183)', diff saved to https://phabricator.wikimedia.org/P53498 and previous config saved to /var/cache/conftool/dbconfig/20231116-053554-arnaudb.json
[05:35:57] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[05:36:00] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[05:36:10] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[05:36:16] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T348183)', diff saved to https://phabricator.wikimedia.org/P53499 and previous config saved to /var/cache/conftool/dbconfig/20231116-053616-arnaudb.json
[05:48:10] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1007.wikimedia.org with OS bullseye
[05:54:09] <wikibugs>	 (CR) Ilias Sarantopoulos: team-ml: add alert for memory spike in inf services (1 comment) [alerts] - https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[05:58:21] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:58:56] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:59:28] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:07:17] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2004.codfw.wmnet with OS bullseye
[06:07:24] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye...
[06:07:25] <icinga-wm>	 RECOVERY - Host sretest2004 is UP: PING OK - Packet loss = 0%, RTA = 30.56 ms
[06:15:01] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:17:19] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:18:56] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:29:00] <wikibugs>	 (PS1) Marostegui: db1164: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/974721 (https://phabricator.wikimedia.org/T351176)
[06:29:36] <wikibugs>	 (CR) Marostegui: [C: +2] db1164: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/974721 (https://phabricator.wikimedia.org/T351176) (owner: Marostegui)
[06:29:40] <jinxer-wm>	 (NELNotReported) firing: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[06:30:15] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:30:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1119,1164,1217].eqiad.wmnet with reason: Switch
[06:30:45] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1119,1164,1217].eqiad.wmnet with reason: Switch
[06:33:56] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:34:40] <jinxer-wm>	 (NELNotReported) resolved: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[06:37:11] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:41:24] <wikibugs>	 (PS2) KartikMistry: testwiki: Enable the Unified Content Translation Dashboard [mediawiki-config] - https://gerrit.wikimedia.org/r/973170 (https://phabricator.wikimedia.org/T337915)
[06:44:05] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:48:11] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:33] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db1164 to m1 master [puppet] - https://gerrit.wikimedia.org/r/974722 (https://phabricator.wikimedia.org/T351176)
[06:55:05] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:55:13] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:58:56] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:05:49] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:08:33] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission dbproxy1017.eqiad.wmnet - https://phabricator.wikimedia.org/T348956 (Marostegui) This is ready for #dc-ops
[07:16:08] <wikibugs>	 (PS1) Urbanecm: mediawiki: Add missing frequency param to the purge_temporary_accounts job [puppet] - https://gerrit.wikimedia.org/r/974726 (https://phabricator.wikimedia.org/T344695)
[07:18:26] <wikibugs>	 (PS1) Marostegui: report_users: Remove dbproxy1011 address [software] - https://gerrit.wikimedia.org/r/974727 (https://phabricator.wikimedia.org/T202367)
[07:19:04] <wikibugs>	 (CR) Marostegui: [C: +2] report_users: Remove dbproxy1011 address [software] - https://gerrit.wikimedia.org/r/974727 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[07:19:36] <wikibugs>	 (Merged) jenkins-bot: report_users: Remove dbproxy1011 address [software] - https://gerrit.wikimedia.org/r/974727 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[07:22:18] <wikibugs>	 (PS1) Urbanecm: IP Masking temp account expiry: Fix a typo [mediawiki-config] - https://gerrit.wikimedia.org/r/974728 (https://phabricator.wikimedia.org/T344695)
[07:22:27] <urbanecm>	 jouncebot: nowandnext
[07:22:27] <jouncebot>	 For the next 0 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T0700)
[07:22:28] <jouncebot>	 For the next 0 hour(s) and 7 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T0700)
[07:22:28] <jouncebot>	 In 0 hour(s) and 37 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T0800)
[07:26:29] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:28:56] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:30:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: prometheus::pop
[07:31:44] <wikibugs>	 (PS1) Muehlenhoff: Switch prometheus::pop to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974828 (https://phabricator.wikimedia.org/T349619)
[07:33:36] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch prometheus::pop to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974828 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:37:52] <icinga-wm>	 PROBLEM - Check systemd state on dbstore1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:40:02] <icinga-wm>	 RECOVERY - Check systemd state on dbstore1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:40:19] <wikibugs>	 (CR) Arnaudb: [C: +1] mariadb: Promote db1164 to m1 master [puppet] - https://gerrit.wikimedia.org/r/974722 (https://phabricator.wikimedia.org/T351176) (owner: Marostegui)
[07:42:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: prometheus::pop
[07:43:24] <icinga-wm>	 PROBLEM - Check systemd state on dbstore1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:57] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[07:48:04] <icinga-wm>	 RECOVERY - Check systemd state on dbstore1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:49] <wikibugs>	 (PS1) Marostegui: pc2014: Move it to pc2 [puppet] - https://gerrit.wikimedia.org/r/974868 (https://phabricator.wikimedia.org/T351285)
[07:50:43] <wikibugs>	 (CR) Marostegui: [C: +2] pc2014: Move it to pc2 [puppet] - https://gerrit.wikimedia.org/r/974868 (https://phabricator.wikimedia.org/T351285) (owner: Marostegui)
[07:51:48] <icinga-wm>	 PROBLEM - Check systemd state on dbstore1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:53:04] <icinga-wm>	 RECOVERY - Check systemd state on dbstore1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host ncredir4001.ulsfo.wmnet
[07:55:41] <wikibugs>	 (PS1) Muehlenhoff: Switch ncredir4001 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974870 (https://phabricator.wikimedia.org/T349619)
[07:56:42] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch ncredir4001 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974870 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:56:58] <icinga-wm>	 PROBLEM - Check systemd state on dbstore1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:58:56] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:03:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ncredir4001.ulsfo.wmnet
[08:07:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host prometheus2006.codfw.wmnet
[08:09:07] <moritzm>	 !log installing python-git security updates
[08:09:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:59] <wikibugs>	 (PS1) Muehlenhoff: Switch prometheus2006 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974908 (https://phabricator.wikimedia.org/T349619)
[08:12:33] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.puppet.migrate-host for host cloudcumin2001.codfw.wmnet
[08:12:54] <wikibugs>	 (PS1) Majavah: hieradata: migrate cloudcumin2001 to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974923
[08:13:32] <wikibugs>	 (CR) Majavah: [C: +2] hieradata: migrate cloudcumin2001 to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974923 (owner: Majavah)
[08:14:31] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch prometheus2006 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974908 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:17:01] <moritzm>	 !log installing elfutils security updates
[08:17:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:09] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcumin2001.codfw.wmnet
[08:18:26] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 3/6 UP : OSPFv3: 3/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:18:46] <icinga-wm>	 PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:19:01] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.puppet.migrate-host for host clouddumps1001.wikimedia.org
[08:19:10] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:19:14] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:19:24] <icinga-wm>	 PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:19:37] <wikibugs>	 (PS1) Majavah: hieradata: migrate clouddumps1001 to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974924
[08:20:20] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:20:24] <wikibugs>	 (PS1) Slyngshede: P:idm add fqdn for the host as an Apache server alias. [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343)
[08:20:30] <wikibugs>	 (CR) Majavah: [C: +2] hieradata: migrate clouddumps1001 to puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974924 (owner: Majavah)
[08:20:50] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:21:12] <icinga-wm>	 RECOVERY - BFD status on cr2-drmrs is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:21:36] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:21:38] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:21:42] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:21:52] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:21:57] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/504/con" [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: Slyngshede)
[08:22:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host prometheus2006.codfw.wmnet
[08:23:28] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:25:06] <wikibugs>	 (CR) Slyngshede: [C: -1] "That, not actually enough, you could still use the IP." [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: Slyngshede)
[08:25:38] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host clouddumps1001.wikimedia.org
[08:30:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: thanos::frontend
[08:31:47] <wikibugs>	 (PS7) Hashar: gerrit: change scap user to gerrit-deploy [puppet] - https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412)
[08:31:54] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:31:59] <wikibugs>	 (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) (owner: Hashar)
[08:32:26] <wikibugs>	 (PS1) Muehlenhoff: Switch thanos::frontend to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974927 (https://phabricator.wikimedia.org/T349619)
[08:34:14] <moritzm>	 !log installing ruby-rails-html-sanitizer security updates
[08:34:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:22] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:34:42] <wikibugs>	 (CR) Kosta Harlan: [C: +2] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/974532 (https://phabricator.wikimedia.org/T351300) (owner: Tchanders)
[08:35:32] <wikibugs>	 (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/974532 (https://phabricator.wikimedia.org/T351300) (owner: Tchanders)
[08:36:05] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch thanos::frontend to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974927 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:36:29] <wikibugs>	 (PS1) Brouberol: Define a wmflib function to compute the last IP in a subnet [puppet] - https://gerrit.wikimedia.org/r/974928 (https://phabricator.wikimedia.org/T351059)
[08:36:50] <wikibugs>	 (PS2) Slyngshede: P:idm add fqdn for the host as an Apache server alias. [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343)
[08:37:13] <wikibugs>	 (CR) CI reject: [V: -1] Define a wmflib function to compute the last IP in a subnet [puppet] - https://gerrit.wikimedia.org/r/974928 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol)
[08:37:29] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[08:37:54] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[08:38:09] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/505/con" [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: Slyngshede)
[08:38:24] <wikibugs>	 (CR) Brouberol: [V: +1] Automatically generate autoinstall subnet DHCP config files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol)
[08:38:57] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:39:34] <jinxer-wm>	 (NELNotReported) firing: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[08:39:40] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:39:49] <Kizule>	 Hi, as we are still in train window time, can we deploy https://phabricator.wikimedia.org/T351048?
[08:40:21] <wikibugs>	 (CR) Slyngshede: [V: +1] "While I'm not loving this, it does work. Obviously the IDM is differently than other Apache2 based application, so the correct permanent s" [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: Slyngshede)
[08:41:43] <wikibugs>	 (CR) Majavah: P:idm add fqdn for the host as an Apache server alias. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: Slyngshede)
[08:42:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: thanos::frontend
[08:42:35] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] P:idm add fqdn for the host as an Apache server alias. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: 10Slyngshede)
[08:42:54] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "This should have a changelog entry" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/974662 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan)
[08:43:16] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:57] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:44:26] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:44:34] <jinxer-wm>	 (NELNotReported) resolved: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[08:48:57] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:49:52] <wikibugs>	 (03CR) 10Arnaudb: mariadb: clone and upgrade mariadb (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[08:50:38] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:57:46] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:58:04] <icinga-wm>	 PROBLEM - Check systemd state on ml-cache2003 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: kubernetes::worker
[09:00:07] <jouncebot>	 jeena and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T0900).
[09:00:27] <godog>	 !log bounce prometheus instances on prometheus2006 to test p7 upgrade
[09:00:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:45] <jinxer-wm>	 (NELNotReported) firing: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[09:02:16] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:03:11] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch kubernetes::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974931 (https://phabricator.wikimedia.org/T349619)
[09:03:57] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:05:02] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Revert "profile::pyrra::filesystem: improve/fix lift wing pilot" [puppet] - 10https://gerrit.wikimedia.org/r/974241
[09:05:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch kubernetes::worker to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974931 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[09:06:45] <jinxer-wm>	 (NELNotReported) resolved: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[09:07:56] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:08:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "profile::pyrra::filesystem: improve/fix lift wing pilot" [puppet] - 10https://gerrit.wikimedia.org/r/974241 (owner: 10Filippo Giunchedi)
[09:08:46] <wikibugs>	 (03PS3) 10Slyngshede: P:idm add fqdn for the host as an Apache server alias. [puppet] - 10https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343)
[09:09:25] <wikibugs>	 (03PS1) 10Abijeet Patro: TranslatablePageMarker: Add patrol status for translatable page [extensions/Translate] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/974242 (https://phabricator.wikimedia.org/T351273)
[09:09:28] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:09:56] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 5%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53500 and previous config saved to /var/cache/conftool/dbconfig/20231116-090955-arnaudb.json
[09:10:00] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/506/con" [puppet] - 10https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: 10Slyngshede)
[09:10:04] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:29] <wikibugs>	 (03PS4) 10Slyngshede: P:idm Limit the domains Envoy will proxy. [puppet] - 10https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343)
[09:13:57] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:14:10] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:14:22] <wikibugs>	 (03PS1) 10Slyngshede: P:IDM Limit Envoy proxing for idm-test. [puppet] - 10https://gerrit.wikimedia.org/r/974933 (https://phabricator.wikimedia.org/T351343)
[09:14:31] <jinxer-wm>	 (NELNotReported) firing: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[09:14:40] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:16:25] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: re-enable notifications for db1238 [puppet] - 10https://gerrit.wikimedia.org/r/974632 (https://phabricator.wikimedia.org/T344036)
[09:16:37] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/507/con" [puppet] - 10https://gerrit.wikimedia.org/r/974933 (https://phabricator.wikimedia.org/T351343) (owner: 10Slyngshede)
[09:16:54] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: re-enable notifications for db1238 [puppet] - 10https://gerrit.wikimedia.org/r/974632 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb)
[09:17:33] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: re-enable notifications for db1238 [puppet] - 10https://gerrit.wikimedia.org/r/974632 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb)
[09:18:57] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:19:02] <wikibugs>	 (03CR) 10Slyngshede: "Perhaps best to split test and prod, and run test first." [puppet] - 10https://gerrit.wikimedia.org/r/974933 (https://phabricator.wikimedia.org/T351343) (owner: 10Slyngshede)
[09:19:31] <jinxer-wm>	 (NELNotReported) resolved: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[09:25:01] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 10%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53501 and previous config saved to /var/cache/conftool/dbconfig/20231116-092500-arnaudb.json
[09:28:03] <wikibugs>	 (03Abandoned) 10Hashar: Initial checkin. [software/charon] - 10https://gerrit.wikimedia.org/r/838127 (owner: 10Slyngshede)
[09:29:46] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.0.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/974935
[09:30:06] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v8.0.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/974935 (owner: 10Volans)
[09:33:20] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: add db1241 and prepare db1141 retirement [puppet] - 10https://gerrit.wikimedia.org/r/974633 (https://phabricator.wikimedia.org/T344036)
[09:34:26] <wikibugs>	 (03PS2) 10Fabfur: haproxy: re-set varnish maxconn on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/974268 (https://phabricator.wikimedia.org/T310609)
[09:35:26] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:35:41] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.0.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/974935 (owner: 10Volans)
[09:37:23] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "ulsfo looking good per pybal logs.. LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/974268 (https://phabricator.wikimedia.org/T310609) (owner: 10Fabfur)
[09:38:07] <wikibugs>	 (03PS1) 10Volans: Upstream release v8.0.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/974937
[09:38:35] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1164 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/974722 (https://phabricator.wikimedia.org/T351176) (owner: 10Marostegui)
[09:38:49] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v8.0.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/974937 (owner: 10Volans)
[09:38:57] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:39:28] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:40:04] <wikibugs>	 (03PS2) 10Hnowlan: envoy: use ENTRYPOINT instead of CMD [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/974662 (https://phabricator.wikimedia.org/T300033)
[09:40:06] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 15%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53502 and previous config saved to /var/cache/conftool/dbconfig/20231116-094005-arnaudb.json
[09:40:32] <wikibugs>	 (03CR) 10Hnowlan: envoy: use ENTRYPOINT instead of CMD (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/974662 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan)
[09:40:48] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:43:49] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] haproxy: re-set varnish maxconn on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/974268 (https://phabricator.wikimedia.org/T310609) (owner: 10Fabfur)
[09:45:39] <wikibugs>	 (03PS2) 10D3r1ck01: wmf-config: Introduce setting for "mcrouter-primary-dc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974506 (https://phabricator.wikimedia.org/T336004)
[09:45:45] <fabfur>	 !log applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/974268 to all cp hosts everywhere (setting maxconn on varnish to 20k) T310609
[09:45:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Upstream release v8.0.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/974937 (owner: 10Volans)
[09:47:17] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v8.0.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/974937 (owner: 10Volans)
[09:47:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[09:48:29] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[09:48:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/974648 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis)
[09:50:55] <wikibugs>	 (03PS1) 10Kosta Harlan: ipoid: Add DATADIR environment variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/974939 (https://phabricator.wikimedia.org/T350500)
[09:51:56] <volans>	 !log uploaded spicerack_8.0.3 to apt.wikimedia.org bullseye-wikimedia
[09:51:59] <volans>	 moritzm: FYI ^^^
[09:52:11] <volans>	 I will deploy it shortly to the cumin hosts
[09:53:24] <moritzm>	 excellent, thanks
[09:54:24] <icinga-wm>	 RECOVERY - Check systemd state on ml-cache2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:54:26] <wikibugs>	 (03PS1) 10MVernon: admin: add hghani, osefu to analytics-product-users [puppet] - 10https://gerrit.wikimedia.org/r/974940 (https://phabricator.wikimedia.org/T351130)
[09:55:11] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 30%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53503 and previous config saved to /var/cache/conftool/dbconfig/20231116-095510-arnaudb.json
[09:56:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kubernetes::worker
[09:59:28] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:01:05] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update the contact info for the wikireplica servers [puppet] - 10https://gerrit.wikimedia.org/r/973203 (https://phabricator.wikimedia.org/T345698) (owner: 10Btullis)
[10:01:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: thanos::backend
[10:01:36] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host sretest2003.mgmt.codfw.wmnet with reboot policy FORCED
[10:02:03] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[10:03:10] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add a prometheus_instance parameter to prometheus::statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/973320 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis)
[10:03:19] <godog>	 !log bounce thanos components on titan1001
[10:03:19] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch thanos::backend to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974941 (https://phabricator.wikimedia.org/T349619)
[10:03:40] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] swift: migrate one node to envoy for TLS termination (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[10:03:57] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:04:43] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch thanos::backend to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974941 (https://phabricator.wikimedia.org/T349619)
[10:05:27] <jynus>	 !log stopping bacula on backup1001
[10:05:31] <wikibugs>	 (03PS2) 10Hnowlan: service: move mw-jobrunner to prod [puppet] - 10https://gerrit.wikimedia.org/r/973827 (https://phabricator.wikimedia.org/T349796)
[10:05:56] <jynus>	 prometheus job runner for backup1001 will complain for a bit
[10:06:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch thanos::backend to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974941 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:08:20] <wikibugs>	 (03PS1) 10Jbond: sre.hosts.reimage: add acmechief_host hiera data [cookbooks] - 10https://gerrit.wikimedia.org/r/974942
[10:08:30] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.wikimedia.org with OS bullseye
[10:08:36] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:09:45] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] NTP: alert on ntp/time errors [alerts] - 10https://gerrit.wikimedia.org/r/973306 (owner: 10Slyngshede)
[10:10:15] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 45%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53504 and previous config saved to /var/cache/conftool/dbconfig/20231116-101015-arnaudb.json
[10:11:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/974940 (https://phabricator.wikimedia.org/T351130) (owner: 10MVernon)
[10:11:33] <wikibugs>	 (03Merged) 10jenkins-bot: NTP: alert on ntp/time errors [alerts] - 10https://gerrit.wikimedia.org/r/973306 (owner: 10Slyngshede)
[10:12:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: thanos::backend
[10:12:46] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] admin: add hghani, osefu to analytics-product-users [puppet] - 10https://gerrit.wikimedia.org/r/974940 (https://phabricator.wikimedia.org/T351130) (owner: 10MVernon)
[10:13:04] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[10:13:57] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:15:05] <jynus>	 ^that's me
[10:17:19] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:17:34] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add Hamid & Omari to analytics-product-users - https://phabricator.wikimedia.org/T351130 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon This is now done (but allow a little while for puppet to do its thing).
[10:20:43] <marostegui>	 !log Failover m1 from db1119 to db1164 - T351176
[10:21:06] <marostegui>	 done
[10:21:48] <volans>	 !log installed spicerack v8.0.3 on the cumin hosts
[10:21:51] <marostegui>	 etherpad works fine
[10:22:26] <jynus>	 marostegui: there is one puppet thing I think I saw 
[10:22:35] <marostegui>	 jynus: which one?
[10:22:36] <jynus>	 shouldn't it say 10.6 in hiera?
[10:22:43] <jynus>	 maybe I am wrong
[10:22:54] <marostegui>	 good point!
[10:22:54] <jynus>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/974722/2/hieradata/hosts/db1164.yaml
[10:23:04] <marostegui>	 fixing
[10:23:23] <marostegui>	 jynus: this can be merged too: https://gerrit.wikimedia.org/r/c/operations/puppet/+/973804
[10:23:39] <jynus>	 if you are asking, yes, please do when you can
[10:24:03] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "dbbackups: Switchover master from db1164 to db1119" [puppet] - 10https://gerrit.wikimedia.org/r/973804 (owner: 10Marostegui)
[10:24:10] <wikibugs>	 (03PS1) 10Marostegui: db1164: Add mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/974945
[10:24:23] <jynus>	 let me know when you're finished so I can run puppet and restart bacula
[10:24:28] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1007.wikimedia.org with reason: host reimage
[10:24:36] <marostegui>	 jynus: it is merged now
[10:24:51] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1164: Add mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/974945 (owner: 10Marostegui)
[10:24:54] <jynus>	 !log reenabling puppet and starting bacula on backup1001
[10:25:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 60%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53505 and previous config saved to /var/cache/conftool/dbconfig/20231116-102520-arnaudb.json
[10:27:26] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1007.wikimedia.org with reason: host reimage
[10:28:57] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:29:06] <jynus>	 ^ everything looking good
[10:29:13] <marostegui>	 no pki alerts today
[10:32:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: kubernetes::master
[10:33:43] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-role (exit_code=99) for role: kubernetes::master
[10:36:04] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: add db1241 and prepare db1141 retirement [puppet] - 10https://gerrit.wikimedia.org/r/974633 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb)
[10:36:12] <wikibugs>	 (03PS1) 10Hnowlan: cassandra: remove references to graphite [puppet] - 10https://gerrit.wikimedia.org/r/974946 (https://phabricator.wikimedia.org/T351193)
[10:36:30] <godog>	 btullis: 3 druid hosts are out of space on /, is that known?
[10:37:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on kubernetes1038:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:39:08] <wikibugs>	 10SRE-Access-Requests: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (10ehughes)
[10:39:12] <godog>	 stevemunene: ^ re: druid out of space
[10:39:50] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:40:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host kubemaster1002.eqiad.wmnet
[10:40:25] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 75%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53506 and previous config saved to /var/cache/conftool/dbconfig/20231116-104025-arnaudb.json
[10:40:50] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:40:59] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: kubernetes::global_config: add listener for mw on k8s transition [puppet] - 10https://gerrit.wikimedia.org/r/974947
[10:41:57] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch kubemaster1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974948 (https://phabricator.wikimedia.org/T349619)
[10:43:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch kubemaster1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974948 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:44:54] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/974942 (owner: 10Jbond)
[10:46:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.hosts.reimage: add acmechief_host hiera data [cookbooks] - 10https://gerrit.wikimedia.org/r/974942 (owner: 10Jbond)
[10:47:02] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest1004.eqiad.wmnet
[10:47:08] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host sretest1004.eqiad.wmnet
[10:49:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host kubemaster1002.eqiad.wmnet
[10:50:36] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reimage: add acmechief_host hiera data [cookbooks] - 10https://gerrit.wikimedia.org/r/974942 (owner: 10Jbond)
[10:52:24] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1007.wikimedia.org with OS bullseye
[10:52:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: syslog::centralserver
[10:52:59] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on kubernetes1038:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:53:26] <icinga-wm>	 PROBLEM - Check systemd state on titan1001 is CRITICAL: CRITICAL - degraded: The following units failed: pyrra-generate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:55:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 90%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53507 and previous config saved to /var/cache/conftool/dbconfig/20231116-105530-arnaudb.json
[10:56:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/974928 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[10:58:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch syslog::centralserver to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974952 (https://phabricator.wikimedia.org/T349619)
[10:58:47] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] service: move mw-jobrunner to prod [puppet] - 10https://gerrit.wikimedia.org/r/973827 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[10:59:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch syslog::centralserver to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974952 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:59:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T348183)', diff saved to https://phabricator.wikimedia.org/P53508 and previous config saved to /var/cache/conftool/dbconfig/20231116-105930-arnaudb.json
[11:00:06] <jouncebot>	 mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1100).
[11:00:07] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1100)
[11:03:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: syslog::centralserver
[11:05:49] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:07:04] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:07:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: kubernetes::master
[11:08:33] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[11:10:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1238 (re)pooling @ 100%: Post warmup repooling', diff saved to https://phabricator.wikimedia.org/P53509 and previous config saved to /var/cache/conftool/dbconfig/20231116-111035-arnaudb.json
[11:11:03] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch kubernetes::master to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974955 (https://phabricator.wikimedia.org/T349619)
[11:12:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch kubernetes::master to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974955 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[11:13:22] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:14:37] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P53510 and previous config saved to /var/cache/conftool/dbconfig/20231116-111436-arnaudb.json
[11:18:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kubernetes::master
[11:19:46] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[11:20:56] <icinga-wm>	 RECOVERY - Check systemd state on titan1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:21:31] <godog>	 that was me ^
[11:22:34] <wikibugs>	 (03PS1) 10Hnowlan: device-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974956
[11:22:59] <wikibugs>	 (03CR) 10Sergio Gimeno: [C: 03+1] mediawiki: Add missing frequency param to the purge_temporary_accounts job [puppet] - 10https://gerrit.wikimedia.org/r/974726 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm)
[11:23:47] <wikibugs>	 (03CR) 10Sg912: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/974956 (owner: 10Hnowlan)
[11:24:41] <wikibugs>	 (03CR) 10Santiago Faci: [C: 03+1] "looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/974956 (owner: 10Hnowlan)
[11:24:50] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] device-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974956 (owner: 10Hnowlan)
[11:26:19] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] device-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974956 (owner: 10Hnowlan)
[11:27:37] <wikibugs>	 (03Merged) 10jenkins-bot: device-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974956 (owner: 10Hnowlan)
[11:28:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/974648 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis)
[11:28:47] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update our kerberos scripts to remove oozie customisation [puppet] - 10https://gerrit.wikimedia.org/r/974648 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis)
[11:28:57] <wikibugs>	 (03PS2) 10Btullis: Update our kerberos scripts to remove oozie customisation [puppet] - 10https://gerrit.wikimedia.org/r/974648 (https://phabricator.wikimedia.org/T341893)
[11:29:44] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P53512 and previous config saved to /var/cache/conftool/dbconfig/20231116-112942-arnaudb.json
[11:29:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn)
[11:33:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: insetup::serviceops
[11:34:04] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply
[11:34:24] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1004.eqiad.wmnet
[11:34:24] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply
[11:34:31] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch insetup::serviceops to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974957 (https://phabricator.wikimedia.org/T349619)
[11:40:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch insetup::serviceops to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974957 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[11:41:02] <wikibugs>	 (03PS3) 10Effie Mouzeli: (WIP) kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006)
[11:44:50] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T348183)', diff saved to https://phabricator.wikimedia.org/P53513 and previous config saved to /var/cache/conftool/dbconfig/20231116-114450-arnaudb.json
[11:44:52] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[11:44:55] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[11:44:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 (10cmooney) 05Open→03Resolved Patches merged, all looking ok.  For example on dns5004 this was situation before, server using TTL 2, CR using 193: ` 19:27:22.338917 IP (...
[11:45:05] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[11:45:12] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53514 and previous config saved to /var/cache/conftool/dbconfig/20231116-114511-arnaudb.json
[11:45:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: insetup::serviceops
[11:49:35] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.puppet.migrate-host for host clouddb1021.eqiad.wmnet
[11:49:53] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/device-analytics: apply
[11:50:19] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply
[11:50:28] <wikibugs>	 (03PS1) 10Majavah: hieradata: upgrade clouddb1021 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974960
[11:50:37] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove reference to row E/F in reimage clear_dhcp_cache function [cookbooks] - 10https://gerrit.wikimedia.org/r/974961 (https://phabricator.wikimedia.org/T306421)
[11:50:42] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply
[11:51:00] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] hieradata: upgrade clouddb1021 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974960 (owner: 10Majavah)
[11:51:11] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply
[11:54:51] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[11:55:28] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host clouddb1021.eqiad.wmnet
[11:55:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host ms-fe1014.eqiad.wmnet
[11:56:32] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10taavi)
[11:57:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch ms-fe1014 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974962 (https://phabricator.wikimedia.org/T349619)
[11:57:38] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (10MatthewVernon)
[11:58:10] <wikibugs>	 (03PS4) 10Effie Mouzeli: (WIP) kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006)
[11:58:31] <wikibugs>	 (03PS5) 10Effie Mouzeli: (WIP) kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/948551 (https://phabricator.wikimedia.org/T231006)
[11:58:33] <wikibugs>	 (03PS2) 10Majavah: site: remove references to cloudmetrics hosts [puppet] - 10https://gerrit.wikimedia.org/r/973745 (https://phabricator.wikimedia.org/T351077)
[11:58:57] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:59:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/973745 (https://phabricator.wikimedia.org/T351077) (owner: 10Majavah)
[12:00:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch ms-fe1014 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974962 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:00:17] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] site: remove references to cloudmetrics hosts [puppet] - 10https://gerrit.wikimedia.org/r/973745 (https://phabricator.wikimedia.org/T351077) (owner: 10Majavah)
[12:00:29] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (10MatthewVernon) Approvals-wise, this needs manager approval from @spatton and analytics-privatedata-users approval from @odimi...
[12:04:26] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/974961 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney)
[12:07:09] <wikibugs>	 (03PS2) 10Jcrespo: [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683
[12:07:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ms-fe1014.eqiad.wmnet
[12:07:52] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] P:IDM Limit Envoy proxing for idm-test. [puppet] - 10https://gerrit.wikimedia.org/r/974933 (https://phabricator.wikimedia.org/T351343) (owner: 10Slyngshede)
[12:08:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683 (owner: 10Jcrespo)
[12:14:46] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] P:idm Limit the domains Envoy will proxy. [puppet] - 10https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: 10Slyngshede)
[12:15:16] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] P:idm Limit the domains Envoy will proxy. [puppet] - 10https://gerrit.wikimedia.org/r/974925 (https://phabricator.wikimedia.org/T351343) (owner: 10Slyngshede)
[12:16:29] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host db1124.eqiad.wmnet
[12:16:44] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Remove reference to row E/F in reimage clear_dhcp_cache function [cookbooks] - 10https://gerrit.wikimedia.org/r/974961 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney)
[12:17:30] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[12:18:34] <wikibugs>	 (03PS1) 10Jbond: db1124: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974965 (https://phabricator.wikimedia.org/T349619)
[12:18:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] db1124: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974965 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[12:20:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/974966 (owner: 10L10n-bot)
[12:20:55] <wikibugs>	 (03Merged) 10jenkins-bot: Remove reference to row E/F in reimage clear_dhcp_cache function [cookbooks] - 10https://gerrit.wikimedia.org/r/974961 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney)
[12:21:07] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Fail when setting int relations if PuppetDB parent not found in Netbox (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/973185 (https://phabricator.wikimedia.org/T350479) (owner: 10Cathal Mooney)
[12:21:41] <wikibugs>	 (03Merged) 10jenkins-bot: Fail when setting int relations if PuppetDB parent not found in Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/973185 (https://phabricator.wikimedia.org/T350479) (owner: 10Cathal Mooney)
[12:23:09] <wikibugs>	 (03PS4) 10Volans: sre.discovery.datacenter: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973)
[12:23:11] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.puppet.migrate-host for host cumin2002.codfw.wmnet
[12:23:11] <wikibugs>	 (03PS4) 10Volans: sre.discovery.service-route: customize lock args [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973)
[12:23:13] <wikibugs>	 (03PS5) 10Volans: sre.ganeti.*: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973)
[12:23:16] <wikibugs>	 (03PS1) 10Volans: sre.hardware.upgrade-firmware: prepare for fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/974970
[12:23:17] <wikibugs>	 (03PS1) 10Volans: sre.hardware.upgrade-firmware: run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/974971
[12:23:19] <wikibugs>	 (03PS1) 10Volans: sre.hardware.upgrade-firmware: add custom locking [cookbooks] - 10https://gerrit.wikimedia.org/r/974972 (https://phabricator.wikimedia.org/T341973)
[12:23:21] <wikibugs>	 (03PS1) 10Volans: sre.hosts.decommission: acquire lock for each host [cookbooks] - 10https://gerrit.wikimedia.org/r/974973 (https://phabricator.wikimedia.org/T341973)
[12:24:23] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch cumin2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974974 (https://phabricator.wikimedia.org/T349619)
[12:26:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch cumin2002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974974 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:27:10] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[12:27:40] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1124.eqiad.wmnet
[12:27:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: prepare for fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/974970 (owner: 10Volans)
[12:27:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/974971 (owner: 10Volans)
[12:29:42] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[12:29:47] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[12:29:53] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[12:33:06] <marostegui>	 !log Install Test MariaDB 10.6.16 (Bookworm) on pc2014 T351283
[12:33:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:12] <stashbot>	 T351283: Compile and package MariaDB 10.6.16 and 10.4.32 - https://phabricator.wikimedia.org/T351283
[12:33:29] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cumin2002.codfw.wmnet
[12:34:31] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] envoy: use ENTRYPOINT instead of CMD [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/974662 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan)
[12:37:04] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: add db1241 and prepare db1141 retirement [puppet] - 10https://gerrit.wikimedia.org/r/974633 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb)
[12:38:15] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[12:46:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (10cmooney) I'm gonna close this one for now, if we see an issue again we should get a better error message which should point us to what PuppetDB data triggered i...
[12:46:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (10cmooney) 05Open→03Resolved
[12:47:04] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Use default BGP multihop TTL between CRs and servers - https://phabricator.wikimedia.org/T350488 (10cmooney)
[12:53:32] <icinga-wm>	 PROBLEM - Check systemd state on puppetserver1002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:54:23] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.4 - cmooney@cumin1001
[12:54:44] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: provisionning db1241.eqiad.wmnet - T344036
[12:54:48] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: provisionning db1241.eqiad.wmnet - T344036
[12:54:49] <stashbot>	 T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036
[12:54:51] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: provisionning db1241.eqiad.wmnet - T344036
[12:55:04] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: provisionning db1241.eqiad.wmnet - T344036
[12:55:16] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'cloning db1141 - T350458', diff saved to https://phabricator.wikimedia.org/P53515 and previous config saved to /var/cache/conftool/dbconfig/20231116-125515-arnaudb.json
[12:55:21] <stashbot>	 T350458: Decommission db11[26-49] - https://phabricator.wikimedia.org/T350458
[12:55:39] <wikibugs>	 (03PS1) 10Marostegui: control-mariadb-10.6-bookworm: Bump version [software] - 10https://gerrit.wikimedia.org/r/974978 (https://phabricator.wikimedia.org/T351283)
[12:56:00] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.4 - cmooney@cumin1001
[12:56:50] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'cloning db1141 - T350458', diff saved to https://phabricator.wikimedia.org/P53516 and previous config saved to /var/cache/conftool/dbconfig/20231116-125649-arnaudb.json
[12:58:33] <wikibugs>	 (03PS23) 10Aqu: Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532)
[13:00:06] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1300)
[13:00:26] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1141.eqiad.wmnet onto db1241.eqiad.wmnet
[13:02:28] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host db1133.eqiad.wmnet
[13:04:02] <wikibugs>	 (03PS1) 10Jbond: db1133: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974980 (https://phabricator.wikimedia.org/T349619)
[13:04:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] db1133: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974980 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[13:05:14] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+1] control-mariadb-10.6-bookworm: Bump version [software] - 10https://gerrit.wikimedia.org/r/974978 (https://phabricator.wikimedia.org/T351283) (owner: 10Marostegui)
[13:05:30] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.6-bookworm: Bump version [software] - 10https://gerrit.wikimedia.org/r/974978 (https://phabricator.wikimedia.org/T351283) (owner: 10Marostegui)
[13:06:02] <wikibugs>	 (03Merged) 10jenkins-bot: control-mariadb-10.6-bookworm: Bump version [software] - 10https://gerrit.wikimedia.org/r/974978 (https://phabricator.wikimedia.org/T351283) (owner: 10Marostegui)
[13:09:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe1014.eqiad.wmnet
[13:09:24] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1133.eqiad.wmnet
[13:10:12] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host backup1001.eqiad.wmnet
[13:10:36] <wikibugs>	 (03PS2) 10Volans: sre.hardware.upgrade-firmware: prepare for fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/974970
[13:10:38] <wikibugs>	 (03PS2) 10Volans: sre.hardware.upgrade-firmware: run on all hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/974971
[13:10:40] <wikibugs>	 (03PS2) 10Volans: sre.hardware.upgrade-firmware: add custom locking [cookbooks] - 10https://gerrit.wikimedia.org/r/974972 (https://phabricator.wikimedia.org/T341973)
[13:10:42] <wikibugs>	 (03PS2) 10Volans: sre.hosts.decommission: acquire lock for each host [cookbooks] - 10https://gerrit.wikimedia.org/r/974973 (https://phabricator.wikimedia.org/T341973)
[13:14:00] <wikibugs>	 (03CR) 10Volans: sre.ganeti.*: customize lock arguments (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[13:14:08] <wikibugs>	 (03PS1) 10Jbond: backup1001: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974981 (https://phabricator.wikimedia.org/T349619)
[13:15:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] backup1001: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974981 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[13:15:15] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[13:15:24] <wikibugs>	 (03CR) 10Volans: sre.hardware.upgrade-firmware: prepare for fixes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/974970 (owner: 10Volans)
[13:17:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1014.eqiad.wmnet
[13:17:54] <wikibugs>	 (03CR) 10Volans: sre.hardware.upgrade-firmware: add custom locking (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/974972 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[13:19:59] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host backup1001.eqiad.wmnet
[13:21:46] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host dbprov2001.codfw.wmnet
[13:22:17] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:23:17] <wikibugs>	 (03PS1) 10Jbond: dbprov2001: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974983 (https://phabricator.wikimedia.org/T349619)
[13:24:34] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Define a wmflib function to compute the last IP in a subnet [puppet] - 10https://gerrit.wikimedia.org/r/974928 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[13:24:52] <wikibugs>	 (03PS1) 10Majavah: P:ldap::client: updated outdated comment [puppet] - 10https://gerrit.wikimedia.org/r/974985
[13:25:04] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "❤️" [cookbooks] - 10https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[13:26:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] dbprov2001: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974983 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[13:26:29] <wikibugs>	 (03PS3) 10Jcrespo: [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683
[13:26:31] <wikibugs>	 (03PS1) 10Jcrespo: Transferer: Update logic for is_empty_dir() to avoid future bugs [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986
[13:26:36] <wikibugs>	 (03CR) 10Marostegui: "Remember to update the doc (if needed)" [cookbooks] - 10https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[13:27:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Transferer: Update logic for is_empty_dir() to avoid future bugs [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 (owner: 10Jcrespo)
[13:27:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683 (owner: 10Jcrespo)
[13:28:49] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host ms-be2050.codfw.wmnet
[13:28:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host ms-be2050.codfw.wmnet
[13:30:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch bs-be2050 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974987 (https://phabricator.wikimedia.org/T349619)
[13:30:31] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host dbprov2001.codfw.wmnet
[13:31:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch bs-be2050 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974987 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[13:33:05] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: clone and upgrade mariadb [cookbooks] - 10https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[13:33:29] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host backup2001.codfw.wmnet
[13:34:50] <wikibugs>	 (03PS1) 10Jbond: backup2001: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974988 (https://phabricator.wikimedia.org/T349619)
[13:34:57] <sergi0>	 !log stat1008: Add `sowiki`, `stwiki`, `tgwiki` and `ugwiki` to `/srv/published/datasets/one-off/research-mwaddlink/wikis.txt` (T340944)
[13:35:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:03] <wikibugs>	 (03CR) 10Muehlenhoff: sre.ganeti.*: customize lock arguments (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[13:35:10] <stashbot>	 T340944: The published dataset's list of wikis misses a couple of wikis with existing data - https://phabricator.wikimedia.org/T340944
[13:35:19] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[13:35:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/974985 (owner: 10Majavah)
[13:35:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] backup2001: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974988 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[13:36:36] <wikibugs>	 (03CR) 10Btullis: "Adding Filippo to verify that the alertmanager config is correct." [puppet] - 10https://gerrit.wikimedia.org/r/974652 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis)
[13:36:50] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] P:ldap::client: updated outdated comment [puppet] - 10https://gerrit.wikimedia.org/r/974985 (owner: 10Majavah)
[13:37:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ms-be2050.codfw.wmnet
[13:39:46] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host backup2001.codfw.wmnet
[13:40:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet
[13:42:48] <wikibugs>	 (03PS1) 10Jforrester: Conditionally render the content of header-action instead of the slot [extensions/WikiLambda] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/974244 (https://phabricator.wikimedia.org/T351121)
[13:44:22] <jynus>	 !log restart bacula at backup1001
[13:44:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:46] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: modules/mesh: add new configuration and networkpolicy modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/974990
[13:44:48] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: modules/mesh: add capability for traffic splitting [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 (https://phabricator.wikimedia.org/T350846)
[13:45:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] modules/mesh: add capability for traffic splitting [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto)
[13:47:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet
[13:49:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: prometheus
[13:50:27] <icinga-wm>	 RECOVERY - Check systemd state on puppetserver1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:55] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1091 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:52:31] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch prometheus to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974992 (https://phabricator.wikimedia.org/T349619)
[13:53:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch prometheus to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974992 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[13:54:17] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1013 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:06] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1400). Please do the needful.
[14:00:07] <jouncebot>	 abijeet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:18] <wikibugs>	 (03PS1) 10Arnaudb: decommission: db1136 [puppet] - 10https://gerrit.wikimedia.org/r/974634 (https://phabricator.wikimedia.org/T351065)
[14:00:26] <TheresNoTime>	 (unable to deploy today, sorry!)
[14:00:40] <wikibugs>	 (03PS1) 10Btullis: Add spark.sql.warehouse.dir to spark3 defaults [puppet] - 10https://gerrit.wikimedia.org/r/974993 (https://phabricator.wikimedia.org/T349523)
[14:01:15] <kart_>	 I can take care of backport.
[14:01:21] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] decommission: db1136 [puppet] - 10https://gerrit.wikimedia.org/r/974634 (https://phabricator.wikimedia.org/T351065) (owner: 10Arnaudb)
[14:01:23] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] service: move mw-jobrunner to prod [puppet] - 10https://gerrit.wikimedia.org/r/973827 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[14:01:25] <wikibugs>	 (03PS2) 10Btullis: Add spark.sql.warehouse.dir to spark3 defaults [puppet] - 10https://gerrit.wikimedia.org/r/974993 (https://phabricator.wikimedia.org/T349523)
[14:02:08] <kart_>	 abijeet: hi. I'll start the deployment.
[14:02:17] <abijeet>	 kart_, ok, thanks!
[14:02:20] <kart_>	 and let you know once it is ready for testing on mwdebug.
[14:02:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy2002 using scap backport" [extensions/Translate] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/974242 (https://phabricator.wikimedia.org/T351273) (owner: 10Abijeet Patro)
[14:03:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: prometheus
[14:04:23] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2034 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:07] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/508/co" [puppet] - 10https://gerrit.wikimedia.org/r/974993 (https://phabricator.wikimedia.org/T349523) (owner: 10Btullis)
[14:06:19] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[14:06:49] * Lucas_WMDE also around now if needed
[14:07:12] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/974946 (https://phabricator.wikimedia.org/T351193) (owner: 10Hnowlan)
[14:07:19] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T349796)
[14:07:23] <stashbot>	 T349796: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796
[14:07:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: kafka::monitoring_bullseye
[14:08:40] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T349796)
[14:08:55] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Adjust homer templates to support anycast gw with single IP [homer/public] - 10https://gerrit.wikimedia.org/r/973267 (https://phabricator.wikimedia.org/T350579) (owner: 10Cathal Mooney)
[14:09:03] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T349796)
[14:10:01] <wikibugs>	 (03PS1) 10Jbond: WIP: update get_ca_server [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995
[14:10:01] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T349796)
[14:10:38] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/509/con" [puppet] - 10https://gerrit.wikimedia.org/r/974947 (owner: 10Giuseppe Lavagetto)
[14:11:25] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch kafka::monitoring_bullseye to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974996 (https://phabricator.wikimedia.org/T349619)
[14:12:54] <wikibugs>	 (03PS2) 10Clément Goubert: mediawiki: Fix rsyslog errorlog ruleset [deployment-charts] - 10https://gerrit.wikimedia.org/r/974521 (https://phabricator.wikimedia.org/T350430)
[14:13:10] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: kubernetes::global_config: add listener for mw on k8s transition [puppet] - 10https://gerrit.wikimedia.org/r/974947
[14:13:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch kafka::monitoring_bullseye to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974996 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[14:15:31] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/510/con" [puppet] - 10https://gerrit.wikimedia.org/r/974947 (owner: 10Giuseppe Lavagetto)
[14:15:42] <jbond>	 !log stop puppet on puppet7 agents to debug puppet performance
[14:15:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:52] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/511/console" [puppet] - 10https://gerrit.wikimedia.org/r/974946 (https://phabricator.wikimedia.org/T351193) (owner: 10Hnowlan)
[14:16:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: update get_ca_server [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (owner: 10Jbond)
[14:19:00] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1 C: 03+2] cassandra: remove references to graphite [puppet] - 10https://gerrit.wikimedia.org/r/974946 (https://phabricator.wikimedia.org/T351193) (owner: 10Hnowlan)
[14:20:20] <wikibugs>	 (03Merged) 10jenkins-bot: TranslatablePageMarker: Add patrol status for translatable page [extensions/Translate] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/974242 (https://phabricator.wikimedia.org/T351273) (owner: 10Abijeet Patro)
[14:20:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kafka::monitoring_bullseye
[14:21:34] <logmsgbot>	 !log kartik@deploy2002 Started scap: Backport for [[gerrit:974242|TranslatablePageMarker: Add patrol status for translatable page (T351273)]]
[14:21:35] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Had a look at https://puppet-compiler.wmflabs.org/output/974500/502/apt1001.wikimedia.org/fulldiff.html and lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[14:21:39] <stashbot>	 T351273: Revisions for unit markers addition are not longer autopatrolled - https://phabricator.wikimedia.org/T351273
[14:21:54] <claime>	 jouncebot: nowandnext
[14:21:54] <jouncebot>	 For the next 0 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1400)
[14:21:54] <jouncebot>	 In 2 hour(s) and 38 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1700)
[14:22:34] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[14:23:03] <logmsgbot>	 !log kartik@deploy2002 kartik and abi: Backport for [[gerrit:974242|TranslatablePageMarker: Add patrol status for translatable page (T351273)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:23:12] <urbanecm>	 kart_: hi, can you ping me when done deploying please?
[14:23:16] <claime>	 same :D
[14:24:50] <urbanecm>	 wanna go before or after me? :D
[14:25:14] <kart_>	 urbanecm: sure
[14:25:19] <kart_>	 claime: sure
[14:25:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "im going to merge this now.  we have a few issues and im hoping this will help" [puppet] - 10https://gerrit.wikimedia.org/r/974283 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway)
[14:25:31] <kart_>	 abijeet: can you test the patch on mwdebug servers?
[14:25:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetserver: change ssldir to a concat fragment [puppet] - 10https://gerrit.wikimedia.org/r/974282 (owner: 10JHathaway)
[14:25:35] <abijeet>	 kart_, ok
[14:25:39] <claime>	 urbanecm: I'm deploying my syslog rule update to mw-on-k8s, I think we can go at the same time depending on what you're deploying
[14:27:53] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1 C: 03+2] Automatically generate autoinstall subnet DHCP config files [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[14:28:41] <urbanecm>	 claime: i just want to run git pull on deployment host, it's a beta-specific change :))
[14:29:01] <urbanecm>	 so unless you plan on stopping it for a while, i think we can do both changes at once.
[14:29:06] <claime>	 yep
[14:29:32] <abijeet>	 kart_, doesn't appear to break anything so I'd say we are good to go. will need to ask someone who has autopatrol permissions to check.
[14:29:32] <wikibugs>	 (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/974966 (owner: 10L10n-bot)
[14:29:49] <urbanecm>	 abijeet: which wiki?
[14:29:53] <kart_>	 abijeet: cool.
[14:30:06] <abijeet>	 urbanecm, metawiki
[14:30:17] <abijeet>	 urbanecm, specifically this issue: https://phabricator.wikimedia.org/T351273
[14:30:32] <urbanecm>	 abijeet: what is your username? I'll grant them to you
[14:30:42] <wikibugs>	 (03CR) 10Joal: [C: 03+1] Add spark.sql.warehouse.dir to spark3 defaults [puppet] - 10https://gerrit.wikimedia.org/r/974993 (https://phabricator.wikimedia.org/T349523) (owner: 10Btullis)
[14:31:01] <wikibugs>	 (03PS1) 10Muehlenhoff: Add crm1001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/975000 (https://phabricator.wikimedia.org/T349402)
[14:31:08] <wikibugs>	 (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/973756 (owner: 10L10n-bot)
[14:31:13] <abijeet>	 urbanecm, APatro (WMF) - https://meta.wikimedia.org/wiki/User:APatro_(WMF)
[14:31:29] <wikibugs>	 10SRE, 10SRE-Access-Requests: Add Hamid & Omari to analytics-product-users - https://phabricator.wikimedia.org/T351130 (10mpopov) Lovely! Thank you @MatthewVernon!  @OSefu-WMF @Hghani: okay, both of you should now be able to run commands like the ones documented in T350750#9316104
[14:31:43] <urbanecm>	 abijeet: you already seem to have autopatrol?
[14:32:21] <icinga-wm>	 PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:32:22] <urbanecm>	 (it is bundled within "Translation administrators")
[14:32:41] <abijeet>	 ah ok.
[14:32:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add crm1001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/975000 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff)
[14:34:23] <wikibugs>	 (03PS1) 10Muehlenhoff: Add crm* pattern to partman setup [puppet] - 10https://gerrit.wikimedia.org/r/975002 (https://phabricator.wikimedia.org/T349402)
[14:34:28] <kart_>	 abijeet: let me know when we're ready :)
[14:34:34] <abijeet>	 urbanecm, thanks I see it now on https://meta.wikimedia.org/wiki/Special:ListGroupRights
[14:36:28] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: modules/mesh: add new configuration and networkpolicy modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/974990
[14:36:30] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: modules/mesh: add capability for traffic splitting [deployment-charts] - 10https://gerrit.wikimedia.org/r/974991 (https://phabricator.wikimedia.org/T350846)
[14:36:51] <abijeet>	 kart_, we can go ahead. this change is not breaking any existing functionality.
[14:37:23] <kart_>	 awesome!
[14:37:26] <logmsgbot>	 !log kartik@deploy2002 kartik and abi: Continuing with sync
[14:38:39] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Add spark.sql.warehouse.dir to spark3 defaults [puppet] - 10https://gerrit.wikimedia.org/r/974993 (https://phabricator.wikimedia.org/T349523) (owner: 10Btullis)
[14:38:57] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add crm* pattern to partman setup [puppet] - 10https://gerrit.wikimedia.org/r/975002 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff)
[14:42:30] <wikibugs>	 (03CR) 10JHathaway: puppetserver: cache code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974283 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway)
[14:43:16] <logmsgbot>	 !log kartik@deploy2002 Finished scap: Backport for [[gerrit:974242|TranslatablePageMarker: Add patrol status for translatable page (T351273)]] (duration: 21m 41s)
[14:43:20] <stashbot>	 T351273: Revisions for unit markers addition are not longer autopatrolled - https://phabricator.wikimedia.org/T351273
[14:43:56] <jbond>	 !log re-enable puppet on puppet7 agents
[14:43:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:03] <kart_>	 abijeet: we are done.
[14:44:19] <kart_>	 urbanecm, claime: I'm done with the deployment.
[14:44:24] <urbanecm>	 thanks
[14:44:30] <claime>	 tyvm
[14:44:36] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Fix rsyslog errorlog ruleset [deployment-charts] - 10https://gerrit.wikimedia.org/r/974521 (https://phabricator.wikimedia.org/T350430) (owner: 10Clément Goubert)
[14:44:45] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] IP Masking temp account expiry: Fix a typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974728 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm)
[14:45:34] <wikibugs>	 (03Merged) 10jenkins-bot: IP Masking temp account expiry: Fix a typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974728 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm)
[14:45:46] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 141626
[14:45:55] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: Fix rsyslog errorlog ruleset [deployment-charts] - 10https://gerrit.wikimedia.org/r/974521 (https://phabricator.wikimedia.org/T350430) (owner: 10Clément Goubert)
[14:46:45] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 141626
[14:46:56] * urbanecm done
[14:47:12] <abijeet>	 kart_, thanks
[14:48:16] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] decommission: db1136 [puppet] - 10https://gerrit.wikimedia.org/r/974634 (https://phabricator.wikimedia.org/T351065) (owner: 10Arnaudb)
[14:48:54] <claime>	 !log Redeploying mw-on-k8s for T350430
[14:48:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:58] <stashbot>	 T350430: php-fpm logs from Kubernetes lack 'message' and 'normalized_message' - https://phabricator.wikimedia.org/T350430
[14:49:00] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:49:13] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:49:22] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1136.eqiad.wmnet
[14:49:40] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:49:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[14:49:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on puppetdb1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:50:11] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[14:50:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[14:50:46] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[14:50:47] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[14:51:21] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[14:51:23] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[14:51:33] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:51:46] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[14:51:47] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[14:52:21] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[14:52:22] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[14:52:59] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on kubemaster2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:52:59] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on ganeti1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:53:23] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[14:53:24] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[14:53:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[14:53:57] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:03] <jinxer-wm>	 (PuppetFailure) firing: (2) Puppet has failed on kubernetes2031:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:54:07] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on ganeti2028:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:54:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Send recovery emails to data-engineering-alerts [puppet] - 10https://gerrit.wikimedia.org/r/974652 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis)
[14:54:38] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox
[14:54:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on bast1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:55:43] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[14:55:43] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync
[14:55:49] <wikibugs>	 (03PS1) 10Btullis: Set a non-default mapreduce file committer algorithm for spark [puppet] - 10https://gerrit.wikimedia.org/r/975006 (https://phabricator.wikimedia.org/T351388)
[14:55:54] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[14:55:54] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[14:56:01] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync
[14:56:01] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync
[14:56:16] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[14:56:16] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync
[14:56:17] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply
[14:56:35] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply
[14:56:36] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply
[14:56:37] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1136.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001"
[14:57:02] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply
[14:57:03] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[14:57:20] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[14:57:22] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[14:57:42] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1136.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001"
[14:57:42] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:57:43] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1136.eqiad.wmnet
[14:57:50] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[14:57:54] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'remove db1136', diff saved to https://phabricator.wikimedia.org/P53519 and previous config saved to /var/cache/conftool/dbconfig/20231116-145754-arnaudb.json
[14:58:57] <wikibugs>	 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission db1136.eqiad.wmnet - https://phabricator.wikimedia.org/T351065 (10ABran-WMF) 05In progress→03Open a:05ABran-WMF→03None
[14:59:17] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/515/co" [puppet] - 10https://gerrit.wikimedia.org/r/975006 (https://phabricator.wikimedia.org/T351388) (owner: 10Btullis)
[14:59:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on puppetdb1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:01:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.wikimedia.org with OS bullseye
[15:01:13] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:30] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.wikimedia.org with OS bullseye
[15:03:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.wikimedia.org with OS bullseye
[15:03:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on ganeti2028:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:04:03] <jinxer-wm>	 (PuppetFailure) resolved: (2) Puppet has failed on kubernetes2031:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:04:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on bast1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:05:49] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:07:19] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:07:59] <jinxer-wm>	 (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on kubemaster2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[15:07:59] <jinxer-wm>	 (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on ganeti1012:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[15:11:39] <wikibugs>	 (03PS1) 10Bking: Revert "elastic relforge: update logstash transport" [puppet] - 10https://gerrit.wikimedia.org/r/974245
[15:12:33] <wikibugs>	 (03CR) 10Bking: [C: 03+2] Revert "elastic relforge: update logstash transport" [puppet] - 10https://gerrit.wikimedia.org/r/974245 (owner: 10Bking)
[15:15:14] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host sretest1004.eqiad.wmnet
[15:16:58] <wikibugs>	 (03CR) 10DCausse: query_service: add monitoring for ldf endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[15:17:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: revert logstash changes - bking@cumin2002 - T324335
[15:17:13] <stashbot>	 T324335: Remove logstash from the Search Elasticsearch servers - https://phabricator.wikimedia.org/T324335
[15:18:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1008.wikimedia.org with reason: host reimage
[15:21:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1008.wikimedia.org with reason: host reimage
[15:21:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: revert logstash changes - bking@cumin2002 - T324335
[15:22:03] <wikibugs>	 (03PS1) 10Ssingh: conftool: introduce schema and host file for dnsboxes [puppet] - 10https://gerrit.wikimedia.org/r/975009 (https://phabricator.wikimedia.org/T347054)
[15:22:47] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hosts.reimage for host an-druid1002.eqiad.wmnet with OS bullseye
[15:23:32] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/517/console" [puppet] - 10https://gerrit.wikimedia.org/r/975009 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[15:23:39] <wikibugs>	 (03PS1) 10Cwhite: logstash: beta-logs to use current w3creportingapi template [puppet] - 10https://gerrit.wikimedia.org/r/974635 (https://phabricator.wikimedia.org/T350786)
[15:25:32] <wikibugs>	 (03PS6) 10Brouberol: Generate subnet DHCP configuration [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059)
[15:26:21] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: beta-logs to use current w3creportingapi template [puppet] - 10https://gerrit.wikimedia.org/r/974635 (https://phabricator.wikimedia.org/T350786) (owner: 10Cwhite)
[15:26:22] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1141.eqiad.wmnet onto db1241.eqiad.wmnet
[15:26:46] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/518/con" [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[15:28:12] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "I think we might need to iterate on this a bit but at least this is ready for an initial review. I was a bit unsure of the schema but I sp" [puppet] - 10https://gerrit.wikimedia.org/r/975009 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[15:34:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on kubernetes2008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:35:52] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1002.eqiad.wmnet with reason: host reimage
[15:36:22] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1008.wikimedia.org with OS bullseye
[15:37:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['aqs1012']
[15:38:01] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['aqs1012']
[15:38:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['aqs1012']
[15:38:47] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1002.eqiad.wmnet with reason: host reimage
[15:44:33] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] services: bump cpu limits and Docker images for cp instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/974476 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[15:44:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on kubernetes2008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:51:59] <wikibugs>	 (03PS3) 10D3r1ck01: wmf-config: Remove StatsCacheType (unused) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004)
[15:54:33] <wikibugs>	 (03CR) 10D3r1ck01: "This is pretty much ready and I can go ahead and deploy but I'll like a signal (+1) from either Krinkle or Effie 😊, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01)
[15:54:39] <wikibugs>	 (03PS1) 10FNegri: wmcs-cold-migrate: quick fixes [puppet] - 10https://gerrit.wikimedia.org/r/975012
[15:55:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: kafka::logging
[15:55:51] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1002.eqiad.wmnet with OS bullseye
[15:56:42] <wikibugs>	 (03PS2) 10FNegri: wmcs-cold-migrate: quick fixes [puppet] - 10https://gerrit.wikimedia.org/r/975012
[15:56:48] <wikibugs>	 (03PS3) 10Bking: query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355)
[15:57:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch kafka::logging to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975015 (https://phabricator.wikimedia.org/T349619)
[15:57:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[15:57:39] <wikibugs>	 (03CR) 10Bking: query_service: add monitoring for ldf endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[15:57:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch kafka::logging to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975015 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[15:58:57] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:59:37] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] Generate subnet DHCP configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[16:00:18] <wikibugs>	 (03PS4) 10Bking: query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355)
[16:01:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[16:02:06] <wikibugs>	 (03PS7) 10Brouberol: Generate subnet DHCP configuration [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059)
[16:03:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: kafka::logging
[16:03:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4037.ulsfo.wmnet
[16:04:35] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch cp4037 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975016 (https://phabricator.wikimedia.org/T349619)
[16:05:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch cp4037 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975016 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[16:08:32] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/519/con" [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[16:08:32] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['aqs1012']
[16:09:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4037.ulsfo.wmnet
[16:11:16] <wikibugs>	 (03CR) 10Brouberol: Generate subnet DHCP configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974998 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[16:14:04] <wikibugs>	 (03PS1) 10Hnowlan: wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796)
[16:15:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[16:17:04] <sukhe>	 !log depool cp4037 for reboot [post puppet 7 upgrade]
[16:17:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:37] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host acmechief1001.eqiad.wmnet with OS bookworm
[16:18:00] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4037.ulsfo.wmnet
[16:18:35] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1108.eqiad.wmnet
[16:18:36] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1108.eqiad.wmnet
[16:19:14] <wikibugs>	 (03PS2) 10Hnowlan: wmnet: add mw-jobrunner discovery record [dns] - 10https://gerrit.wikimedia.org/r/975020 (https://phabricator.wikimedia.org/T349796)
[16:20:27] <fabfur>	 !log swapped cp1108 <-> cp1083 (T349244)
[16:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:34] <stashbot>	 T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244
[16:21:26] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1010.wikimedia.org with OS bullseye
[16:21:47] <wikibugs>	 (03CR) 10Effie Mouzeli: mc: Make it possible to use mcrouter server set by environment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01)
[16:21:49] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1009.wikimedia.org with OS bullseye
[16:23:03] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest1004.eqiad.wmnet
[16:24:27] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1109.eqiad.wmnet
[16:24:28] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1109.eqiad.wmnet
[16:26:06] <fabfur>	 !log swapped cp1109 <-> cp1084 (T349244)
[16:26:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:11] <stashbot>	 T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244
[16:26:48] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on acmechief1001.eqiad.wmnet with reason: host reimage
[16:26:50] <wikibugs>	 (03PS4) 10Hnowlan: changeprop: add config support for migration to k8s jobrunners [deployment-charts] - 10https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796)
[16:26:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur)
[16:27:17] <wikibugs>	 10SRE, 10ops-eqiad: aqs1012: reseat SSD (/dev/sdh)? - https://phabricator.wikimedia.org/T351320 (10Jclark-ctr) {F41511225} Reseated hard drives. update idrac and bios firmware
[16:27:34] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4037.ulsfo.wmnet
[16:30:07] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on 6 hosts with reason: Extending downtime for depooled cp hosts
[16:30:23] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on 6 hosts with reason: Extending downtime for depooled cp hosts
[16:30:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=154babc2-d86e-4f5b-baf5-fb36e9d129e4) set by fabfur@cumin1001 for 14 days, 0:00:00 on 6 host(s) and thei...
[16:31:21] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief1001.eqiad.wmnet with reason: host reimage
[16:33:08] <icinga-wm>	 PROBLEM - Check systemd state on acmechief2001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief-certs-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:37:11] <sukhe>	 !log repool cp4037
[16:37:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:53] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[16:44:35] <wikibugs>	 (03PS1) 10BCornwall: acme_chief: Remove acmechief1001 passive host [puppet] - 10https://gerrit.wikimedia.org/r/975024
[16:45:01] <wikibugs>	 (03PS2) 10BCornwall: acme_chief: Remove acmechief1001 passive host [puppet] - 10https://gerrit.wikimedia.org/r/975024
[16:45:44] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "missing Bug: header on the commit msg, LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/975024 (owner: 10BCornwall)
[16:45:44] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10Sfaci)
[16:46:10] <wikibugs>	 (03PS3) 10BCornwall: acme_chief: Remove acmechief1001 passive host [puppet] - 10https://gerrit.wikimedia.org/r/975024 (https://phabricator.wikimedia.org/T342154)
[16:47:21] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] services: bump cpu limits and Docker images for cp instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/974476 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[16:48:02] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10WDoranWMF) Approved as @Sfaci 's manager
[16:48:34] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I would assume you'd also need to:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[16:49:21] <logmsgbot>	 !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host acmechief1001.eqiad.wmnet with OS bookworm
[16:49:23] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/521/con" [puppet] - 10https://gerrit.wikimedia.org/r/975024 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[16:50:25] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync
[16:50:42] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync
[16:51:16] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
[16:52:18] <icinga-wm>	 RECOVERY - Check systemd state on acmechief2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:52:40] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
[16:53:37] <wikibugs>	 10SRE, 10ops-eqiad: aqs1012: reseat SSD (/dev/sdh)? - https://phabricator.wikimedia.org/T351320 (10Eevans) >>! In T351320#9338030, @Jclark-ctr wrote: > {F41511225} Reseated hard drives. update idrac and bios firmware   I confirmed this to be the case before proceeding, but after restarting via the reimage cook...
[16:58:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.wikimedia.org with OS bullseye
[17:00:05] <jouncebot>	 jbond and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1700)
[17:00:05] <jouncebot>	 urbanecm: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:22] <brett>	 !log Disabling puppet on all acme-chief clients for acme-chief bookworm upgrades - T342154
[17:00:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:29] <urbanecm>	 Here!
[17:00:42] <stashbot>	 T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154
[17:01:10] <rzl>	 urbanecm: hey! one sec
[17:01:27] <urbanecm>	 Sure
[17:02:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53522 and previous config saved to /var/cache/conftool/dbconfig/20231116-170241-arnaudb.json
[17:03:09] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[17:04:00] <wikibugs>	 (03PS1) 10BCornwall: acme-chief: Set acmechief1001 as active [puppet] - 10https://gerrit.wikimedia.org/r/975046 (https://phabricator.wikimedia.org/T342154)
[17:04:44] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:04:55] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/975046 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[17:05:29] <rzl>	 urbanecm: lgtm, need a manual run?
[17:05:59] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] mediawiki: Add missing frequency param to the purge_temporary_accounts job [puppet] - 10https://gerrit.wikimedia.org/r/974726 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm)
[17:06:06] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/522/con" [puppet] - 10https://gerrit.wikimedia.org/r/975046 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[17:06:16] <icinga-wm>	 PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief2001 is CRITICAL: PROCS CRITICAL: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief
[17:06:29] <vgutierrez>	 ^^ expected
[17:06:56] <urbanecm>	 rzl: thanks! Not needed, the related feature is currently only enabled on beta, and i can run it myself there :)). Thanks for the +2!
[17:07:05] <rzl>	 👍
[17:07:22] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[17:07:31] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[17:08:02] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for acmechief1001.eqiad.wmnet
[17:08:03] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief1001.eqiad.wmnet
[17:08:33] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Send recovery emails to data-engineering-alerts [puppet] - 10https://gerrit.wikimedia.org/r/974652 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis)
[17:08:52] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] acme-chief: Set acmechief1001 as active [puppet] - 10https://gerrit.wikimedia.org/r/975046 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[17:12:14] <wikibugs>	 10SRE, 10ops-esams, 10DC-Ops: Audit future knams power usage - https://phabricator.wikimedia.org/T331358 (10RobH) 05Open→03Invalid Never followed through on updated estimates since drmrs gave us a very clear indicator on what our esams racks would use (identical circuits and hardware) so this wasn't needed.
[17:12:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1009.wikimedia.org with reason: host reimage
[17:13:02] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudelastic1009.wikimedia.org with reason: host reimage
[17:13:51] <wikibugs>	 (03PS5) 10Bking: query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355)
[17:14:24] <wikibugs>	 (03PS1) 10BCornwall: acme-chief: Switch acmechief_host to acmechief1001 [puppet] - 10https://gerrit.wikimedia.org/r/975047 (https://phabricator.wikimedia.org/T342154)
[17:14:48] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] acme-chief: Switch acmechief_host to acmechief1001 [puppet] - 10https://gerrit.wikimedia.org/r/975047 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[17:16:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.wikimedia.org with OS bullseye
[17:17:04] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[17:17:40] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/975046 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[17:17:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P53523 and previous config saved to /var/cache/conftool/dbconfig/20231116-171748-arnaudb.json
[17:18:16] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudelastic1010.wikimedia.org with OS bullseye
[17:19:29] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] acme-chief: Switch acmechief_host to acmechief1001 [puppet] - 10https://gerrit.wikimedia.org/r/975047 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[17:19:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.wikimedia.org with OS bullseye
[17:20:06] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudelastic1010.wikimedia.org with OS bullseye
[17:21:19] <wikibugs>	 (03CR) 10Bking: [C: 03+2] Revert "staging-eqiad: raise rdf-streaming-updater quota" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725 (owner: 10Bking)
[17:21:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "staging-eqiad: raise rdf-streaming-updater quota" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725 (owner: 10Bking)
[17:22:52] <icinga-wm>	 RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief2001 is OK: PROCS OK: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief
[17:23:57] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.wikimedia.org with OS bullseye
[17:25:28] <icinga-wm>	 PROBLEM - Check systemd state on acmechief2001 is CRITICAL: CRITICAL - degraded: The following units failed: reload-acme-chief-backend.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:26:04] <vgutierrez>	 that unit shouldn't be there anymore? :)
[17:26:15] <vgutierrez>	 or maybe icinga host needs a puppet agent run :)
[17:26:58] <vgutierrez>	 Loaded: not-found (Reason: Unit reload-acme-chief-backend.service not found.)
[17:27:00] <brett>	 !log Re-enabling puppet on all acme-chief clients post-bookworm upgrade - T342154
[17:27:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:09] <stashbot>	 T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154
[17:28:08] <icinga-wm>	 RECOVERY - Check systemd state on acmechief2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:28:23] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1009.wikimedia.org with OS bullseye
[17:29:27] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1012.eqiad.wmnet with OS bullseye
[17:30:09] <wikibugs>	 (03PS2) 10Bking: Revert "staging-eqiad: raise rdf-streaming-updater quota" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725
[17:30:41] <wikibugs>	 (03CR) 10Bking: Revert "staging-eqiad: raise rdf-streaming-updater quota" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725 (owner: 10Bking)
[17:32:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P53525 and previous config saved to /var/cache/conftool/dbconfig/20231116-173254-arnaudb.json
[17:33:19] <wikibugs>	 (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974984 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch)
[17:34:04] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[17:35:48] <icinga-wm>	 PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief-certs-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:41:53] <brett>	 ^forgot to arm keyholder. Hopefully that fixes this
[17:44:34] <icinga-wm>	 RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:44:37] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[17:47:15] <wikibugs>	 (03Abandoned) 10BCornwall: acme_chief: Remove acmechief1001 passive host [puppet] - 10https://gerrit.wikimedia.org/r/975024 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[17:48:01] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53526 and previous config saved to /var/cache/conftool/dbconfig/20231116-174800-arnaudb.json
[17:48:03] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[17:48:06] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[17:48:16] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[17:50:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/974971 (owner: 10Volans)
[17:54:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/974970 (owner: 10Volans)
[17:55:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/974972 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[17:55:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/974973 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[17:59:03] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Adjust homer templates to support anycast gw with single IP [homer/public] - 10https://gerrit.wikimedia.org/r/973267 (https://phabricator.wikimedia.org/T350579) (owner: 10Cathal Mooney)
[17:59:41] <wikibugs>	 (03Merged) 10jenkins-bot: Adjust homer templates to support anycast gw with single IP [homer/public] - 10https://gerrit.wikimedia.org/r/973267 (https://phabricator.wikimedia.org/T350579) (owner: 10Cathal Mooney)
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1800)
[18:06:18] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:14:55] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye
[18:15:07] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[18:19:51] <wikibugs>	 (03PS1) 10Btullis: Configure Matomo's TagManager to write to existing tmpdir [puppet] - 10https://gerrit.wikimedia.org/r/975058 (https://phabricator.wikimedia.org/T349910)
[18:20:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Configure Matomo's TagManager to write to existing tmpdir [puppet] - 10https://gerrit.wikimedia.org/r/975058 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis)
[18:21:14] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/524/con" [puppet] - 10https://gerrit.wikimedia.org/r/975058 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis)
[18:31:15] <wikibugs>	 (03PS1) 10DDesouza: Pre-deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975059 (https://phabricator.wikimedia.org/T351353)
[18:31:54] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] wmcs-cold-migrate: quick fixes [puppet] - 10https://gerrit.wikimedia.org/r/975012 (owner: 10FNegri)
[18:34:38] <icinga-wm>	 PROBLEM - Host ps1-e8-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[18:35:51] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs-cold-migrate: quick fixes [puppet] - 10https://gerrit.wikimedia.org/r/975012 (owner: 10FNegri)
[18:44:07] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1010.wikimedia.org with OS bullseye
[18:46:15] <wikibugs>	 (03PS1) 10Tchanders: ipoid: Disable the daily updates job and schedule an import [deployment-charts] - 10https://gerrit.wikimedia.org/r/975061 (https://phabricator.wikimedia.org/T351449)
[18:48:05] <wikibugs>	 (03PS2) 10Jbond: puppet: update gat_ca_server to also suport srv discovry [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995
[18:49:03] <wikibugs>	 (03CR) 10Jbond: "Let me know what you think of the general approach and if good I'll update the tests etc." [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (owner: 10Jbond)
[18:50:21] <wikibugs>	 (03CR) 10Bking: [C: 03+2] Revert "staging-eqiad: raise rdf-streaming-updater quota" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725 (owner: 10Bking)
[18:51:24] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[18:53:06] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "staging-eqiad: raise rdf-streaming-updater quota" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725 (owner: 10Bking)
[18:53:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Jclark-ctr)
[18:54:20] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  an-worker11 - jclark@cumin1001"
[18:55:09] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  an-worker11 - jclark@cumin1001"
[18:55:09] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:56:01] <wikibugs>	 (03PS1) 10Volans: sre.hosts.reimage: fix puppet version choice [cookbooks] - 10https://gerrit.wikimedia.org/r/975067
[18:56:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppet: update gat_ca_server to also suport srv discovry [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (owner: 10Jbond)
[18:56:22] <logmsgbot>	 !log bking@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[18:56:31] <logmsgbot>	 !log bking@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[18:56:38] <logmsgbot>	 !log bking@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[18:57:17] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Add BGP to protocols contributing to aggregates - https://phabricator.wikimedia.org/T351456 (10cmooney) p:05Triage→03Low
[18:59:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1157.mgmt.eqiad.wmnet with reboot policy FORCED
[18:59:40] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1158.mgmt.eqiad.wmnet with reboot policy FORCED
[18:59:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1159.mgmt.eqiad.wmnet with reboot policy FORCED
[18:59:56] <bawolff>	 Emperor: I wanted to ask you about your thoughts on increasing the max upload size from 4GB to 5GB (or failing that, allowing users to request such uploads on a case by case basis). I'm told you're the person to talk to. For background context the previous limit was due to storing file size as a 32 bit integer, which has now been changed so is no longer a limiting factor. I would appreciate 
[19:00:02] <bawolff>	 your thoughts on T191804.
[19:00:04] <logmsgbot>	 !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[19:00:04] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1160.mgmt.eqiad.wmnet with reboot policy FORCED
[19:00:06] <stashbot>	 T191804: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804
[19:00:06] <jouncebot>	 jeena and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T1900).
[19:00:13] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED
[19:00:26] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1162.mgmt.eqiad.wmnet with reboot policy FORCED
[19:00:32] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1161.mgmt.eqiad.wmnet with reboot policy FORCED
[19:00:49] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED
[19:01:54] <logmsgbot>	 !log bking@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[19:02:00] <logmsgbot>	 !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[19:02:33] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] sre.hosts.reimage: fix puppet version choice [cookbooks] - 10https://gerrit.wikimedia.org/r/975067 (owner: 10Volans)
[19:02:44] <wikibugs>	 (03PS1) 10Cathal Mooney: Add BGP to the contributing protocols for aggregate routes on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/975070 (https://phabricator.wikimedia.org/T351456)
[19:02:54] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975071 (https://phabricator.wikimedia.org/T350081)
[19:02:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975071 (https://phabricator.wikimedia.org/T350081) (owner: 10TrainBranchBot)
[19:03:18] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: fix puppet version choice [cookbooks] - 10https://gerrit.wikimedia.org/r/975067 (owner: 10Volans)
[19:03:41] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975071 (https://phabricator.wikimedia.org/T350081) (owner: 10TrainBranchBot)
[19:04:08] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye
[19:07:35] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reimage: fix puppet version choice [cookbooks] - 10https://gerrit.wikimedia.org/r/975067 (owner: 10Volans)
[19:07:37] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[19:08:21] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:08:21] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:08:59] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:09:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1163
[19:09:49] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1163
[19:09:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1164
[19:09:59] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1164
[19:10:23] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[19:10:37] <logmsgbot>	 !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.5  refs T350081
[19:10:45] <stashbot>	 T350081: 1.42.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T350081
[19:14:01] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED
[19:16:46] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED
[19:19:49] <wikibugs>	 (03PS1) 10Andrew Bogott: puppetserver: create a necessary parent dirs [puppet] - 10https://gerrit.wikimedia.org/r/975073
[19:22:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetserver: create a necessary parent dirs [puppet] - 10https://gerrit.wikimedia.org/r/975073 (owner: 10Andrew Bogott)
[19:22:55] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1012.eqiad.wmnet with reason: host reimage
[19:24:52] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1161.mgmt.eqiad.wmnet with reboot policy FORCED
[19:24:57] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1157.mgmt.eqiad.wmnet with reboot policy FORCED
[19:25:00] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1159.mgmt.eqiad.wmnet with reboot policy FORCED
[19:25:05] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1162.mgmt.eqiad.wmnet with reboot policy FORCED
[19:25:09] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1160.mgmt.eqiad.wmnet with reboot policy FORCED
[19:25:51] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1012.eqiad.wmnet with reason: host reimage
[19:26:00] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED
[19:26:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED
[19:26:49] <wikibugs>	 (03CR) 10Dzahn: puppetserver: create a necessary parent dirs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/975073 (owner: 10Andrew Bogott)
[19:27:25] <wikibugs>	 (03PS2) 10Andrew Bogott: puppetserver: create a necessary parent dirs [puppet] - 10https://gerrit.wikimedia.org/r/975073
[19:28:50] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1158.mgmt.eqiad.wmnet with reboot policy FORCED
[19:30:12] <jinxer-wm>	 (LVSHighRX) firing: Excessive RX traffic on lvs1019:9100 (eno1np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1019 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[19:31:09] <sukhe>	 hmm
[19:31:25] <wikibugs>	 (03CR) 10Andrew Bogott: puppetserver: create a necessary parent dirs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/975073 (owner: 10Andrew Bogott)
[19:32:19] <wikibugs>	 (03PS3) 10Andrew Bogott: puppetserver: create a necessary parent dirs [puppet] - 10https://gerrit.wikimedia.org/r/975073
[19:40:12] <jinxer-wm>	 (LVSHighRX) resolved: Excessive RX traffic on lvs1019:9100 (eno1np0) #page - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1019 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[19:41:40] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/528/con" [puppet] - 10https://gerrit.wikimedia.org/r/975073 (owner: 10Andrew Bogott)
[19:44:09] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[19:44:47] <wikibugs>	 (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/output/974285/526/netmon1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn)
[19:45:07] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] puppetserver: create a necessary parent dirs [puppet] - 10https://gerrit.wikimedia.org/r/975073 (owner: 10Andrew Bogott)
[19:46:24] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS for IPs in public1-b-codfw vlan - cmooney@cumin1001"
[19:46:25] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] puppetserver: create a necessary parent dirs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975073 (owner: 10Andrew Bogott)
[19:46:45] <wikibugs>	 (03PS7) 10Dzahn: wmflib: add function to return PHP version based on distro version [puppet] - 10https://gerrit.wikimedia.org/r/974285
[19:47:14] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS for IPs in public1-b-codfw vlan - cmooney@cumin1001"
[19:47:14] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:50:38] <wikibugs>	 (03PS2) 10Jcrespo: Transferer: Update logic for is_empty_dir() to avoid future bugs [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986
[19:51:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Transferer: Update logic for is_empty_dir() to avoid future bugs [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 (owner: 10Jcrespo)
[19:53:23] <wikibugs>	 (03PS1) 10Andrew Bogott: puppetserver: only create '/var/lib/puppet/server' when needed [puppet] - 10https://gerrit.wikimedia.org/r/975075
[19:53:27] <wikibugs>	 (03CR) 10Jcrespo: "I believe this works in my testing, but want a double check." [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 (owner: 10Jcrespo)
[19:54:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetserver: only create '/var/lib/puppet/server' when needed [puppet] - 10https://gerrit.wikimedia.org/r/975075 (owner: 10Andrew Bogott)
[19:54:25] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1012.eqiad.wmnet with OS bullseye
[19:54:36] <wikibugs>	 (03PS2) 10Andrew Bogott: puppetserver: only create '/var/lib/puppet/server' when needed [puppet] - 10https://gerrit.wikimedia.org/r/975075
[19:58:42] <wikibugs>	 (03PS4) 10Jcrespo: [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683
[19:58:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "thx" [puppet] - 10https://gerrit.wikimedia.org/r/975075 (owner: 10Andrew Bogott)
[19:58:58] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] puppetserver: only create '/var/lib/puppet/server' when needed [puppet] - 10https://gerrit.wikimedia.org/r/975075 (owner: 10Andrew Bogott)
[19:59:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683 (owner: 10Jcrespo)
[19:59:51] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:00:24] <wikibugs>	 (03PS8) 10Dzahn: wmflib: add function to return PHP version based on distro version [puppet] - 10https://gerrit.wikimedia.org/r/974285
[20:06:03] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/974285/530/" [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn)
[20:14:05] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2053 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:18:08] <wikibugs>	 (03CR) 10Jcrespo: "There is one thing missing, which is handling the new exceptions:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo)
[20:18:17] <wikibugs>	 (03PS4) 10Dzahn: planet: Update for rawdog v3 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: 10Legoktm)
[20:18:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] planet: Update for rawdog v3 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: 10Legoktm)
[20:23:20] <logmsgbot>	 !log dr0ptp4kt@deploy2002 Started deploy [airflow-dags/search@b00c6ca]: Deploying Airflow search WDQS graph split HDFS job
[20:23:47] <logmsgbot>	 !log dr0ptp4kt@deploy2002 Finished deploy [airflow-dags/search@b00c6ca]: Deploying Airflow search WDQS graph split HDFS job (duration: 00m 27s)
[20:27:53] <wikibugs>	 (03CR) 10AW GitLab Bot: "PAN-PAN: end-to-end deploy stage failed" [extensions/WikiLambda] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/974244 (https://phabricator.wikimedia.org/T351121) (owner: 10Jforrester)
[20:41:03] <wikibugs>	 (03PS1) 10JHathaway: puppetserver: remove log spam from user home dir sync [puppet] - 10https://gerrit.wikimedia.org/r/975080 (https://phabricator.wikimedia.org/T351465)
[20:41:20] <topranks>	 !log adding anycast GW for public1-b-codfw vlan to codfw spine switches (T347191)
[20:41:22] <wikibugs>	 10Puppet, 10MediaModeration (MediaModeration 2.0), 10Trust and Safety Product Sprint (Sprint Bodhrán): [S] Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (10kostajh)
[20:41:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:25] <stashbot>	 T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191
[20:41:42] <wikibugs>	 (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/975080 (https://phabricator.wikimedia.org/T351465) (owner: 10JHathaway)
[20:46:43] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:47:51] <topranks>	 ^^ this is due to me, BGP reset but came back up 
[20:48:46] <sukhe>	 thanks!
[20:49:57] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:50:58] <topranks>	 ^^ this is doh2002, investigating 
[20:51:17] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:52:41] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:52:43] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:53:11] <topranks>	 ^^ this required a manual reset on doh2002 for BFD I didn't expect 
[20:53:22] <topranks>	 clear on the CR side didn't resolve
[20:53:41] <topranks>	 proceeding to next step
[20:54:17] <topranks>	 !log changing VRRP GW IP for public1-b-codfw on codfw CRs and disabling IPv6 RAs on the CRs (T347191)
[20:54:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:21] <stashbot>	 T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191
[20:56:43] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:58:57] <icinga-wm>	 RECOVERY - Disk space on druid1010 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1010&var-datasource=eqiad+prometheus/ops
[20:59:19] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:59:27] <icinga-wm>	 PROBLEM - Host dns2004 is DOWN: PING CRITICAL - Packet loss = 100%
[20:59:39] <icinga-wm>	 PROBLEM - Host ldap-rw2001 is DOWN: PING CRITICAL - Packet loss = 100%
[20:59:43] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:59:49] <dr0ptp4kt>	 brennen: i might be interested to do the config deployment if that's okay with you
[21:00:07] <jouncebot>	 brennen and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231116T2100).
[21:00:07] <jouncebot>	 danisztls and James_F: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:17] <icinga-wm>	 RECOVERY - Host dns2004 is UP: PING WARNING - Packet loss = 90%, RTA = 33.21 ms
[21:00:18] <danisztls>	 o/
[21:00:19] <sukhe>	 topranks: need me to check anything?
[21:00:28] <James_F>	 o/
[21:00:29] <icinga-wm>	 PROBLEM - Host 208.80.153.48 is DOWN: PING CRITICAL - Packet loss = 100%
[21:00:37] <icinga-wm>	 RECOVERY - Host ldap-rw2001 is UP: PING OK - Packet loss = 0%, RTA = 33.13 ms
[21:00:39] <brennen>	 o/
[21:01:15] <icinga-wm>	 RECOVERY - Host 208.80.153.48 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms
[21:01:15] <TheresNoTime>	 (I'm not able to deploy this evening)
[21:01:23] <icinga-wm>	 RECOVERY - Disk space on druid1011 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1011&var-datasource=eqiad+prometheus/ops
[21:01:25] <brennen>	 dr0ptp4kt: sure. :)
[21:01:39] <icinga-wm>	 RECOVERY - Disk space on druid1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1009&var-datasource=eqiad+prometheus/ops
[21:02:29] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:03:09] <topranks>	 sukhe: thanks not right now 
[21:03:21] <jinxer-wm>	 (ProbeDown) firing: (5) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:03:39] <topranks>	 things seemed to be ok; I attempted rollback but it got worse, so pushed forward and all seems ok
[21:03:53] <sukhe>	 np! gl
[21:04:07] <icinga-wm>	 PROBLEM - LDAP -read-only server- on ldap-replica2005 is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[21:04:47] <jinxer-wm>	 (ProbeDown) firing: (5) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:05:11] <wikibugs>	 (03PS2) 10Dr0ptp4kt: Pre-deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975059 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza)
[21:06:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dr0ptp4kt@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975059 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza)
[21:07:02] <wikibugs>	 (03Merged) 10jenkins-bot: Pre-deploy Annual Plan Core Metrics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975059 (https://phabricator.wikimedia.org/T351353) (owner: 10DDesouza)
[21:07:16] <logmsgbot>	 !log dr0ptp4kt@deploy2002 Started scap: Backport for [[gerrit:975059|Pre-deploy Annual Plan Core Metrics survey (T351353)]]
[21:07:21] <stashbot>	 T351353: Deploy survey (Community Feedback on Core Metrics Reports) on Meta-Wiki - https://phabricator.wikimedia.org/T351353
[21:08:36] <logmsgbot>	 !log dr0ptp4kt@deploy2002 dr0ptp4kt and dani: Backport for [[gerrit:975059|Pre-deploy Annual Plan Core Metrics survey (T351353)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:09:22] <dr0ptp4kt>	 danisztls: please check
[21:10:06] <wikibugs>	 10ops-codfw, 10ops-esams, 10DC-Ops: ship MPC5E-40G10G-IRB from esams to codfw - https://phabricator.wikimedia.org/T351467 (10RobH) p:05Triage→03High
[21:12:05] <danisztls>	 dr0ptp4kt: looks good, I'm not able to fully test right now as the messages weren't created yet but coverage is set to 0
[21:12:11] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:12:26] <dr0ptp4kt>	 danisztls: okay, would you prefer we sync or rather abandon?
[21:12:39] <danisztls>	 dr0ptp4kt: sync
[21:12:42] <dr0ptp4kt>	 on it
[21:12:45] <logmsgbot>	 !log dr0ptp4kt@deploy2002 dr0ptp4kt and dani: Continuing with sync
[21:13:37] <mutante>	 @seen xqt
[21:17:29] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] ipoid: Disable the daily updates job and schedule an import [deployment-charts] - 10https://gerrit.wikimedia.org/r/975061 (https://phabricator.wikimedia.org/T351449) (owner: 10Tchanders)
[21:18:23] <wikibugs>	 (03PS1) 10Sohom Datta: Make the feed gracefully handle long snippets and urls [extensions/PageTriage] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975028 (https://phabricator.wikimedia.org/T347732)
[21:18:29] <logmsgbot>	 !log dr0ptp4kt@deploy2002 Finished scap: Backport for [[gerrit:975059|Pre-deploy Annual Plan Core Metrics survey (T351353)]] (duration: 11m 12s)
[21:18:33] <stashbot>	 T351353: Deploy survey (Community Feedback on Core Metrics Reports) on Meta-Wiki - https://phabricator.wikimedia.org/T351353
[21:18:49] <dr0ptp4kt>	 danisztls: sync'd
[21:19:41] <Sohom_Datta>	 Um sorry if I'm late to the deployment window, can https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/975028 be backported (it fixes a regression in the newer PageTriage UI)
[21:19:44] <danisztls>	 dr0ptp4kt: thanks!
[21:21:02] <wikibugs>	 (03PS1) 10RobH: update site.pp and partition info for new an-workers [puppet] - 10https://gerrit.wikimedia.org/r/975085 (https://phabricator.wikimedia.org/T349936)
[21:21:02] <Sohom_Datta>	 It's fine if the answer is a no, asking since this is the last deploy window before the weekend :)
[21:21:33] <wikibugs>	 (03CR) 10RobH: [C: 03+2] update site.pp and partition info for new an-workers [puppet] - 10https://gerrit.wikimedia.org/r/975085 (https://phabricator.wikimedia.org/T349936) (owner: 10RobH)
[21:21:39] <thcipriani>	 Sohom_Datta: looking now
[21:21:59] <NovemLinguae>	 I came here to ask the same thing. Sohom is way ahead of me :)
[21:23:24] <TheresNoTime>	 (might be able to take a look in a little bit if someone else doesn't)
[21:23:33] <dr0ptp4kt>	 Looks okay Sohom_Datta would you please add it to https://wikitech.wikimedia.org/wiki/Deployments#Thursday,_November_16 and let us know once added?
[21:24:42] <thcipriani>	 Sohom_Datta: NovemLinguae we've got one in the queue in front of you, we'll get it out after James_F 's patch :)
[21:25:12] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1163.mgmt.eqiad.wmnet with reboot policy FORCED
[21:25:13] <NovemLinguae>	 ty :)
[21:25:19] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1164.mgmt.eqiad.wmnet with reboot policy FORCED
[21:26:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dr0ptp4kt@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/974244 (https://phabricator.wikimedia.org/T351121) (owner: 10Jforrester)
[21:26:34] <James_F>	 thcipriani: <3
[21:26:43] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.32.145:9042 on aqs1012 is CRITICAL: connect to address 10.64.32.145 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[21:26:49] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.32.145:7000 on aqs1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:26:55] <icinga-wm>	 PROBLEM - cassandra-b service on aqs1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:27:09] <Sohom_Datta>	 dr0ptp4kt: Added :)
[21:27:16] <Sohom_Datta>	 Thank you :)
[21:27:23] <dr0ptp4kt>	 Sohom_Datta: thanks, will take a look
[21:29:14] <wikibugs>	 (03PS5) 10Dzahn: planet: Update for rawdog v3 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: 10Legoktm)
[21:30:21] <icinga-wm>	 RECOVERY - LDAP -read-only server- on ldap-replica2005 is OK: LDAP OK - 0.107 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[21:30:31] <wikibugs>	 (03Merged) 10jenkins-bot: Conditionally render the content of header-action instead of the slot [extensions/WikiLambda] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/974244 (https://phabricator.wikimedia.org/T351121) (owner: 10Jforrester)
[21:30:43] <logmsgbot>	 !log dr0ptp4kt@deploy2002 Started scap: Backport for [[gerrit:974244|Conditionally render the content of header-action instead of the slot (T351121)]]
[21:30:50] <stashbot>	 T351121: Button to run implementations and testers is gone - https://phabricator.wikimedia.org/T351121
[21:31:37] <wikibugs>	 (03CR) 10Dr0ptp4kt: [C: 03+2] "Preparing for backport window" [extensions/PageTriage] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975028 (https://phabricator.wikimedia.org/T347732) (owner: 10Sohom Datta)
[21:31:59] <logmsgbot>	 !log dr0ptp4kt@deploy2002 dr0ptp4kt and jforrester: Backport for [[gerrit:974244|Conditionally render the content of header-action instead of the slot (T351121)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:32:13] <dr0ptp4kt>	 James_F: would you please have a look and let me know when it's good to sync?
[21:32:25] <James_F>	 dr0ptp4kt: Yup, all looks good!
[21:32:31] <dr0ptp4kt>	 On it
[21:32:34] <logmsgbot>	 !log dr0ptp4kt@deploy2002 dr0ptp4kt and jforrester: Continuing with sync
[21:32:36] <James_F>	 (Almost like I had the page ready in debug to test.)
[21:32:39] <James_F>	 Thank you!
[21:33:45] <thcipriani>	 in case you *weren't* already hovering over the refresh button :)
[21:38:20] <logmsgbot>	 !log dr0ptp4kt@deploy2002 Finished scap: Backport for [[gerrit:974244|Conditionally render the content of header-action instead of the slot (T351121)]] (duration: 07m 36s)
[21:38:27] <James_F>	 Thank you again.
[21:38:36] <dr0ptp4kt>	 Thank you, as always.
[21:38:36] <stashbot>	 T351121: Button to run implementations and testers is gone - https://phabricator.wikimedia.org/T351121
[21:40:47] <dr0ptp4kt>	 Sohom_Datta: still going through gate and submit...
[21:42:59] <topranks>	 !log Removing VRRP config for public1-b-codfw on codfw CRs (T347191)
[21:43:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:04] <stashbot>	 T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191
[21:44:17] <icinga-wm>	 PROBLEM - Host doh2002 is DOWN: PING CRITICAL - Packet loss = 100%
[21:44:19] <icinga-wm>	 PROBLEM - Host dns2004 is DOWN: PING CRITICAL - Packet loss = 100%
[21:44:47] <icinga-wm>	 PROBLEM - Host serpens is DOWN: PING CRITICAL - Packet loss = 100%
[21:45:07] <icinga-wm>	 RECOVERY - Host doh2002 is UP: PING WARNING - Packet loss = 77%, RTA = 33.35 ms
[21:46:09] <icinga-wm>	 RECOVERY - Host dns2004 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms
[21:46:19] <jinxer-wm>	 (ProbeDown) firing: Service contint2002:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#contint2002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:46:39] <brett>	 sukhe: You doing something fun with doh?
[21:47:03] <topranks>	 that was me sry, "cleaning up" after previous work seems I'd left the VIP on the CRs
[21:47:21] <topranks>	 they'll clear shortly, reverted immediately 
[21:47:26] <brett>	 thanks
[21:47:41] <icinga-wm>	 RECOVERY - Host serpens is UP: PING OK - Packet loss = 0%, RTA = 33.41 ms
[21:48:39] <wikibugs>	 (03Merged) 10jenkins-bot: Make the feed gracefully handle long snippets and urls [extensions/PageTriage] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975028 (https://phabricator.wikimedia.org/T347732) (owner: 10Sohom Datta)
[21:49:02] <thcipriani>	 merged \o/
[21:49:10] <Sohom_Datta>	 \o/
[21:50:03] <icinga-wm>	 PROBLEM - Host dns2004 is DOWN: PING CRITICAL - Packet loss = 100%
[21:50:21] <icinga-wm>	 PROBLEM - Host serpens is DOWN: PING CRITICAL - Packet loss = 100%
[21:50:23] <topranks>	 sry...
[21:50:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Jclark-ctr)
[21:50:26] <topranks>	 on it 
[21:50:37] <logmsgbot>	 !log dr0ptp4kt@deploy2002 Started scap: Backport for [[gerrit:975028|Make the feed gracefully handle long snippets and urls (T347732 T351463)]]
[21:50:41] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:50:43] <stashbot>	 T347732: Mock up a 100% Codex front end for PageTriage - https://phabricator.wikimedia.org/T347732
[21:50:43] <stashbot>	 T351463: mwe-vue-pt-snippet is way too narrow - https://phabricator.wikimedia.org/T351463
[21:51:17] <icinga-wm>	 RECOVERY - Host dns2004 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms
[21:51:19] <jinxer-wm>	 (ProbeDown) resolved: Service contint2002:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#contint2002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:51:50] <logmsgbot>	 !log dr0ptp4kt@deploy2002 dr0ptp4kt and soda: Backport for [[gerrit:975028|Make the feed gracefully handle long snippets and urls (T347732 T351463)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:52:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1157']
[21:52:12] <dr0ptp4kt>	 Sohom_Datta: would you please check and advise if okay to commence with sync?
[21:52:24] <Sohom_Datta>	 On it
[21:52:31] <icinga-wm>	 RECOVERY - Host serpens is UP: PING WARNING - Packet loss = 77%, RTA = 33.49 ms
[21:52:48] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1157']
[21:53:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1157']
[21:53:14] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-worker1157']
[21:53:21] <jinxer-wm>	 (ProbeDown) firing: (5) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:53:24] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1157']
[21:53:27] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:53:29] <icinga-wm>	 PROBLEM - Host doh2002 is DOWN: PING CRITICAL - Packet loss = 100%
[21:53:53] <icinga-wm>	 PROBLEM - Host dns2004 is DOWN: PING CRITICAL - Packet loss = 100%
[21:54:11] <brett>	 topranks: This you too?
[21:54:22] <Sohom_Datta>	 Looks good to me :)
[21:54:34] <Sohom_Datta>	 Thanks a lot for doing this on such a short notice :)
[21:54:50] <dr0ptp4kt>	 thx Sohom_Datta, will begin sync in a few secs
[21:54:53] <logmsgbot>	 !log dr0ptp4kt@deploy2002 dr0ptp4kt and soda: Continuing with sync
[21:54:55] <topranks>	 brett: yeah certainly although looks up to me 
[21:55:00] <jinxer-wm>	 (ProbeDown) firing: (5) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:55:19] <icinga-wm>	 RECOVERY - Host doh2002 is UP: PING OK - Packet loss = 0%, RTA = 33.49 ms
[21:55:51] <icinga-wm>	 PROBLEM - SSH on contint2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:56:11] <icinga-wm>	 RECOVERY - Host dns2004 is UP: PING OK - Packet loss = 0%, RTA = 33.24 ms
[21:57:01] <icinga-wm>	 RECOVERY - SSH on contint2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:57:59] <icinga-wm>	 PROBLEM - Check systemd state on puppetserver2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:59:10] <brett>	 :O
[21:59:35] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1157']
[22:00:27] <logmsgbot>	 !log dr0ptp4kt@deploy2002 Finished scap: Backport for [[gerrit:975028|Make the feed gracefully handle long snippets and urls (T347732 T351463)]] (duration: 09m 50s)
[22:00:37] <stashbot>	 T347732: Mock up a 100% Codex front end for PageTriage - https://phabricator.wikimedia.org/T347732
[22:00:38] <stashbot>	 T351463: mwe-vue-pt-snippet is way too narrow - https://phabricator.wikimedia.org/T351463
[22:00:41] <dr0ptp4kt>	 thx Sohom_Datta, sync done
[22:02:04] <topranks>	 brett: the puppetserver2002 alert is not related to anything I'm working on, only codfw row B public vlan is what I'm at
[22:02:32] <Sohom_Datta>	 Can confirm that it works on my end after clearing the browser cache :)
[22:03:14] <brett>	 ack
[22:04:00] <brett>	 Hm, but puppet2002 doesn't have sync-puppet-volatile.service?
[22:04:15] <thcipriani>	 jouncebot: now
[22:04:16] <jouncebot>	 No deployments scheduled for the next 8 hour(s) and 55 minute(s)
[22:05:41] <topranks>	 brett: I need to try to flip this gw again, it may trigger BFD/BGP alerts on the dns/doh hosts, but right now things are inconsistent, and we can't leave them that way.
[22:05:44] <brett>	 oh, puppetserver
[22:05:50] <brett>	 cool, thanks for the heads up
[22:06:05] <topranks>	 those hosts are backed up in terms of function, so in terms of services we should be good
[22:06:16] <topranks>	 I'd rather not downtime as the alerts may be useful - sorry for the noise
[22:07:09] <brett>	 sync-puppet-volatile.service is all good. Temporary dns resolution failure
[22:07:41] <icinga-wm>	 RECOVERY - Check systemd state on puppetserver2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:09:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1157.eqiad.wmnet with OS bullseye
[22:09:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye
[22:21:09] <wikibugs>	 (PS1) Andrew Bogott: puppetserver: '/srv/puppet_code/environments' owned by puppet/puppet [puppet] - https://gerrit.wikimedia.org/r/975089 (https://phabricator.wikimedia.org/T351468)
[22:21:39] <icinga-wm>	 PROBLEM - VRRP status on cr1-codfw is CRITICAL: VRRP CRITICAL - 1 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[22:22:22] <brett>	 Assuming this is related
[22:26:06] <topranks>	 brett: sorry, yeah it's giving out cos I removed the equivalent on cr2, I'm removing cr1 now so it should resolve shortly
[22:26:14] <brett>	 No prob!
[22:26:23] <topranks>	 I owe you a beer I think :)
[22:27:04] <brett>	 not at all!
[22:27:11] <icinga-wm>	 RECOVERY - VRRP status on cr1-codfw is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[22:27:51] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[22:28:42] <topranks>	 ^^ uncommitted dns is probably me, I'll run the cookbook (fairly sure I don't have to re-add those IPs; things look ok)
[22:29:03] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[22:30:11] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[22:30:41] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old vlan 1117 entries - cmooney@cumin1001"
[22:31:29] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-11-14 11:01:41 +0000 (expires in 1824 days) https://wikitech.wikimedia.org/wiki/Logs
[22:31:30] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old vlan 1117 entries - cmooney@cumin1001"
[22:31:30] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:34:08] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] wmflib: add function to return PHP version based on distro version [puppet] - https://gerrit.wikimedia.org/r/974285 (owner: Dzahn)
[22:36:04] <mutante>	 !log disabled puppet on miscweb*, netmon* and phab* hosts, deploying gerrit:974285, confirming noop 
[22:36:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:33] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[22:38:56] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[22:39:09] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[22:39:16] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53529 and previous config saved to /var/cache/conftool/dbconfig/20231116-223915-arnaudb.json
[22:39:20] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[22:40:03] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "confirmed noop on all miscweb*, netmon* and phab* prod machines. additionally compiled on a cloud VPS using simplelamp2 role." [puppet] - https://gerrit.wikimedia.org/r/974285 (owner: Dzahn)
[22:50:53] <wikibugs>	 (PS1) Dzahn: piwik: avoid hardcoded PHP version string [puppet] - https://gerrit.wikimedia.org/r/975093
[22:51:21] <wikibugs>	 (CR) CI reject: [V: -1] piwik: avoid hardcoded PHP version string [puppet] - https://gerrit.wikimedia.org/r/975093 (owner: Dzahn)
[22:52:58] <wikibugs>	 (PS1) Dzahn: simplelap: avoid hardcoded PHP version string [puppet] - https://gerrit.wikimedia.org/r/975094
[22:54:52] <wikibugs>	 (PS2) Dzahn: piwik: avoid hardcoded PHP version string [puppet] - https://gerrit.wikimedia.org/r/975093
[22:55:22] <wikibugs>	 (PS2) Dzahn: simplelap: avoid hardcoded PHP version string [puppet] - https://gerrit.wikimedia.org/r/975094
[23:08:21] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:10:09] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cr[1-2]-codfw,cr[1-2]-codfw IPv6 with reason: Move public1-a-codfw vlan GW from codfw CR routers to ssw
[23:10:25] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr[1-2]-codfw,cr[1-2]-codfw IPv6 with reason: Move public1-a-codfw vlan GW from codfw CR routers to ssw
[23:10:31] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c937612c-c0eb-4c9e-a245-9810a56c0a33) set by cmooney@cu...
[23:12:29] <wikibugs>	 (PS1) Jdlrobson: Disable drawer temporarily while erroring [mediawiki-config] - https://gerrit.wikimedia.org/r/975097 (https://phabricator.wikimedia.org/T351362)
[23:13:04] <wikibugs>	 (PS2) Jdlrobson: Disable drawer temporarily while erroring [mediawiki-config] - https://gerrit.wikimedia.org/r/975097 (https://phabricator.wikimedia.org/T351362)
[23:13:38] <Jdlrobson>	 jeena: could we backport the above patch to get the error rate back down to normal?
[23:13:57] <Jdlrobson>	 it disables the codepath that is erroring (which is broken anyway :-))
[23:21:13] <TheresNoTime>	 Jdlrobson: need a deployer?
[23:25:10] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[23:27:13] <Jdlrobson>	 TheresNoTime: if you could!
[23:27:14] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old vlan 2001 entries - cmooney@cumin1001"
[23:27:19] <TheresNoTime>	 ack!
[23:27:22] <Jdlrobson>	 Would be nice to go into the weekend without lots of email alerts :)
[23:27:27] <TheresNoTime>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/975097/ correct?
[23:27:33] <Jdlrobson>	 correct
[23:27:44] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/975097 (https://phabricator.wikimedia.org/T351362) (owner: Jdlrobson)
[23:27:49] <Jdlrobson>	 What's the process for this? Do I need to log it on https://wikitech.wikimedia.org/wiki/Deployments somewhere?
[23:28:05] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old vlan 2001 entries - cmooney@cumin1001"
[23:28:05] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:28:26] <wikibugs>	 (Merged) jenkins-bot: Disable drawer temporarily while erroring [mediawiki-config] - https://gerrit.wikimedia.org/r/975097 (https://phabricator.wikimedia.org/T351362) (owner: Jdlrobson)
[23:28:39] <TheresNoTime>	 Jdlrobson: I'll log it in the SAL
[23:28:43] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:975097|Disable drawer temporarily while erroring (T351362)]]
[23:28:50] <stashbot>	 T351362: Regression: AMC Outreach campaign is not showing when mobile users click desktop link - https://phabricator.wikimedia.org/T351362
[23:28:51] <Jdlrobson>	 TheresNoTime: thx ! I can check this on stat1001 before you sync
[23:29:02] <Jdlrobson>	 debug2001 rather :)
[23:29:08] <Jdlrobson>	 in analytics mindset haha
[23:29:16] <wikibugs>	 (PS8) Krinkle: mc: Make it possible to use mcrouter server set by environment [mediawiki-config] - https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: D3r1ck01)
[23:29:34] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1157.eqiad.wmnet with OS bullseye
[23:29:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[23:29:58] <wikibugs>	 (CR) Krinkle: mc: Make it possible to use mcrouter server set by environment (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: D3r1ck01)
[23:29:59] <logmsgbot>	 !log samtar@deploy2002 jdlrobson and samtar: Backport for [[gerrit:975097|Disable drawer temporarily while erroring (T351362)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:30:08] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1157.eqiad.wmnet with OS bullseye
[23:30:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye
[23:30:21] <TheresNoTime>	 Jdlrobson: ready on mwdebug
[23:30:26] <topranks>	 !log Add gateway IP for public1-a-codfw Vlan to ssw in codfw T347191
[23:30:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:44] <stashbot>	 T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191
[23:31:03] <Jdlrobson>	 thanks looking
[23:33:22] <topranks>	 !log Change VRRP IP for public1-a-codfw vlan on codfw CRs T347191
[23:33:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:29] <jeena>	 sorry Jdlrobson, I stepped away for a moment. Thanks TheresNoTime 
[23:33:37] <TheresNoTime>	 np!
[23:33:39] <Jdlrobson>	 TheresNoTime: oh no.. it doesn't look like this fully solves the issue like I hoped. :(
[23:33:46] <Jdlrobson>	 So I guess there's no point in syncing it
[23:33:57] <TheresNoTime>	 Jdlrobson: ack :(
[23:34:01] <logmsgbot>	 !log samtar@deploy2002 Sync cancelled.
[23:34:31] <wikibugs>	 (PS1) Samtar: Revert "Disable drawer temporarily while erroring" [mediawiki-config] - https://gerrit.wikimedia.org/r/975029
[23:34:34] <Jdlrobson>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/975096 should also fix it but it hasn't been reviewed yet
[23:34:39] <Jdlrobson>	 so I am not sure what protocol is for that.
[23:34:51] <Jdlrobson>	 it's pretty simple: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/975096/1/src/mobile.startup/mobile.startup.js what do you think TheresNoTime?
[23:35:01] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/975029 (owner: Samtar)
[23:35:23] <TheresNoTime>	 Jdlrobson: I'll take a look
[23:35:28] <jeena>	 It looks pretty simple
[23:35:40] <Jdlrobson>	 I assume it's cheap to backport it, test it?
[23:35:44] <wikibugs>	 (Merged) jenkins-bot: Revert "Disable drawer temporarily while erroring" [mediawiki-config] - https://gerrit.wikimedia.org/r/975029 (owner: Samtar)
[23:35:49] <Jdlrobson>	 and unbackport it if it doesn't work?
[23:35:55] <jeena>	 seems like it to me
[23:35:59] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:975029|Revert "Disable drawer temporarily while erroring"]]
[23:36:07] <TheresNoTime>	 Ideally someone would +1 it first
[23:36:27] <icinga-wm>	 PROBLEM - Host doh2001 is DOWN: PING CRITICAL - Packet loss = 100%
[23:37:03] <Jdlrobson>	 TheresNoTime: i'll see if I can get someone in the team to vouch for it. It's near the end of the day though so am not sure who is still around (I'm the furthest west).
[23:37:16] <logmsgbot>	 !log samtar@deploy2002 samtar: Backport for [[gerrit:975029|Revert "Disable drawer temporarily while erroring"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:37:16] <Jdlrobson>	 I can get one for tomorrow if we're okay with a backport tomorrow?
[23:37:33] <logmsgbot>	 !log samtar@deploy2002 samtar: Continuing with sync
[23:37:35] <icinga-wm>	 RECOVERY - Host doh2001 is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms
[23:37:35] <jeena>	 let me check
[23:38:24] <jeena>	 We can do it tomorrow if needed
[23:39:27] <TheresNoTime>	 just syncing that revert (not entirely sure if I needed to, but *shrug*)
[23:40:43] <Jdlrobson>	 thanks TheresNoTime and sorry for the run around
[23:40:55] <TheresNoTime>	 not a problem at all! :)
[23:43:31] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:975029|Revert "Disable drawer temporarily while erroring"]] (duration: 07m 31s)
[23:43:53] <Jdlrobson>	 Okay jeena i'll ping you tomorrow since I can't seem to find a reviewer from my team
[23:44:00] <jeena>	 okay
[23:44:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:46:26] <TheresNoTime>	 hm
[23:46:35] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1158']
[23:49:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 46.15% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:51:49] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1158.eqiad.wmnet with OS bullseye
[23:51:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye
[23:52:02] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1158']
[23:58:31] <icinga-wm>	 PROBLEM - Host doh2001 is DOWN: PING CRITICAL - Packet loss = 100%
[23:58:42] <topranks>	 ^^ just doing a test with this one
[23:59:37] <icinga-wm>	 RECOVERY - Host doh2001 is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms