[00:05:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:05:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10451817 (phaultfinder)
[00:24:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10451822 (phaultfinder)
[00:38:16] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1110459
[00:38:16] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1110459 (owner: TrainBranchBot)
[00:49:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10451871 (phaultfinder)
[00:53:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383213#10451876 (VRiley-WMF) Open→Resolved
[00:54:46] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1110459 (owner: TrainBranchBot)
[01:08:06] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1110460
[01:08:06] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1110460 (owner: TrainBranchBot)
[01:12:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:13:33] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ms-be2075 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T383530 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:13:39] <wikibugs>	 ops-codfw, SRE, DC-Ops: Degraded RAID on ms-be2075 - https://phabricator.wikimedia.org/T383530 (ops-monitoring-bot) NEW
[01:28:29] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1110460 (owner: TrainBranchBot)
[01:57:16] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:58:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1076-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:05:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:19:52] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:21:44] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29349 bytes in 1.138 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:12:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:26:20] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:57:16] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:17:16] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:19:21] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:40:20] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:17:20] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:24:43] <wikibugs>	 (PS1) Jelto: Rename kubernetes20[42-44] to wikikube-worker220[3-5] [puppet] - https://gerrit.wikimedia.org/r/1110664 (https://phabricator.wikimedia.org/T377877)
[07:38:49] <wikibugs>	 (CR) Muehlenhoff: [C:+2] postgresql::user: Use wmflib::debian_postgresql_version [puppet] - https://gerrit.wikimedia.org/r/1108741 (owner: Muehlenhoff)
[07:41:44] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537 (MoritzMuehlenhoff) NEW
[07:41:46] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10452062 (MoritzMuehlenhoff) p:Triage→Medium
[07:58:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1076-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T0800). nyaa~
[08:00:05] <jouncebot>	 MatmaRex and DreamRimmer: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:05:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:25:30] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:27:46] <moritzm>	 !log updated netboot image for bookworm to 12.9 T383537
[08:27:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:50] <stashbot>	 T383537: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537
[08:28:28] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:30:28] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:34:12] <logmsgbot>	 !log hashar@deploy2002 Started deploy [integration/docroot@a81d82c]: build: Updating mediawiki/mediawiki-phan-config to 0.15.1
[08:34:22] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [integration/docroot@a81d82c]: build: Updating mediawiki/mediawiki-phan-config to 0.15.1 (duration: 00m 09s)
[08:43:30] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10452120 (MoritzMuehlenhoff)
[08:47:57] <wikibugs>	 (PS1) Marostegui: mariadb: Remove db2135 [puppet] - https://gerrit.wikimedia.org/r/1110718 (https://phabricator.wikimedia.org/T383426)
[08:49:03] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm when comparing with https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/platform/rbac.md#prometheus-rb" [deployment-charts] - https://gerrit.wikimedia.org/r/1109728 (https://phabricator.wikimedia.org/T383413) (owner: JMeybohm)
[08:49:06] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2135.codfw.wmnet
[08:51:07] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Remove db2135 [puppet] - https://gerrit.wikimedia.org/r/1110718 (https://phabricator.wikimedia.org/T383426) (owner: Marostegui)
[08:52:36] <wikibugs>	 (CR) Jelto: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109735 (https://phabricator.wikimedia.org/T383413) (owner: JMeybohm)
[08:53:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[08:57:19] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2135.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[08:57:34] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2135.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[08:57:34] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:57:35] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2135.codfw.wmnet
[08:58:17] <wikibugs>	 ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2135.codfw.wmnet - https://phabricator.wikimedia.org/T383426#10452155 (Marostegui) a:Marostegui→None
[08:59:44] <wikibugs>	 ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2135.codfw.wmnet - https://phabricator.wikimedia.org/T383426#10452160 (Marostegui) This is ready for #dc-ops
[09:06:03] <wikibugs>	 SRE, Infrastructure-Foundations, Observability-Logging, Puppet (Puppet 7.0): Switch rsyslog to use the new PKI infrastructure - https://phabricator.wikimedia.org/T347565#10452169 (fgiunchedi) Yes there's pki support though it needs to be enabled fleet wide. I'll update the task description
[09:08:12] <wikibugs>	 SRE, Infrastructure-Foundations, Observability-Logging, Puppet (Puppet 7.0): Switch rsyslog to use the new PKI infrastructure - https://phabricator.wikimedia.org/T347565#10452170 (fgiunchedi)
[09:12:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:13:14] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/1110721 (https://phabricator.wikimedia.org/T383276)
[09:17:06] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo and not P{cp4044.ulsfo.wmnet} and A:cp
[09:18:57] <wikibugs>	 (CR) Jelto: [C:+1] "looks good to me, PCC diff has changes for the `tlsCertFile` and `readOnlyPort`" [puppet] - https://gerrit.wikimedia.org/r/1109735 (https://phabricator.wikimedia.org/T383413) (owner: JMeybohm)
[09:23:54] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1109734 (https://phabricator.wikimedia.org/T383413) (owner: JMeybohm)
[09:25:29] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:28:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:29:00] <wikibugs>	 (CR) MVernon: [C:+1] cassandra: rotate target_version 'dev' to '4.x' [puppet] - https://gerrit.wikimedia.org/r/1109767 (https://phabricator.wikimedia.org/T380420) (owner: Eevans)
[09:30:12] <wikibugs>	 (CR) MVernon: [C:+1] cassandra: set target_dev to 4.x (no-op) [puppet] - https://gerrit.wikimedia.org/r/1109768 (https://phabricator.wikimedia.org/T380420) (owner: Eevans)
[09:31:41] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm, PCC looks as expected" [puppet] - https://gerrit.wikimedia.org/r/1109733 (https://phabricator.wikimedia.org/T383413) (owner: JMeybohm)
[09:36:49] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo and not P{cp4044.ulsfo.wmnet} and A:cp
[09:37:49] <wikibugs>	 (PS1) Muehlenhoff: ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724
[09:38:10] <wikibugs>	 (CR) CI reject: [V:-1] ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724 (owner: Muehlenhoff)
[09:38:22] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo and not P{cp4052.ulsfo.wmnet} and A:cp
[09:38:42] <icinga-wm>	 RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:38:54] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:39:22] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:41:20] <Amir1>	 !log dbmaint on pc5@eqiad (T382948)
[09:41:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:25] <stashbot>	 T382948: ParserCache is not deleting old rows after three months past the expiry in the secondary datacenter - https://phabricator.wikimedia.org/T382948
[09:42:28] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.upgrade for db2212.codfw.wmnet
[09:45:03] <wikibugs>	 (PS2) Muehlenhoff: ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724
[09:45:36] <wikibugs>	 (PS2) Ladsgroup: Add wikitech.wikimedia.org to list of local vhosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1109752 (https://phabricator.wikimedia.org/T376305)
[09:45:42] <wikibugs>	 (CR) Ladsgroup: [C:+2] Add wikitech.wikimedia.org to list of local vhosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1109752 (https://phabricator.wikimedia.org/T376305) (owner: Ladsgroup)
[09:46:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1109752 (https://phabricator.wikimedia.org/T376305) (owner: Ladsgroup)
[09:47:01] <wikibugs>	 (Merged) jenkins-bot: Add wikitech.wikimedia.org to list of local vhosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1109752 (https://phabricator.wikimedia.org/T376305) (owner: Ladsgroup)
[09:47:33] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2212.codfw.wmnet
[09:48:10] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1109752|Add wikitech.wikimedia.org to list of local vhosts (T376305)]]
[09:48:13] <stashbot>	 T376305: Wikitech notifications failing to load cross-wiki - https://phabricator.wikimedia.org/T376305
[09:48:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Switchover es5 eqiad master dbmaint T382569', diff saved to https://phabricator.wikimedia.org/P71987 and previous config saved to /var/cache/conftool/dbconfig/20250113-094833-marostegui.json
[09:48:36] <stashbot>	 T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569
[09:48:43] <wikibugs>	 (CR) JMeybohm: [C:+1] Rename kubernetes20[42-44] to wikikube-worker220[3-5] [puppet] - https://gerrit.wikimedia.org/r/1110664 (https://phabricator.wikimedia.org/T377877) (owner: Jelto)
[09:48:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1023', diff saved to https://phabricator.wikimedia.org/P71988 and previous config saved to /var/cache/conftool/dbconfig/20250113-094846-marostegui.json
[09:49:26] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es1023.eqiad.wmnet with reason: cloning
[09:49:30] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2212.codfw.wmnet with reason: Reboot
[09:49:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es1023.eqiad.wmnet with reason: cloning
[09:49:44] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2212.codfw.wmnet with reason: Reboot
[09:50:19] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1110724 (owner: Muehlenhoff)
[09:50:55] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize es1044 [puppet] - https://gerrit.wikimedia.org/r/1110725 (https://phabricator.wikimedia.org/T382569)
[09:51:23] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[2042-2044].codfw.wmnet
[09:51:55] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Productionize es1044 [puppet] - https://gerrit.wikimedia.org/r/1110725 (https://phabricator.wikimedia.org/T382569) (owner: Marostegui)
[09:55:45] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[2042-2044].codfw.wmnet
[09:55:57] <wikibugs>	 (CR) Jelto: [C:+2] Rename kubernetes20[42-44] to wikikube-worker220[3-5] [puppet] - https://gerrit.wikimedia.org/r/1110664 (https://phabricator.wikimedia.org/T377877) (owner: Jelto)
[09:56:46] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo and not P{cp4052.ulsfo.wmnet} and A:cp
[09:56:53] <wikibugs>	 (PS1) Marostegui: site.pp: Remove es1044 from insetup [puppet] - https://gerrit.wikimedia.org/r/1110727
[09:57:34] <wikibugs>	 (CR) Marostegui: [C:+2] site.pp: Remove es1044 from insetup [puppet] - https://gerrit.wikimedia.org/r/1110727 (owner: Marostegui)
[09:58:08] <wikibugs>	 (PS3) Muehlenhoff: ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724
[09:59:44] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2042 to wikikube-worker2203
[10:00:03] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1109752|Add wikitech.wikimedia.org to list of local vhosts (T376305)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:00:05] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[10:00:06] <stashbot>	 T376305: Wikitech notifications failing to load cross-wiki - https://phabricator.wikimedia.org/T376305
[10:00:36] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:00:51] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[10:01:32] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on pc1015.eqiad.wmnet with reason: cloning
[10:01:46] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc1015.eqiad.wmnet with reason: cloning
[10:02:10] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:02:25] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on pc1013.eqiad.wmnet with reason: cloning
[10:02:31] <wikibugs>	 (PS4) Muehlenhoff: ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724
[10:02:38] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc1013.eqiad.wmnet with reason: cloning
[10:02:49] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on pc2013.codfw.wmnet with reason: cloning
[10:03:03] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc2013.codfw.wmnet with reason: cloning
[10:03:32] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2042 to wikikube-worker2203 - jelto@cumin1002"
[10:03:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2042 to wikikube-worker2203 - jelto@cumin1002"
[10:04:00] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:04:00] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2203
[10:04:30] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2203
[10:05:09] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2042 to wikikube-worker2203
[10:05:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc3 T383398', diff saved to https://phabricator.wikimedia.org/P71989 and previous config saved to /var/cache/conftool/dbconfig/20250113-100554-marostegui.json
[10:05:58] <stashbot>	 T383398: Reorganize and clean existing pc1-pc5 sections - https://phabricator.wikimedia.org/T383398
[10:06:44] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for pc2013.codfw.wmnet
[10:07:20] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1110724 (owner: Muehlenhoff)
[10:07:23] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for pc1015.eqiad.wmnet
[10:07:44] <marostegui>	 !log Upgrade pc2013 pc1015 pc3 dbmaint eqiad codfw T383398
[10:07:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:57] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2043 to wikikube-worker2204
[10:08:18] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[10:08:48] <wikibugs>	 (PS1) Marostegui: pc1015: Move to pc5 [puppet] - https://gerrit.wikimedia.org/r/1110729 (https://phabricator.wikimedia.org/T383398)
[10:08:53] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "This LGTM, please note that other check types like http and tcp will have to be fixed (here or in a separate review)" [puppet] - https://gerrit.wikimedia.org/r/1100782 (https://phabricator.wikimedia.org/T381561) (owner: Tiziano Fogli)
[10:09:07] <wikibugs>	 (CR) Marostegui: [C:-2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/1110729 (https://phabricator.wikimedia.org/T383398) (owner: Marostegui)
[10:09:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on kubernetes2044:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2044 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:10:04] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: wdqs - wdqs-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[10:10:38] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109752|Add wikitech.wikimedia.org to list of local vhosts (T376305)]] (duration: 22m 28s)
[10:10:42] <stashbot>	 T376305: Wikitech notifications failing to load cross-wiki - https://phabricator.wikimedia.org/T376305
[10:11:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Fix weights in pc3', diff saved to https://phabricator.wikimedia.org/P71990 and previous config saved to /var/cache/conftool/dbconfig/20250113-101132-marostegui.json
[10:12:12] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2043 to wikikube-worker2204 - jelto@cumin1002"
[10:12:13] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for pc2013.codfw.wmnet
[10:13:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2043 to wikikube-worker2204 - jelto@cumin1002"
[10:13:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:13:02] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2204
[10:13:22] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for pc1015.eqiad.wmnet
[10:13:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2204
[10:14:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2043 to wikikube-worker2204
[10:14:23] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2044 to wikikube-worker2205
[10:14:43] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[10:15:36] <wikibugs>	 (PS5) Muehlenhoff: ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724 (https://phabricator.wikimedia.org/T309724)
[10:16:14] <wikibugs>	 (CR) STran: [C:+2] ipoid: Bump activeDeadlineSeconds to 1 week [deployment-charts] - https://gerrit.wikimedia.org/r/1109723 (https://phabricator.wikimedia.org/T374414) (owner: STran)
[10:16:30] <wikibugs>	 (CR) Marostegui: pc1015: Move to pc5 [puppet] - https://gerrit.wikimedia.org/r/1110729 (https://phabricator.wikimedia.org/T383398) (owner: Marostegui)
[10:16:34] <wikibugs>	 (CR) Marostegui: [C:+2] pc1015: Move to pc5 [puppet] - https://gerrit.wikimedia.org/r/1110729 (https://phabricator.wikimedia.org/T383398) (owner: Marostegui)
[10:17:56] <wikibugs>	 (Merged) jenkins-bot: ipoid: Bump activeDeadlineSeconds to 1 week [deployment-charts] - https://gerrit.wikimedia.org/r/1109723 (https://phabricator.wikimedia.org/T374414) (owner: STran)
[10:18:07] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2044 to wikikube-worker2205 - jelto@cumin1002"
[10:18:32] <wikibugs>	 (PS1) Marostegui: pc1013: Make it pc3 master [puppet] - https://gerrit.wikimedia.org/r/1110731 (https://phabricator.wikimedia.org/T383398)
[10:18:57] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2044 to wikikube-worker2205 - jelto@cumin1002"
[10:18:57] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:18:58] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2205
[10:19:08] <wikibugs>	 (CR) Marostegui: [C:+2] pc1013: Make it pc3 master [puppet] - https://gerrit.wikimedia.org/r/1110731 (https://phabricator.wikimedia.org/T383398) (owner: Marostegui)
[10:20:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove pc1015 from pc3', diff saved to https://phabricator.wikimedia.org/P71991 and previous config saved to /var/cache/conftool/dbconfig/20250113-102047-marostegui.json
[10:21:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Make pc1013 master in pc3 T383398', diff saved to https://phabricator.wikimedia.org/P71992 and previous config saved to /var/cache/conftool/dbconfig/20250113-102152-marostegui.json
[10:21:56] <stashbot>	 T383398: Reorganize and clean existing pc1-pc5 sections - https://phabricator.wikimedia.org/T383398
[10:23:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc3 T383398', diff saved to https://phabricator.wikimedia.org/P71993 and previous config saved to /var/cache/conftool/dbconfig/20250113-102343-marostegui.json
[10:24:30] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2205
[10:25:08] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2044 to wikikube-worker2205
[10:25:37] <logmsgbot>	 !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[10:25:39] <logmsgbot>	 !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[10:26:21] <logmsgbot>	 !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[10:26:24] <logmsgbot>	 !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[10:27:15] <wikibugs>	 (PS1) Slyngshede: Provide additional information about users [software/bitu] - https://gerrit.wikimedia.org/r/1110732 (https://phabricator.wikimedia.org/T383201)
[10:28:07] <logmsgbot>	 !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[10:28:22] <logmsgbot>	 !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[10:33:36] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on pc1015.eqiad.wmnet with reason: cloning
[10:33:49] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc1015.eqiad.wmnet with reason: cloning
[10:36:23] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2203.codfw.wmnet wikikube-worker2204.codfw.wmnet wikikube-worker2205.codfw.wmnet on all recursors
[10:36:26] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2203.codfw.wmnet wikikube-worker2204.codfw.wmnet wikikube-worker2205.codfw.wmnet on all recursors
[10:36:51] <wikibugs>	 (PS1) STran: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1110735 (https://phabricator.wikimedia.org/T374414)
[10:38:06] <wikibugs>	 SRE-swift-storage, UploadWizard, Unstewarded-production-error, Wikimedia-production-error: "Could not store upload in the stash (UploadStashFileException)" for 2.4 GiB TIF file - https://phabricator.wikimedia.org/T285341#10452580 (MatthewVernon) I'm glad this worked on the second attempt. I'v...
[10:39:47] <wikibugs>	 (PS2) Ladsgroup: mediawiki: Add Uncategorizedpages cron for commonswiki [puppet] - https://gerrit.wikimedia.org/r/1109526 (https://phabricator.wikimedia.org/T369024)
[10:41:13] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2203.codfw.wmnet with OS bookworm
[10:41:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P71994 and previous config saved to /var/cache/conftool/dbconfig/20250113-104115-root.json
[10:41:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2203
[10:41:30] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[10:41:57] <wikibugs>	 (CR) Clément Goubert: [C:+1] decom wikikube-worker10[08-10,13,14,17,18] [puppet] - https://gerrit.wikimedia.org/r/1109712 (https://phabricator.wikimedia.org/T375842) (owner: Kamila Součková)
[10:42:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P71995 and previous config saved to /var/cache/conftool/dbconfig/20250113-104250-root.json
[10:43:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2240', diff saved to https://phabricator.wikimedia.org/P71996 and previous config saved to /var/cache/conftool/dbconfig/20250113-104310-marostegui.json
[10:43:22] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Outdated cookbooks cleanup - https://phabricator.wikimedia.org/T379259#10452589 (Volans) @BTullis following up from our chat on [[ https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1104950/2/cookbooks/sre/aqs/__init__.py | this CR ]], when you have a chance le...
[10:43:45] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2240.codfw.wmnet
[10:44:51] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2203 - jelto@cumin1002"
[10:44:56] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2203 - jelto@cumin1002"
[10:44:56] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:44:56] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2203.codfw.wmnet 165.32.192.10.in-addr.arpa 5.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:44:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2203.codfw.wmnet 165.32.192.10.in-addr.arpa 5.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:44:59] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2203
[10:45:11] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2203
[10:45:11] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2203
[10:45:29] <wikibugs>	 (PS1) Marostegui: mariadb: Remove db2123 [puppet] - https://gerrit.wikimedia.org/r/1110737 (https://phabricator.wikimedia.org/T383388)
[10:46:01] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2123.codfw.wmnet
[10:47:22] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Remove db2123 [puppet] - https://gerrit.wikimedia.org/r/1110737 (https://phabricator.wikimedia.org/T383388) (owner: Marostegui)
[10:49:34] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2240.codfw.wmnet
[10:50:44] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[10:54:54] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2123.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[10:55:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2123.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[10:55:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:55:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2123.codfw.wmnet
[10:55:40] <icinga-wm>	 PROBLEM - SSH on bast2003 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:56:23] <wikibugs>	 ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2123.codfw.wmnet - https://phabricator.wikimedia.org/T383388#10452668 (Marostegui) a:Marostegui→None
[10:56:26] <wikibugs>	 ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2123.codfw.wmnet - https://phabricator.wikimedia.org/T383388#10452673 (Marostegui) This is ready for #dc-ops
[10:56:40] <icinga-wm>	 RECOVERY - SSH on bast2003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:57:47] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2204.codfw.wmnet with OS bookworm
[10:57:57] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2204
[10:58:07] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[10:58:15] <wikibugs>	 (CR) Marostegui: [C:+1] Add new file tables to WMCS views [puppet] - https://gerrit.wikimedia.org/r/1110046 (https://phabricator.wikimedia.org/T383491) (owner: Ladsgroup)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1100)
[11:00:22] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2003.codfw.wmnet with reason: os upgrade
[11:00:37] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2003.codfw.wmnet with reason: os upgrade
[11:00:48] <wikibugs>	 SRE, Observability-Logging, serviceops, WMF-General-or-Unknown: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078#10452687 (Clement_Goubert) >>! In T187078#10446147, @andrea.denisse wrote: > I think that having a list of the...
[11:01:28] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[11:01:32] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2204 - jelto@cumin1002"
[11:01:37] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2204 - jelto@cumin1002"
[11:01:37] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:01:37] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2204.codfw.wmnet 164.32.192.10.in-addr.arpa 4.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[11:01:40] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2204.codfw.wmnet 164.32.192.10.in-addr.arpa 4.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[11:01:40] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2204
[11:01:56] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2204
[11:01:56] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2204
[11:02:52] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[11:03:07] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2205.codfw.wmnet with OS bookworm
[11:03:18] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2205
[11:03:31] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[11:03:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2123 from dbctl for decommission', diff saved to https://phabricator.wikimedia.org/P71997 and previous config saved to /var/cache/conftool/dbconfig/20250113-110333-marostegui.json
[11:03:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P71998 and previous config saved to /var/cache/conftool/dbconfig/20250113-110336-root.json
[11:04:04] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2203.codfw.wmnet with reason: host reimage
[11:05:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db[2133,2160].codfw.wmnet with reason: cloning
[11:05:31] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2133,2160].codfw.wmnet with reason: cloning
[11:06:28] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[11:06:50] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2205 - jelto@cumin1002"
[11:06:53] <wikibugs>	 SRE, Cassandra, RESTBase-Cassandra: restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424#10452718 (fgiunchedi) Untagging o11y, AFAIK we haven't seen a reoccurrence of this. Though please reach out if things change
[11:06:55] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2205 - jelto@cumin1002"
[11:06:55] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:06:55] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2205.codfw.wmnet 230.48.192.10.in-addr.arpa 0.3.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[11:06:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2205.codfw.wmnet 230.48.192.10.in-addr.arpa 0.3.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[11:06:58] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2205
[11:07:11] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2203.codfw.wmnet with reason: host reimage
[11:07:41] <wikibugs>	 (CR) FNegri: "AFAIU the current owners of the views definition are the Data Engineering team, so they should +1 this patch, but I'm not sure who exactly" [puppet] - https://gerrit.wikimedia.org/r/1110046 (https://phabricator.wikimedia.org/T383491) (owner: Ladsgroup)
[11:07:51] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[11:08:22] <wikibugs>	 SRE, observability, Observability-Logging, Wikimedia-Logstash, Patch-For-Review: Move iegreview from udp2log to syslog - https://phabricator.wikimedia.org/T215497#10452724 (fgiunchedi) Open→Invalid iegreview is gone: {T334415}
[11:09:43] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2205
[11:09:43] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2205
[11:09:56] <moritzm>	 !log installing pymysql security updates
[11:09:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:31] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db2233 [puppet] - https://gerrit.wikimedia.org/r/1110741 (https://phabricator.wikimedia.org/T373579)
[11:12:16] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Productionize db2233 [puppet] - https://gerrit.wikimedia.org/r/1110741 (https://phabricator.wikimedia.org/T373579) (owner: Marostegui)
[11:14:56] <wikibugs>	 SRE, Observability-Logging, Security: ulog: filter out diffscan from ulog - https://phabricator.wikimedia.org/T265590#10452755 (fgiunchedi) Open→Declined I'm declining this I don't think it has been a problem in practice
[11:18:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72000 and previous config saved to /var/cache/conftool/dbconfig/20250113-111842-root.json
[11:18:58] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160,2232].codfw.wmnet with reason: maintenance
[11:18:59] <wikibugs>	 SRE, Observability-Logging: Develop tooling for quickly parsing 5xx and sampled-1000 logs - https://phabricator.wikimedia.org/T292682#10452789 (fgiunchedi) Open→Declined Nowadays we have sampled webrequest available in superset and related dashboards, 5xx feed is in logstash though we could a...
[11:19:14] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160,2232].codfw.wmnet with reason: maintenance
[11:19:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:20:19] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2204.codfw.wmnet with reason: host reimage
[11:20:21] <wikibugs>	 SRE, Observability-Logging: Ingest webrequest sampled 1000 into logstash - https://phabricator.wikimedia.org/T301110#10452806 (fgiunchedi) Open→Declined I'm declining the task since webrequest sampled is available in superset and AFAIK that has been working well for SRE without the need to ac...
[11:20:21] <marostegui>	 !log Move db2160:3322 under db2232 in m2 codfw dbmaint T373579
[11:20:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:24] <stashbot>	 T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579
[11:21:05] <wikibugs>	 (CR) Máté Szabó: [C:+2] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1110735 (https://phabricator.wikimedia.org/T374414) (owner: STran)
[11:22:29] <wikibugs>	 (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1110735 (https://phabricator.wikimedia.org/T374414) (owner: STran)
[11:24:05] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2204.codfw.wmnet with reason: host reimage
[11:24:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db[2132,2160,2232].codfw.wmnet with reason: maintenance
[11:24:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2132,2160,2232].codfw.wmnet with reason: maintenance
[11:24:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:25:55] <wikibugs>	 (PS1) Muehlenhoff: Revert "Remove access for aitolkyn" [puppet] - https://gerrit.wikimedia.org/r/1110743
[11:25:55] <wikibugs>	 (PS1) Muehlenhoff: Bump access date and update point of contact [puppet] - https://gerrit.wikimedia.org/r/1110744
[11:26:30] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2203.codfw.wmnet with OS bookworm
[11:28:04] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2205.codfw.wmnet with reason: host reimage
[11:29:42] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:31:32] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2205.codfw.wmnet with reason: host reimage
[11:33:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72002 and previous config saved to /var/cache/conftool/dbconfig/20250113-113347-root.json
[11:38:10] <wikibugs>	 (CR) JMeybohm: [C:+2] admin_ng RBAC: Fix prometheus clusterrole [deployment-charts] - https://gerrit.wikimedia.org/r/1109728 (https://phabricator.wikimedia.org/T383413) (owner: JMeybohm)
[11:40:33] <wikibugs>	 (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109733 (https://phabricator.wikimedia.org/T383413) (owner: JMeybohm)
[11:41:47] <wikibugs>	 (Merged) jenkins-bot: admin_ng RBAC: Fix prometheus clusterrole [deployment-charts] - https://gerrit.wikimedia.org/r/1109728 (https://phabricator.wikimedia.org/T383413) (owner: JMeybohm)
[11:43:51] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2204.codfw.wmnet with OS bookworm
[11:44:35] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[11:44:37] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[11:44:39] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[11:44:44] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[11:44:45] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:44:49] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:44:51] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[11:44:57] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[11:44:58] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[11:45:02] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[11:45:03] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[11:45:10] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[11:45:11] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[11:45:18] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[11:45:20] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:45:23] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:45:24] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:45:28] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:46:49] <logmsgbot>	 !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[11:48:02] <logmsgbot>	 !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[11:48:31] <logmsgbot>	 !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[11:48:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72004 and previous config saved to /var/cache/conftool/dbconfig/20250113-114852-root.json
[11:48:59] <Reedy>	 jouncebot: nowandnext
[11:48:59] <jouncebot>	 For the next 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1100)
[11:48:59] <jouncebot>	 In 1 hour(s) and 11 minute(s): Create new tables for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1300)
[11:49:14] <logmsgbot>	 !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[11:49:18] <wikibugs>	 (PS1) Reedy: Fix exceptions preventing user from continuing past license deeds [extensions/UploadWizard] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1110750 (https://phabricator.wikimedia.org/T383415)
[11:49:25] <wikibugs>	 (CR) Reedy: [C:+2] Fix exceptions preventing user from continuing past license deeds [extensions/UploadWizard] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1110750 (https://phabricator.wikimedia.org/T383415) (owner: Reedy)
[11:49:37] <logmsgbot>	 !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[11:50:07] <logmsgbot>	 !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[11:50:36] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2205.codfw.wmnet with OS bookworm
[11:50:48] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Degraded RAID due to failed sdy on ms-be2075 - https://phabricator.wikimedia.org/T383530#10452947 (MatthewVernon) p:Triage→High
[11:50:55] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1110743 (owner: Muehlenhoff)
[11:51:19] <jayme>	 !log disabling puppet on all hosts running kubelet - T383413
[11:51:21] <jelto>	 !log homer 'lsw1-c6-codfw*' commit 'T377877'
[11:51:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:22] <stashbot>	 T383413: Remove the kubelet readOnlyPort - https://phabricator.wikimedia.org/T383413
[11:51:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:26] <stashbot>	 T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877
[11:51:36] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10452956 (MatthewVernon) @Jhancock.wm one of the SSDs in this host looks unhappy now too (T383530), could you get that looked at at the same time, please?
[11:52:27] <wikibugs>	 (CR) JMeybohm: [C:+2] kubelet: Use the chained certificate for TLS [puppet] - https://gerrit.wikimedia.org/r/1109733 (https://phabricator.wikimedia.org/T383413) (owner: JMeybohm)
[11:52:51] <wikibugs>	 (PS2) Ladsgroup: Add new file tables to WMCS views [puppet] - https://gerrit.wikimedia.org/r/1110046 (https://phabricator.wikimedia.org/T383491)
[11:53:06] <jelto>	 !log homer 'lsw1-d1-codfw*' commit 'T377877'
[11:53:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:58] <jelto>	 !log homer 'cr*codfw*' commit 'T377877'
[11:54:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:31] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 140, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:56:00] <wikibugs>	 (CR) FNegri: Add new file tables to WMCS views (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1110046 (https://phabricator.wikimedia.org/T383491) (owner: Ladsgroup)
[11:57:55] <jayme>	 !log re-enabling puppet on all hosts running kubelet - T383413
[11:57:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:58] <stashbot>	 T383413: Remove the kubelet readOnlyPort - https://phabricator.wikimedia.org/T383413
[11:58:14] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2203-2205].codfw.wmnet
[11:58:17] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2203-2205].codfw.wmnet
[11:58:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1076-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:00:58] <wikibugs>	 ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383341#10452988 (Jelto)
[12:02:49] <wikibugs>	 (Merged) jenkins-bot: Fix exceptions preventing user from continuing past license deeds [extensions/UploadWizard] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1110750 (https://phabricator.wikimedia.org/T383415) (owner: Reedy)
[12:02:52] <wikibugs>	 (PS4) Hnowlan: rest-gateway: add params to config, rework citoid path matching [deployment-charts] - https://gerrit.wikimedia.org/r/973362 (https://phabricator.wikimedia.org/T329049)
[12:04:56] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Revert "Remove access for aitolkyn" [puppet] - https://gerrit.wikimedia.org/r/1110743 (owner: Muehlenhoff)
[12:05:11] <wikibugs>	 (PS1) Jelto: Rename kubernetes20[40-41] to wikikube-worker220[6-7] [puppet] - https://gerrit.wikimedia.org/r/1110752 (https://phabricator.wikimedia.org/T377877)
[12:10:34] <wikibugs>	 (PS2) Muehlenhoff: Bump access date and update point of contact [puppet] - https://gerrit.wikimedia.org/r/1110744
[12:12:19] <wikibugs>	 (PS1) Reedy: Improve error summary [extensions/UploadWizard] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1110757 (https://phabricator.wikimedia.org/T381333)
[12:12:31] <wikibugs>	 (PS2) Reedy: Fix UW error summary [extensions/UploadWizard] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1110755 (https://phabricator.wikimedia.org/T383182)
[12:13:23] <wikibugs>	 (PS1) Marostegui: dbproxy2005: Change m1 master [puppet] - https://gerrit.wikimedia.org/r/1110758 (https://phabricator.wikimedia.org/T373579)
[12:13:55] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Bump access date and update point of contact [puppet] - https://gerrit.wikimedia.org/r/1110744 (owner: Muehlenhoff)
[12:17:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1008-1010,1013-1014,1017-1018].eqiad.wmnet
[12:17:58] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware, and 2 others: decommission  mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10453015 (ops-monitoring-bot) depool host wikikube-worker[1008-1010,1013-1014,1017-1018].eqiad.wmnet by kamila@cumin1002 with reason: Decommissioning nodes
[12:18:37] <logmsgbot>	 !log reedy@deploy2002 Synchronized php-1.44.0-wmf.11/extensions/UploadWizard/: T383415 (duration: 13m 05s)
[12:18:40] <stashbot>	 T383415: [wmf.11 - regression] Custom tags not working with UploadWizard - https://phabricator.wikimedia.org/T383415
[12:21:44] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1008-1010,1013-1014,1017-1018].eqiad.wmnet
[12:21:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware, and 2 others: decommission  mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10453027 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by kamila@cumin1002 depool for host wikikube-worker[1008-1010,1013-1014,1017-1018]...
[12:24:02] <wikibugs>	 (PS1) Marostegui: wmnet: Change m2 master [dns] - https://gerrit.wikimedia.org/r/1110763
[12:24:29] <wikibugs>	 (CR) Marostegui: "Once the key is verified, this has my +1" [puppet] - https://gerrit.wikimedia.org/r/1109716 (owner: Federico Ceratto)
[12:25:25] <marostegui>	 !log Switch m2-master proxy
[12:25:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:00] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] wmnet: Change m2 master [dns] - 10https://gerrit.wikimedia.org/r/1110763 (owner: 10Marostegui)
[12:26:04] <logmsgbot>	 !log marostegui@dns1006 START - running authdns-update
[12:26:43] <wikibugs>	 (03CR) 10MVernon: [C:03+1] "mvernon@cumin2002:~$ host db2132" [puppet] - 10https://gerrit.wikimedia.org/r/1110758 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui)
[12:27:09] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] dbproxy2005: Change m1 master [puppet] - 10https://gerrit.wikimedia.org/r/1110758 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui)
[12:27:23] <marostegui>	 moritzm: ok to merge your change?
[12:27:46] <logmsgbot>	 !log marostegui@dns1006 END - running authdns-update
[12:27:49] <marostegui>	 moritzm: It looks very safe to merge, so mergning
[12:27:51] <marostegui>	 merging 
[12:28:22] <moritzm>	 marostegui: sorry, yes
[12:28:28] <marostegui>	 moritzm: Merged :(
[12:28:30] <marostegui>	 :)
[12:28:33] <moritzm>	 thx :-)
[12:32:18] <wikibugs>	 06SRE: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557 (10MoritzMuehlenhoff) 03NEW
[12:32:25] <wikibugs>	 06SRE: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#10453048 (10MoritzMuehlenhoff) p:05Triage→03High
[12:42:40] <wikibugs>	 (03PS1) 10Muehlenhoff: codesearch: Remove obsolete apt pinning code for buster [puppet] - 10https://gerrit.wikimedia.org/r/1110767 (https://phabricator.wikimedia.org/T367479)
[12:43:18] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] decom wikikube-worker10[08-10,13,14,17,18] [puppet] - 10https://gerrit.wikimedia.org/r/1109712 (https://phabricator.wikimedia.org/T375842) (owner: 10Kamila Součková)
[12:47:55] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv
[12:47:55] <icinga-wm>	 e - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:47:55] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv
[12:47:55] <icinga-wm>	 e - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:49:57] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1008-1010].eqiad.wmnet
[12:50:47] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts wikikube-worker[1008-1010].eqiad.wmnet
[12:53:26] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1008-1010].eqiad.wmnet
[12:57:16] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:57:40] <jinxer-wm>	 FIRING: [6x] KubernetesRsyslogDown: rsyslog on wikikube-worker1009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:59:21] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:00:04] <jouncebot>	 Daimona: It is that lovely time of the day again! You are hereby commanded to deploy Create new tables for the CampaignEvents extension. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1300).
[13:03:31] <Daimona>	 o/
[13:03:37] <cmelo>	 o/
[13:04:29] <Daimona>	 I guess we can get started?
[13:07:01] <cmelo>	 yes
[13:07:52] <wikibugs>	 (03PS1) 10Muehlenhoff: stat: Don't install go from backports [puppet] - 10https://gerrit.wikimedia.org/r/1110769 (https://phabricator.wikimedia.org/T383557)
[13:08:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] stat: Don't install go from backports [puppet] - 10https://gerrit.wikimedia.org/r/1110769 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff)
[13:10:33] <Daimona>	 !log Creating new DB tables for the CampaignEvents extension in x1.testwiki, x1.test2wiki, x1.officewiki, and x1.wikishared # T379294 T381424
[13:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:38] <stashbot>	 T379294: Create new DB table for storing wikis of event - https://phabricator.wikimedia.org/T379294
[13:10:38] <stashbot>	 T381424: Create DB schema for storing topics of event - https://phabricator.wikimedia.org/T381424
[13:12:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:13:59] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[13:14:08] <Daimona>	 cmelo: I have created all tables everywhere. Can you test on metawiki that there's nothing broken?
[13:14:20] <Daimona>	 I'll do testwiki
[13:15:32] <wikibugs>	 (03PS2) 10Muehlenhoff: stat: Don't install go from backports [puppet] - 10https://gerrit.wikimedia.org/r/1110769 (https://phabricator.wikimedia.org/T383557)
[13:16:51] <Daimona>	 And it seems fine to me
[13:17:23] <cmelo>	 ok
[13:18:12] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1008-1010].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002"
[13:18:40] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1008-1010].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002"
[13:18:41] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:18:41] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-worker[1008-1010].eqiad.wmnet
[13:18:52] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission  mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10453142 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: `wikikube-worker[1008-1010].eqiad.wmnet` - wikikube-worker100...
[13:19:24] <Daimona>	 Is meta ok? If so, we're done
[13:19:53] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1013-1014,1017-1018].eqiad.wmnet
[13:20:49] <wikibugs>	 (03PS6) 10Cathal Mooney: Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207)
[13:20:55] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:20:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:20:57] <jinxer-wm>	 FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:21:18] <cmelo>	 still testing on meta, sorry having password issues
[13:21:40] <arnaudb>	 just received a page
[13:21:48] <godog>	 checking 
[13:21:55] <godog>	 arnaudb: ^
[13:21:56] <arnaudb>	 !ack 74593
[13:21:56] <sirenbot>	 Attempt to ack incident 74593 failed.
[13:22:03] <godog>	 !incidents
[13:22:03] <sirenbot>	 5588 (ACKED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[13:22:06] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, verified key with Federico on a quick call." [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto)
[13:22:15] <bblack>	 I guess it did work :)
[13:22:21] <icinga-wm>	 PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:22:23] <godog>	 I acked it too from the app, maybe that's why
[13:22:30] <arnaudb>	 ah it was the inapp num, mybad haha
[13:23:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:24:09] <godog>	 !log bounce thanos-query on titan1*
[13:24:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:50] <cmelo>	 Ok tested!!
[13:25:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:25:34] <Daimona>	 OK great! I'm going to close the tasks then, and see you again here in 35 minutes :)
[13:25:46] <godog>	 !log bounce thanos-store on titan1*
[13:25:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:49] <cmelo>	 thank you!!!
[13:25:50] <wikibugs>	 (03PS2) 10Kamila Součková: kubernetes: reclaim eqiad videoscaler hosts [puppet] - 10https://gerrit.wikimedia.org/r/1109469 (https://phabricator.wikimedia.org/T354791)
[13:27:04] <godog>	 actually I'll depool eqiad from thanos
[13:27:24] <logmsgbot>	 !log filippo@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=thanos-swift,name=eqiad
[13:27:54] <logmsgbot>	 !log filippo@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=thanos-swift,name=eqiad
[13:28:03] <logmsgbot>	 !log filippo@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=thanos-query,name=eqiad
[13:28:15] <godog>	 that was my bad, I depooled thanos-swift not thanos-query, now fixed
[13:28:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:31:10] <wikibugs>	 (03PS2) 10Jelto: Rename kubernetes20[40-41] to wikikube-worker220[6-7] [puppet] - 10https://gerrit.wikimedia.org/r/1110752 (https://phabricator.wikimedia.org/T377877)
[13:31:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney)
[13:31:55] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:31:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:32:39] <godog>	 should recover soon
[13:33:42] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:35:00] <wikibugs>	 (03CR) 10AikoChou: [C:03+1] api-gateway: add reference quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109666 (https://phabricator.wikimedia.org/T378495) (owner: 10Ilias Sarantopoulos)
[13:35:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:38:09] <godog>	 bblack arnaudb I suspect root cause was a query of death, I'll dig deeper shortly and going back to lunch
[13:38:24] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:38:27] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db[2134,2160,2234].codfw.wmnet with reason: maintenance
[13:38:43] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2134,2160,2234].codfw.wmnet with reason: maintenance
[13:39:17] <icinga-wm>	 PROBLEM - Host dbprov2003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:23] <wikibugs>	 (03CR) 10Klausman: [V:03+2 C:03+2] api-gateway: add reference quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109666 (https://phabricator.wikimedia.org/T378495) (owner: 10Ilias Sarantopoulos)
[13:40:03] <wikibugs>	 (03CR) 10Jforrester: "Oops. Yes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109752 (https://phabricator.wikimedia.org/T376305) (owner: 10Ladsgroup)
[13:40:42] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: add reference quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109666 (https://phabricator.wikimedia.org/T378495) (owner: 10Ilias Sarantopoulos)
[13:40:44] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874#10453212 (10MatthewVernon) @VRiley-WMF when I look at the system now, the OS sees the extra disk (since 19:02:51 on 10 Jan, a few minutes after a reboot?). So I'm not sure what you...
[13:41:24] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[13:41:45] <icinga-wm>	 RECOVERY - Host dbprov2003 is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms
[13:42:21] <icinga-wm>	 RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:43:24] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:44:09] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize db2234 [puppet] - 10https://gerrit.wikimedia.org/r/1110772 (https://phabricator.wikimedia.org/T373579)
[13:45:07] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db2234 [puppet] - 10https://gerrit.wikimedia.org/r/1110772 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui)
[13:46:13] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1110752 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto)
[13:48:02] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1013-1014,1017-1018].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002"
[13:48:23] <wikibugs>	 (03PS1) 10Slyngshede: Notify managers of closed requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1110773
[13:48:45] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1013-1014,1017-1018].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002"
[13:48:45] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:48:46] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-worker[1013-1014,1017-1018].eqiad.wmnet
[13:48:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission  mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10453228 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: `wikikube-worker[1013-1014,1017-1018].eqiad.wmnet` - wikiku...
[13:49:35] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] Add myself (fceratto) to ops [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto)
[13:50:12] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "Amir, as you are the onboarding buddy, can you merge and deploy?" [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto)
[13:51:09] <wikibugs>	 (03CR) 10Ladsgroup: "Sure. I want to check with Moritz quickly and then check the key oob" [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto)
[13:52:10] <kamila_>	 !log homer cr*eqiad* commit 'wikikube decoms'
[13:52:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72006 and previous config saved to /var/cache/conftool/dbconfig/20250113-135410-root.json
[13:54:58] <kamila_>	 I caught someone's homer change again: something about `2a02:ec80:a000:fe01::1/64`, can I commit?
[13:55:22] <kamila_>	 diff: https://www.irccloud.com/pastebin/9ti08ybs/
[13:58:52] <elukey>	 kamila_: o/ for what device?
[13:59:26] <kamila_>	 elukey: cr1-eqiad
[13:59:39] <elukey>	 I can check commits but last week topranks was adding configs for cloud, I guess it is safe but I can tell you in a sec
[13:59:54] <kamila_>	 thanks a ton!
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1400). Please do the needful.
[14:00:05] <jouncebot>	 Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:21] <Daimona>	 o/
[14:02:05] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] prometheus::k8s: Move away from kubelet readOnlyPort [puppet] - 10https://gerrit.wikimedia.org/r/1109734 (https://phabricator.wikimedia.org/T383413) (owner: 10JMeybohm)
[14:02:42] <topranks>	 kamila_: sorry, I added so many addresses in Netbox on Friday I must have forgot that one 
[14:02:57] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[2040-2041].codfw.wmnet
[14:02:57] <topranks>	 it's ok to proceed thank you :) 
[14:03:01] <kamila_>	 no worries topranks :-)
[14:03:07] <kamila_>	 thanks!
[14:03:13] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1110769 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff)
[14:03:21] <elukey>	 ok perfect :)
[14:04:24] <logmsgbot>	 !log filippo@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=thanos-query,name=eqiad
[14:04:52] <Lucas_WMDE>	 I’m a bit busy rn but I can deploy if nobody else is available
[14:04:59] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[14:05:19] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[14:05:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] stat: Don't install go from backports [puppet] - 10https://gerrit.wikimedia.org/r/1110769 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff)
[14:05:55] <cmelo>	 o/
[14:06:15] <HouseOfM>	 o/
[14:06:43] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[2040-2041].codfw.wmnet
[14:07:25] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts eventlog1003.eqiad.wmnet
[14:07:33] <logmsgbot>	 !log klausman@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[14:08:06] <logmsgbot>	 !log klausman@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[14:09:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72008 and previous config saved to /var/cache/conftool/dbconfig/20250113-140916-root.json
[14:09:48] <wikibugs>	 (03CR) 10Jelto: [C:03+2] Rename kubernetes20[40-41] to wikikube-worker220[6-7] [puppet] - 10https://gerrit.wikimedia.org/r/1110752 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto)
[14:10:04] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: wdqs - wdqs-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:10:40] <wikibugs>	 (03CR) 10Muehlenhoff: "Just to close the loop; good to merge given the onboarding buddy thinks the onboarding has proceeded to the state where global root makes " [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto)
[14:11:12] <wikibugs>	 (03PS1) 10Btullis: Remove CNAME for eventlogging.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1110775 (https://phabricator.wikimedia.org/T383276)
[14:12:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/1110721 (https://phabricator.wikimedia.org/T383276) (owner: 10Muehlenhoff)
[14:13:40] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2040 to wikikube-worker2206
[14:14:01] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[14:14:13] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Late to the party, but thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1110721 (https://phabricator.wikimedia.org/T383276) (owner: 10Muehlenhoff)
[14:15:04] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:16:34] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:16:35] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2041 to wikikube-worker2207
[14:17:01] <Lucas_WMDE>	 alright, I should be able to deploy now
[14:17:04] <Daimona>	 No other deployers around I assume?
[14:17:11] <Daimona>	 Oh
[14:17:22] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2040 to wikikube-worker2206 - jelto@cumin1002"
[14:17:41] <Daimona>	 Lucas_WMDE: thank you! I'm sorry that it's always on you
[14:17:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2040 to wikikube-worker2206 - jelto@cumin1002"
[14:17:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:17:59] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2206
[14:18:16] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[14:18:20] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2206
[14:18:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109832 (https://phabricator.wikimedia.org/T380078) (owner: 10Daimona Eaytoy)
[14:18:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2040 to wikikube-worker2206
[14:19:04] <wikibugs>	 (03Merged) 10jenkins-bot: prod: Enable $wgCampaignEventsEnableEventWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109832 (https://phabricator.wikimedia.org/T380078) (owner: 10Daimona Eaytoy)
[14:19:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1109832|prod: Enable $wgCampaignEventsEnableEventWikis (T380078)]]
[14:19:26] <stashbot>	 T380078: Enable the event wikis feature in production - https://phabricator.wikimedia.org/T380078
[14:21:25] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.puppet.renew-cert for dbprov2003.codfw.wmnet: Renew puppet certificate - root@cumin1002
[14:21:42] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2041 to wikikube-worker2207 - jelto@cumin1002"
[14:21:46] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2041 to wikikube-worker2207 - jelto@cumin1002"
[14:21:46] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:21:46] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2207
[14:22:04] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2207
[14:22:22] <wikibugs>	 (03PS1) 10Ottomata: Revert "config: remove eventbus instrumentation setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110776
[14:22:42] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.dns.netbox
[14:22:42] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2041 to wikikube-worker2207
[14:23:21] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110776 (owner: 10Ottomata)
[14:23:31] <logmsgbot>	 !log klausman@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[14:24:02] <logmsgbot>	 !log klausman@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[14:24:13] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for dbprov2003.codfw.wmnet: Renew puppet certificate - root@cumin1002
[14:24:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72009 and previous config saved to /var/cache/conftool/dbconfig/20250113-142421-root.json
[14:24:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1109832|prod: Enable $wgCampaignEventsEnableEventWikis (T380078)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:24:26] <stashbot>	 T380078: Enable the event wikis feature in production - https://phabricator.wikimedia.org/T380078
[14:24:29] <Lucas_WMDE>	 Daimona: please test :)
[14:25:02] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting shell access to analytics-privatedata for Katherine Graessle - https://phabricator.wikimedia.org/T383241#10453348 (10Kgraessle) >>! In T383241#10450029, @Dzahn wrote: >> no such identity: /Users/katherinegraessle/.ssh/prod.key: No such file or directory >  > It is tryin...
[14:25:03] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:25:03] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts eventlog1003.eqiad.wmnet
[14:25:34] <wikibugs>	 (03PS1) 10Ottomata: Revert^2 "config: remove eventbus instrumentation setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110777
[14:25:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/1110775 (https://phabricator.wikimedia.org/T383276) (owner: 10Btullis)
[14:25:57] <wikibugs>	 (03Abandoned) 10Ottomata: Revert "config: remove eventbus instrumentation setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110776 (owner: 10Ottomata)
[14:26:00] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] Remove CNAME for eventlogging.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1110775 (https://phabricator.wikimedia.org/T383276) (owner: 10Btullis)
[14:26:06] <Daimona>	 I can do testwiki again, maybe cmelo you can do meta and HouseOfM you take officewiki?
[14:26:25] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in $wgFooterIcons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110053 (https://phabricator.wikimedia.org/T383501) (owner: 10Reedy)
[14:26:37] <Daimona>	 Wait
[14:26:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2206.codfw.wmnet wikikube-worker2207.codfw.wmnet on all recursors
[14:26:41] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2206.codfw.wmnet wikikube-worker2207.codfw.wmnet on all recursors
[14:26:52] <Daimona>	 Look around Ted, you're all alone
[14:27:02] <Lucas_WMDE>	 ohno
[14:27:22] <Daimona>	 Well I'm going to do testwiki for the time being :D
[14:29:27] <Lucas_WMDE>	 mwdebug logstash looks clear so far
[14:29:47] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Remove CNAME for eventlogging.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1110775 (https://phabricator.wikimedia.org/T383276) (owner: 10Btullis)
[14:30:04] <logmsgbot>	 !log btullis@dns1004 START - running authdns-update
[14:31:47] <logmsgbot>	 !log btullis@dns1004 END - running authdns-update
[14:33:37] <wikibugs>	 (03PS3) 10Tiziano Fogli: thanos-rule: search for gaps in thanos-rule recording rules [alerts] - 10https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756)
[14:34:56] <Daimona>	 I've done somewhat more extensive tests on testwiki and it looks good
[14:35:09] <Daimona>	 But beta logstash seems broken
[14:35:10] <wikibugs>	 (03PS1) 10Volans: enum: remove type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/1110778
[14:36:16] <Daimona>	 Ah, SNAFU, I see https://phabricator.wikimedia.org/T346402
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:16] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2206.codfw.wmnet with OS bookworm
[14:37:18] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2207.codfw.wmnet with OS bookworm
[14:37:26] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2206
[14:37:32] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[14:37:40] <Daimona>	 wait this isn't beta
[14:38:01] <Daimona>	 nvm, not enough caffeine
[14:39:04] <Lucas_WMDE>	 I was about to ask how this was relevant ^^
[14:39:21] <Lucas_WMDE>	 still nothing in mwdebug logstash
[14:39:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72010 and previous config saved to /var/cache/conftool/dbconfig/20250113-143926-root.json
[14:39:29] <Lucas_WMDE>	 one warning “inconsistent revision ID” and one info about executing pygmentize
[14:39:47] <Lucas_WMDE>	 hm
[14:39:52] <Lucas_WMDE>	 but the warning does come from https://test.wikipedia.org/wiki/Event:T380078?action=edit&veswitched=1
[14:39:53] <stashbot>	 T380078: Enable the event wikis feature in production - https://phabricator.wikimedia.org/T380078
[14:39:58] <Lucas_WMDE>	 that’s an awfully suspicious page title isn’t it
[14:40:24] * Lucas_WMDE codesearches
[14:40:34] <logmsgbot>	 !log otto@deploy2002 Started deploy [analytics/refinery@f3945ee] (hadoop-test): gobblin eventlogging_legacy - use EventStreamConfig to pull topics
[14:40:41] <Daimona>	 Yeah sorry, I just opened the right logstash :D Indeed, no errors. That's the page I was using to test, and I'm pretty sure that error happens all the time in prod
[14:40:41] <Lucas_WMDE>	 apparently it comes from this https://gerrit.wikimedia.org/g/mediawiki/core/+/f1f6f7cfe6494fb05b8f626b829897b84c0217d8/includes/parser/ParserCache.php#460
[14:40:53] <logmsgbot>	 !log dcausse@deploy2002 Started deploy [airflow-dags/search@8c96899]: search: fix glent, import_cirrus_indexes and transfer_to_es
[14:41:00] <logmsgbot>	 !log otto@deploy2002 Finished deploy [analytics/refinery@f3945ee] (hadoop-test): gobblin eventlogging_legacy - use EventStreamConfig to pull topics (duration: 01m 37s)
[14:41:04] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2206 - jelto@cumin1002"
[14:41:08] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2206 - jelto@cumin1002"
[14:41:08] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:41:08] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2206.codfw.wmnet 167.32.192.10.in-addr.arpa 7.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:41:11] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2206.codfw.wmnet 167.32.192.10.in-addr.arpa 7.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:41:11] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2206
[14:41:12] <Lucas_WMDE>	 you’re right, it’s the #5 entry on mediawiki-warnings
[14:41:14] <Lucas_WMDE>	 probably okay to ignore then
[14:41:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2206
[14:41:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2206
[14:41:32] <Lucas_WMDE>	 do you want to wait for the others or is it okay to deploy?
[14:41:56] <jayme>	 !log disabling puppet on all hosts running kubelet - T383413
[14:41:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:59] <stashbot>	 T383413: Remove the kubelet readOnlyPort - https://phabricator.wikimedia.org/T383413
[14:42:01] <Daimona>	 Yeah it's OK but I want to make sure that there's a task for it, because 8k errors in 15 minutes is definitely spam
[14:42:04] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2207
[14:42:09] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[14:42:37] <logmsgbot>	 !log dcausse@deploy2002 Finished deploy [airflow-dags/search@8c96899]: search: fix glent, import_cirrus_indexes and transfer_to_es (duration: 01m 44s)
[14:42:45] <Lucas_WMDE>	 T358708 apparently?
[14:42:45] <stashbot>	 T358708: Inconsistent revision ID - https://phabricator.wikimedia.org/T358708
[14:43:01] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] kubelet: Disable the readOnlyPort [puppet] - 10https://gerrit.wikimedia.org/r/1109735 (https://phabricator.wikimedia.org/T383413) (owner: 10JMeybohm)
[14:43:43] <Daimona>	 Yup just got there. I'm going to comment because from the task it isn't clear what the volume of these warnings is
[14:43:51] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Continuing with sync
[14:43:56] <Lucas_WMDE>	 ok
[14:44:33] <logmsgbot>	 !log otto@deploy2002 Started deploy [analytics/refinery@f3945ee]: gobblin eventlogging_legacy - use EventStreamConfig to pull topics
[14:45:34] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2207 - jelto@cumin1002"
[14:45:38] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2207 - jelto@cumin1002"
[14:45:38] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:45:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2207.codfw.wmnet 166.32.192.10.in-addr.arpa 6.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:45:41] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2207.codfw.wmnet 166.32.192.10.in-addr.arpa 6.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:45:42] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2207
[14:46:01] <logmsgbot>	 !log otto@deploy2002 Finished deploy [analytics/refinery@f3945ee]: gobblin eventlogging_legacy - use EventStreamConfig to pull topics (duration: 01m 27s)
[14:47:03] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2207
[14:47:03] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2207
[14:47:32] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "Looks good based on the description but I'll need to take your word this is the required fix :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1110778 (owner: 10Volans)
[14:48:28] <jayme>	 !log re-enabling puppet on all hosts running kubelet - T383413
[14:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:31] <stashbot>	 T383413: Remove the kubelet readOnlyPort - https://phabricator.wikimedia.org/T383413
[14:49:19] <moritzm>	 !log installing glibc bugfix updates for Bookworm
[14:49:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:23] <wikibugs>	 06SRE, 10Ceph, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10453489 (10cmooney) @dcaro is there anything left to be done here?  I see traffic profiled in the low and high classes across the cloud switc...
[14:51:27] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109832|prod: Enable $wgCampaignEventsEnableEventWikis (T380078)]] (duration: 32m 04s)
[14:51:30] <stashbot>	 T380078: Enable the event wikis feature in production - https://phabricator.wikimedia.org/T380078
[14:52:10] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:52:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:50] <Daimona>	 Noice, thank you!
[14:53:25] <cmelo>	 Thank you!
[14:53:44] <wikibugs>	 (03CR) 10Volans: [C:03+1] "As I was not involved directly in Federico's onboarding, my +1 is purely on the key verification part :)" [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto)
[14:53:47] <HouseOfM>	 :) wonderful as always Lucas_WMDE
[14:54:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72011 and previous config saved to /var/cache/conftool/dbconfig/20250113-145432-root.json
[14:55:18] <wikibugs>	 (03CR) 10FNegri: "Removing my -1 after discussing with Joanna and the rest of the WMCS team. While we might need more permission levels for other people in " [puppet] - 10https://gerrit.wikimedia.org/r/1087919 (https://phabricator.wikimedia.org/T379159) (owner: 10FNegri)
[14:55:25] <wikibugs>	 (03PS2) 10Volans: enum: remove type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/1110778
[14:55:35] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10cloud-services-team (FY2024/2025-Q1-Q2), 13Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10453508 (10fnegri) 05Declined→03Open Reopening after discussing with @joanna_borun and the rest of the WMCS team. Whi...
[14:55:45] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10cloud-services-team (FY2024/2025-Q3-Q4), 13Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10453510 (10fnegri)
[14:55:50] <Lucas_WMDE>	 np :)
[14:57:38] <wikibugs>	 06SRE, 10DNS, 06MediaWiki-Platform-Team, 06Traffic, and 2 others: Set up auth.wikimedia.org - https://phabricator.wikimedia.org/T377187#10453514 (10Tgr) a:03Tgr
[14:58:51] <logmsgbot>	 !log btullis@deploy2002 Started deploy [airflow-dags/search@8c96899]: (no justification provided)
[14:59:05] <wikibugs>	 (03PS1) 10Ottomata: Remove some unused eventlogging references [puppet] - 10https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230)
[14:59:13] <logmsgbot>	 !log btullis@deploy2002 Finished deploy [airflow-dags/search@8c96899]: (no justification provided) (duration: 00m 24s)
[14:59:59] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2206.codfw.wmnet with reason: host reimage
[15:01:36] <wikibugs>	 06SRE, 10DNS, 06MediaWiki-Platform-Team, 06Traffic, and 2 others: Set up auth.wikimedia.org - https://phabricator.wikimedia.org/T377187#10453543 (10Tgr) Notes from @elukey on IRC: > 17:12 < elukey> IIUC the config needs to run on the deployment servers via puppet run, so the correspondent yaml files for he...
[15:02:09] <wikibugs>	 (03PS1) 10Clément Goubert: mw-jobrunner: Log apache via rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110786 (https://phabricator.wikimedia.org/T293943)
[15:03:55] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2206.codfw.wmnet with reason: host reimage
[15:06:02] <wikibugs>	 (03CR) 10Btullis: "There are also references in site.pp and preseed.yaml as well as hieradata/role/common/kafka/jumbo/broker.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[15:06:04] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2207.codfw.wmnet with reason: host reimage
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:37] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] mw-jobrunner: Log apache via rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110786 (https://phabricator.wikimedia.org/T293943) (owner: 10Clément Goubert)
[15:09:42] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2207.codfw.wmnet with reason: host reimage
[15:09:58] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mw-jobrunner: Log apache via rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110786 (https://phabricator.wikimedia.org/T293943) (owner: 10Clément Goubert)
[15:11:04] <wikibugs>	 (03Merged) 10jenkins-bot: mw-jobrunner: Log apache via rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110786 (https://phabricator.wikimedia.org/T293943) (owner: 10Clément Goubert)
[15:11:56] <wikibugs>	 (03CR) 10Ladsgroup: "A lot of those parts have been done and the rest will be done in pair sessions. Given the seniority, I feel it's okay to go directly to th" [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto)
[15:12:43] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
[15:14:03] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
[15:15:15] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
[15:16:19] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] ganeti::known_hosts: Sort hosts [puppet] - 10https://gerrit.wikimedia.org/r/1110724 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff)
[15:16:29] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
[15:21:25] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ml-serve2001 - https://phabricator.wikimedia.org/T383242#10453639 (10Jhancock.wm) 05Open→03Declined side effect of T383225
[15:21:54] <wikibugs>	 (03CR) 10Herron: [C:03+1] "Thanks!  Couple of minor comments inline" [alerts] - 10https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756) (owner: 10Tiziano Fogli)
[15:21:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ml-serve2001 - https://phabricator.wikimedia.org/T383307#10453646 (10Jhancock.wm) 05Open→03Declined side effect of T383225
[15:22:37] <wikibugs>	 (03PS4) 10Tiziano Fogli: thanos-rule: search for gaps in thanos-rule recording rules [alerts] - 10https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756)
[15:23:28] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2206.codfw.wmnet with OS bookworm
[15:23:34] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db2234.codfw.wmnet with reason: maintenance
[15:23:37] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2234.codfw.wmnet with reason: maintenance
[15:28:08] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos-query: write active queries to file [puppet] - 10https://gerrit.wikimedia.org/r/1110798 (https://phabricator.wikimedia.org/T383570)
[15:28:12] <marostegui>	 sudo dbctl instance db2128 depool
[15:28:12] <marostegui>	 sudo dbctl config commit -m "Depool db2128 T383572"
[15:28:13] <stashbot>	 T383572: decommission db2128.codfw.wmnet - https://phabricator.wikimedia.org/T383572
[15:28:18] <marostegui>	 Great :)
[15:28:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2128 T383572', diff saved to https://phabricator.wikimedia.org/P72012 and previous config saved to /var/cache/conftool/dbconfig/20250113-152828-marostegui.json
[15:28:51] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2207.codfw.wmnet with OS bookworm
[15:29:05] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove db2128 [puppet] - 10https://gerrit.wikimedia.org/r/1110799 (https://phabricator.wikimedia.org/T383572)
[15:29:42] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove db2128 [puppet] - 10https://gerrit.wikimedia.org/r/1110799 (https://phabricator.wikimedia.org/T383572) (owner: 10Marostegui)
[15:30:17] <jelto>	 !log homer 'lsw1-c5-codfw*' commit 'T377877'
[15:30:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:21] <stashbot>	 T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877
[15:30:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2128 from dbctl T383572', diff saved to https://phabricator.wikimedia.org/P72013 and previous config saved to /var/cache/conftool/dbconfig/20250113-153046-marostegui.json
[15:31:29] <wikibugs>	 (03PS1) 10Marostegui: db2128: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1110800 (https://phabricator.wikimedia.org/T383572)
[15:31:34] <jelto>	 !log homer 'cr*codfw*' commit 'T377877'
[15:31:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:55] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2128: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1110800 (https://phabricator.wikimedia.org/T383572) (owner: 10Marostegui)
[15:32:12] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 136, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:32:32] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on dbprov2004.codfw.wmnet with reason: reboot
[15:32:46] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbprov2004.codfw.wmnet with reason: reboot
[15:32:59] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2206-2207].codfw.wmnet
[15:33:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2206-2207].codfw.wmnet
[15:33:46] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383341#10453759 (10Jelto)
[15:35:09] <wikibugs>	 (03CR) 10Herron: [C:03+1] thanos-query: write active queries to file [puppet] - 10https://gerrit.wikimedia.org/r/1110798 (https://phabricator.wikimedia.org/T383570) (owner: 10Filippo Giunchedi)
[15:37:45] <wikibugs>	 (03CR) 10Herron: [C:03+1] prometheus: k8s instances migration to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi)
[15:39:50] <wikibugs>	 10SRE-tools, 06Data-Persistence-Automations, 06DBA, 06Infrastructure-Foundations, and 2 others: spicerack mysql_legacy: support fetch metrics for instance - https://phabricator.wikimedia.org/T376596#10453800 (10ABran-WMF) a:05ABran-WMF→03None
[15:40:48] <wikibugs>	 (03PS1) 10Jelto: Rename mw241[6-9] to wikikube-worker22[08-11] [puppet] - 10https://gerrit.wikimedia.org/r/1110802 (https://phabricator.wikimedia.org/T377877)
[15:41:10] <icinga-wm>	 PROBLEM - SSH on bast4005 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:41:24] <wikibugs>	 06SRE, 10Observability-Metrics, 05Goal, 13Patch-Needs-Improvement: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870#10453812 (10fgiunchedi)
[15:41:54] <wikibugs>	 (03PS1) 10Marostegui: db2132: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1110803 (https://phabricator.wikimedia.org/T374623)
[15:42:02] <wikibugs>	 (03PS3) 10Federico Ceratto: Add myself (fceratto) to ops [puppet] - 10https://gerrit.wikimedia.org/r/1109716
[15:42:04] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Add myself (fceratto) to ops [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto)
[15:42:05] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] Add myself (fceratto) to ops [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto)
[15:42:10] <icinga-wm>	 RECOVERY - SSH on bast4005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:42:21] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2132: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1110803 (https://phabricator.wikimedia.org/T374623) (owner: 10Marostegui)
[15:42:22] <wikibugs>	 (03PS5) 10Tiziano Fogli: thanos-rule: search for gaps in thanos-rule recording rules [alerts] - 10https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756)
[15:43:19] <wikibugs>	 06SRE, 06Data-Platform-SRE, 10Observability-Metrics, 10superset.wikimedia.org: statsd and gunicorn metrics for superset - https://phabricator.wikimedia.org/T293761#10453823 (10fgiunchedi) 05Open→03Invalid superset has moved to k8s in the meantime, this task doesn't apply anymore
[15:43:52] <wikibugs>	 (03PS6) 10Tiziano Fogli: thanos-rule: search for gaps in thanos-rule recording rules [alerts] - 10https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756)
[15:44:15] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on dbprov2005.codfw.wmnet with reason: os upgrade
[15:44:32] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbprov2005.codfw.wmnet with reason: os upgrade
[15:44:59] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] Rename mw241[6-9] to wikikube-worker22[08-11] [puppet] - 10https://gerrit.wikimedia.org/r/1110802 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto)
[15:46:43] <wikibugs>	 (03CR) 10Herron: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1109680 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi)
[15:47:06] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2416-2419].codfw.wmnet
[15:47:14] <wikibugs>	 (03PS7) 10Tiziano Fogli: thanos-rule: search for gaps in thanos-rule recording rules [alerts] - 10https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756)
[15:48:34] <wikibugs>	 (03PS2) 10Dzahn: Revert "Add uz.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1109778 (https://phabricator.wikimedia.org/T382730)
[15:48:40] <wikibugs>	 (03PS3) 10Kamila Součková: kubernetes: reclaim eqiad videoscaler hosts [puppet] - 10https://gerrit.wikimedia.org/r/1109469 (https://phabricator.wikimedia.org/T354791)
[15:51:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es1041 as es4 eqiad master dbmaint T382569', diff saved to https://phabricator.wikimedia.org/P72014 and previous config saved to /var/cache/conftool/dbconfig/20250113-155135-marostegui.json
[15:51:39] <stashbot>	 T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569
[15:51:42] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Revert "Add uz.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1109778 (https://phabricator.wikimedia.org/T382730) (owner: 10Dzahn)
[15:51:46] <wikibugs>	 (03PS2) 10Ottomata: Remove some unused eventlogging references [puppet] - 10https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230)
[15:51:50] <wikibugs>	 (03CR) 10Ottomata: "oo nice catch." [puppet] - 10https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[15:51:51] <logmsgbot>	 !log dzahn@dns1006 START - running authdns-update
[15:51:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1020 T382569', diff saved to https://phabricator.wikimedia.org/P72015 and previous config saved to /var/cache/conftool/dbconfig/20250113-155153-marostegui.json
[15:52:07] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2416-2419].codfw.wmnet
[15:52:26] <mutante>	 !log DNS - removing uz.wikimedia.org - wiki was never created (T382730)
[15:52:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:29] <stashbot>	 T382730: Remove leftover DNS from declined chapter wikis causing language Wikipedia to resolve incorrectly on a *.wikimedia.org - https://phabricator.wikimedia.org/T382730
[15:52:46] <wikibugs>	 (03CR) 10Jelto: [C:03+2] Rename mw241[6-9] to wikikube-worker22[08-11] [puppet] - 10https://gerrit.wikimedia.org/r/1110802 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto)
[15:52:48] <wikibugs>	 (03CR) 10Ottomata: "I wasn't sure if I should remove the ones in e.g. modules/profile/files/sre/bullseye.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[15:53:08] <wikibugs>	 (03PS1) 10Marostegui: es1020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1110805 (https://phabricator.wikimedia.org/T383199)
[15:53:36] <logmsgbot>	 !log dzahn@dns1006 END - running authdns-update
[15:53:49] <mutante>	 !log DNS - removing uz.wikimedia.org - wiki was never created (T270987)
[15:53:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:54] <stashbot>	 T270987: Create a wiki for Wikimedians of the Uzbek language User Group - https://phabricator.wikimedia.org/T270987
[15:54:11] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] es1020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1110805 (https://phabricator.wikimedia.org/T383199) (owner: 10Marostegui)
[15:55:05] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2416 to wikikube-worker2208
[15:55:26] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[15:55:28] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:57:21] <wikibugs>	 06SRE, 10Ceph, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10453986 (10dcaro) >>! In T371501#10453489, @cmooney wrote: > @dcaro is there anything left to be done here?  I see traffic profiled in the lo...
[15:57:53] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: BGP status (instance cr2-eqord) - https://phabricator.wikimedia.org/T383302#10453997 (10cmooney) p:05Triage→03Low
[15:57:55] <wikibugs>	 (03PS1) 10Ottomata: logstash - remove legacy eventlogging related input and filters [puppet] - 10https://gerrit.wikimedia.org/r/1110807 (https://phabricator.wikimedia.org/T238230)
[15:57:55] <wikibugs>	 (03CR) 10Tiziano Fogli: "I also moved the new alert to a new file that is globally deployed (i.e., on Thanos)." [alerts] - 10https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756) (owner: 10Tiziano Fogli)
[15:58:12] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:58:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1076-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:58:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2416 to wikikube-worker2208 - jelto@cumin1002"
[15:59:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756) (owner: 10Tiziano Fogli)
[15:59:07] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2416 to wikikube-worker2208 - jelto@cumin1002"
[15:59:08] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:59:08] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2208
[15:59:46] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2208
[16:00:24] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2416 to wikikube-worker2208
[16:01:03] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2417 to wikikube-worker2209
[16:01:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[16:02:46] <wikibugs>	 (PS3) Ottomata: Remove some unused eventlogging references [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230)
[16:04:51] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2417 to wikikube-worker2209 - jelto@cumin1002"
[16:05:15] <wikibugs>	 (CR) Hnowlan: [C:+1] kubernetes: reclaim eqiad videoscaler hosts [puppet] - https://gerrit.wikimedia.org/r/1109469 (https://phabricator.wikimedia.org/T354791) (owner: Kamila Součková)
[16:05:17] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2417 to wikikube-worker2209 - jelto@cumin1002"
[16:05:18] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:05:18] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2209
[16:05:25] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2209
[16:05:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on mw2418:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:06:04] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2417 to wikikube-worker2209
[16:06:06] <wikibugs>	 sre-alert-triage, Infrastructure-Foundations: Alert in need of triage: Netbox report network (instance netbox1003) - https://phabricator.wikimedia.org/T383303#10454034 (cmooney) p:Triage→Medium a:cmooney Thanks for the task.  It's firing because the fasw switch interfaces are enabled but not...
[16:08:10] <wikibugs>	 (CR) Muehlenhoff: "These should stay" [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[16:08:58] <wikibugs>	 sre-alert-triage, Infrastructure-Foundations: Alert in need of triage: Netbox report network (instance netbox1003) - https://phabricator.wikimedia.org/T383303#10454052 (cmooney)
[16:08:59] <wikibugs>	 SRE, fundraising-tech-ops, Infrastructure-Foundations, netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802#10454053 (cmooney)
[16:09:03] <wikibugs>	 (CR) Muehlenhoff: "Looks good, only thing missing is the record in manifests/site.pp" [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[16:09:05] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2418 to wikikube-worker2210
[16:09:26] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[16:09:42] <wikibugs>	 (CR) Muehlenhoff: [C:+2] ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724 (https://phabricator.wikimedia.org/T309724) (owner: Muehlenhoff)
[16:09:45] <icinga-wm>	 PROBLEM - Host restbase2037 is DOWN: PING CRITICAL - Packet loss = 100%
[16:12:16] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:12:54] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2418 to wikikube-worker2210 - jelto@cumin1002"
[16:13:13] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2418 to wikikube-worker2210 - jelto@cumin1002"
[16:13:13] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:13:14] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2210
[16:13:19] <icinga-wm>	 RECOVERY - Host restbase2037 is UP: PING OK - Packet loss = 0%, RTA = 30.23 ms
[16:13:26] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2210
[16:13:38] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] thanos-rule: search for gaps in thanos-rule recording rules [alerts] - https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756) (owner: Tiziano Fogli)
[16:14:04] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2418 to wikikube-worker2210
[16:14:25] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2419 to wikikube-worker2211
[16:14:47] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[16:14:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:14:52] <wikibugs>	 (Merged) jenkins-bot: thanos-rule: search for gaps in thanos-rule recording rules [alerts] - https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756) (owner: Tiziano Fogli)
[16:18:44] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2419 to wikikube-worker2211 - jelto@cumin1002"
[16:19:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2419 to wikikube-worker2211 - jelto@cumin1002"
[16:19:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:19:02] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2211
[16:19:21] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:19:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2211
[16:19:38] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10454097 (MoritzMuehlenhoff)
[16:19:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:20:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2419 to wikikube-worker2211
[16:20:22] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2208.codfw.wmnet wikikube-worker2209.codfw.wmnet wikikube-worker2210.codfw.wmnet wikikube-worker2211.codfw.wmnet on all recursors
[16:20:25] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2208.codfw.wmnet wikikube-worker2209.codfw.wmnet wikikube-worker2210.codfw.wmnet wikikube-worker2211.codfw.wmnet on all recursors
[16:23:34] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2208.codfw.wmnet with OS bookworm
[16:23:45] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2208
[16:23:55] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[16:27:15] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2208 - jelto@cumin1002"
[16:27:19] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2208 - jelto@cumin1002"
[16:27:19] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:27:19] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2208.codfw.wmnet 63.32.192.10.in-addr.arpa 3.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:27:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2208.codfw.wmnet 63.32.192.10.in-addr.arpa 3.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
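Editor's note: the two `.arpa` names passed to sre.dns.wipe-cache above are the reverse-DNS (PTR) owner names of the host's IPv4 and IPv6 addresses. A minimal sketch of how such names are derived, using Python's standard `ipaddress` module. The two addresses are not stated in the log; they are inferred here by reversing the PTR names, and `ptr_name` is a hypothetical helper, not part of the cookbook.

```python
import ipaddress


def ptr_name(address: str) -> str:
    """Return the reverse-DNS (PTR) owner name for an IPv4 or IPv6 address."""
    return ipaddress.ip_address(address).reverse_pointer


# Addresses inferred from the PTR names in the log (an assumption, not logged).
print(ptr_name("10.192.32.63"))
print(ptr_name("2620:0:860:103:10:192:32:63"))
```

For IPv4 the octets are reversed under `in-addr.arpa`; for IPv6 every hex nibble of the fully expanded address is reversed under `ip6.arpa`, which is why the IPv6 name is so long.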
[16:27:23] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2208
[16:27:29] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] hieradata: Add cloud-private v6 supernets [puppet] - https://gerrit.wikimedia.org/r/1109983 (https://phabricator.wikimedia.org/T379283) (owner: Majavah)
[16:27:35] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2208
[16:27:35] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2208
[16:28:30] <wikibugs>	 (PS4) Ottomata: Remove some unused eventlogging references [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230)
[16:28:35] <wikibugs>	 (CR) Ottomata: "Oh! VM is removed.  Done." [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[16:29:12] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[16:30:05] <jouncebot>	 jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1630).
[16:30:05] <wikibugs>	 (CR) Ottomata: [C:+2] Remove some unused eventlogging references [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[16:30:31] <wikibugs>	 (PS1) Marostegui: db2232: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1110811
[16:30:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10454201 (phaultfinder)
[16:30:55] <wikibugs>	 (CR) Marostegui: [C:+2] db2232: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1110811 (owner: Marostegui)
[16:32:01] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2209.codfw.wmnet with OS bookworm
[16:32:12] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2209
[16:32:27] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[16:35:48] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2209 - jelto@cumin1002"
[16:35:52] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2209 - jelto@cumin1002"
[16:35:52] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:35:52] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2209.codfw.wmnet 64.32.192.10.in-addr.arpa 4.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:35:55] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2209.codfw.wmnet 64.32.192.10.in-addr.arpa 4.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:35:55] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2209
[16:35:57] <wikibugs>	 (PS3) JMeybohm: k8s::package: Install version specific kubernetes-client package [puppet] - https://gerrit.wikimedia.org/r/1109704 (https://phabricator.wikimedia.org/T341984)
[16:35:58] <wikibugs>	 (PS1) JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984)
[16:36:10] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2209
[16:36:10] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2209
[16:38:00] <wikibugs>	 (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[16:38:35] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2210.codfw.wmnet with OS bookworm
[16:38:46] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2210
[16:38:57] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[16:42:44] <wikibugs>	 ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383341#10454264 (Jhancock.wm) Open→Resolved
[16:43:11] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2210 - jelto@cumin1002"
[16:43:16] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2210 - jelto@cumin1002"
[16:43:16] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:43:16] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2210.codfw.wmnet 65.32.192.10.in-addr.arpa 5.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:43:19] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2210.codfw.wmnet 65.32.192.10.in-addr.arpa 5.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:43:19] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2210
[16:44:00] <wikibugs>	 (PS1) DLynch: Set Flow to read-only on phase 2a wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1110814 (https://phabricator.wikimedia.org/T378834)
[16:44:04] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2208.codfw.wmnet with reason: host reimage
[16:44:46] <wikibugs>	 (CR) DLynch: "This *doesn't* include cawiki and mediawikiwiki because they need further processing." [mediawiki-config] - https://gerrit.wikimedia.org/r/1110814 (https://phabricator.wikimedia.org/T378834) (owner: DLynch)
[16:45:02] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1110814 (https://phabricator.wikimedia.org/T378834) (owner: DLynch)
[16:45:44] <wikibugs>	 Puppet, SRE-swift-storage, SRE-tools, DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10454289 (elukey) Tried to copy the storcli64 binary to ms and presto nodes, these are the results:  ` elukey@ms...
[16:46:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2210
[16:46:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2210
[16:46:05] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access to analytics-privatedata for Katherine Graessle - https://phabricator.wikimedia.org/T383241#10454291 (Kgraessle) @Dzahn   I had a typo in my ~/.ssh/config, please disregard my last comment.   This is working and I am able to connect successfully. We can c...
[16:48:05] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2208.codfw.wmnet with reason: host reimage
[16:48:11] <wikibugs>	 Puppet, SRE-swift-storage, SRE-tools, DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10454316 (elukey)
[16:49:11] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access to analytics-privatedata for Katherine Graessle - https://phabricator.wikimedia.org/T383241#10454321 (Dzahn) Open→Resolved a:Dzahn @Kgraessle Perfect!  Great to hear it works and thanks for the update.
[16:51:00] <icinga-wm>	 PROBLEM - Host dbprov2005 is DOWN: PING CRITICAL - Packet loss = 100%
[16:51:28] <jynus>	 that's me, expired downtime
[16:51:31] <jynus>	 ignore
[16:51:46] <jynus>	 should be up soon
[16:51:48] <icinga-wm>	 RECOVERY - Host dbprov2005 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms
[16:52:47] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2209.codfw.wmnet with reason: host reimage
[16:53:55] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2211.codfw.wmnet with OS bookworm
[16:53:58] <cdanis>	 jouncebot: nowandnext
[16:53:58] <jouncebot>	 For the next 0 hour(s) and 6 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1630)
[16:53:58] <jouncebot>	 In 1 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1800)
[16:53:58] <jouncebot>	 In 1 hour(s) and 6 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1800)
[16:54:06] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2211
[16:54:42] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[16:56:10] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.puppet.renew-cert for dbprov2005.codfw.wmnet: Renew puppet certificate - root@cumin1002
[16:57:00] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2209.codfw.wmnet with reason: host reimage
[16:58:27] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2211 - jelto@cumin1002"
[16:58:32] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2211 - jelto@cumin1002"
[16:58:32] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:58:32] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2211.codfw.wmnet 66.32.192.10.in-addr.arpa 6.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:58:35] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2211.codfw.wmnet 66.32.192.10.in-addr.arpa 6.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:58:36] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2211
[16:58:39] <jinxer-wm>	 RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1076-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:58:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2211
[16:58:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2211
[16:59:06] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for dbprov2005.codfw.wmnet: Renew puppet certificate - root@cumin1002
[16:59:32] <wikibugs>	 (PS2) JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984)
[17:00:19] <wikibugs>	 (CR) CDanis: OpenTelemetry tracing to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1109754 (https://phabricator.wikimedia.org/T340552) (owner: CDanis)
[17:02:34] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1104732 (https://phabricator.wikimedia.org/T381379) (owner: Pppery)
[17:03:07] <wikibugs>	 (PS3) JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984)
[17:03:35] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2210.codfw.wmnet with reason: host reimage
[17:06:17] <wikibugs>	 (PS1) Marostegui: orchestrator.conf.json.erb: Update whitelist [puppet] - https://gerrit.wikimedia.org/r/1110819
[17:06:39] <wikibugs>	 (CR) Clément Goubert: [C:+1] OpenTelemetry tracing to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1109754 (https://phabricator.wikimedia.org/T340552) (owner: CDanis)
[17:06:43] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2210.codfw.wmnet with reason: host reimage
[17:07:50] <wikibugs>	 (PS1) Scott French: mediawiki: enable mesh telemetry in mercurius [deployment-charts] - https://gerrit.wikimedia.org/r/1110818
[17:08:43] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2208.codfw.wmnet with OS bookworm
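Editor's note: each cookbook run appears as a START line and a matching END line, so its wall-clock duration can be read off the two timestamps. A minimal sketch, assuming both lines fall on the same day or at most cross one midnight; `elapsed` is a hypothetical helper and the two sample lines are copied from the log with whitespace normalized.

```python
import re
from datetime import datetime, timedelta

# Matches the leading "[HH:MM:SS]" stamp used by every line in this log.
STAMP = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]")


def elapsed(start_line: str, end_line: str) -> timedelta:
    """Return the time between two log lines, based on their leading stamps."""
    def stamp(line: str) -> datetime:
        match = STAMP.match(line)
        if match is None:
            raise ValueError(f"no timestamp in line: {line!r}")
        return datetime.strptime(match.group(1), "%H:%M:%S")

    delta = stamp(end_line) - stamp(start_line)
    if delta < timedelta(0):
        # Assumption: the run crossed midnight exactly once.
        delta += timedelta(days=1)
    return delta


start = "[16:23:34] <logmsgbot> !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2208.codfw.wmnet with OS bookworm"
end = "[17:08:43] <logmsgbot> !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2208.codfw.wmnet with OS bookworm"
print(elapsed(start, end))  # prints 0:45:09
```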
[17:09:49] <cdanis>	 jouncebot: nowandnext
[17:09:49] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 50 minute(s)
[17:09:50] <jouncebot>	 In 0 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1800)
[17:09:50] <jouncebot>	 In 0 hour(s) and 50 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1800)
[17:09:56] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1109754 (https://phabricator.wikimedia.org/T340552) (owner: CDanis)
[17:10:11] <claime>	 let's get tracing
[17:10:44] <wikibugs>	 (Merged) jenkins-bot: OpenTelemetry tracing to all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1109754 (https://phabricator.wikimedia.org/T340552) (owner: CDanis)
[17:11:03] <logmsgbot>	 !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1109754|OpenTelemetry tracing to all wikis (T340552)]]
[17:11:07] <stashbot>	 T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552
[17:12:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:13:04] <wikibugs>	 (PS3) Dzahn: Revert "add za.wikimedia.org and za.m.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/1109777 (https://phabricator.wikimedia.org/T382730)
[17:14:12] <wikibugs>	 (CR) Dzahn: [C:+1] "manual rebase" [dns] - https://gerrit.wikimedia.org/r/1109777 (https://phabricator.wikimedia.org/T382730) (owner: Dzahn)
[17:14:17] <wikibugs>	 (CR) Dzahn: [C:+1] "recheck" [dns] - https://gerrit.wikimedia.org/r/1109777 (https://phabricator.wikimedia.org/T382730) (owner: Dzahn)
[17:15:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2211.codfw.wmnet with reason: host reimage
[17:15:38] <wikibugs>	 (PS4) JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984)
[17:15:51] <logmsgbot>	 !log cdanis@deploy2002 cdanis: Backport for [[gerrit:1109754|OpenTelemetry tracing to all wikis (T340552)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:15:51] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2209.codfw.wmnet with OS bookworm
[17:16:58] <logmsgbot>	 !log cdanis@deploy2002 cdanis: Continuing with sync
[17:16:59] <wikibugs>	 (CR) Dzahn: [C:+2] Revert "add za.wikimedia.org and za.m.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/1109777 (https://phabricator.wikimedia.org/T382730) (owner: Dzahn)
[17:17:16] <logmsgbot>	 !log dzahn@dns1006 START - running authdns-update
[17:18:18] <mutante>	 !log DNS - removing za.wikimedia.org and za.m.wikimedia.org - wiki was not created (T382730, T195926)
[17:18:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:24] <stashbot>	 T382730: Remove leftover DNS from declined chapter wikis causing language Wikipedia to resolve incorrectly on a *.wikimedia.org - https://phabricator.wikimedia.org/T382730
[17:18:24] <stashbot>	 T195926: Create wiki for Wikimedia South Africa - https://phabricator.wikimedia.org/T195926
[17:18:52] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2211.codfw.wmnet with reason: host reimage
[17:19:03] <logmsgbot>	 !log dzahn@dns1006 END - running authdns-update
[17:19:10] <wikibugs>	 (PS5) JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984)
[17:21:55] <wikibugs>	 SRE, DNS, Traffic, Patch-For-Review: Remove leftover DNS from declined chapter wikis causing language Wikipedia to resolve incorrectly on a *.wikimedia.org - https://phabricator.wikimedia.org/T382730#10454523 (Dzahn) In progress→Resolved @Dylsss Thanks for reporting this!  The 2 DNS r...
[17:24:01] <wikibugs>	 (PS6) JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984)
[17:24:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10454539 (phaultfinder)
[17:25:04] <logmsgbot>	 !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109754|OpenTelemetry tracing to all wikis (T340552)]] (duration: 14m 00s)
[17:25:08] <stashbot>	 T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552
[17:25:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:25:48] <wikibugs>	 SRE, DNS, Traffic, Patch-For-Review: Remove leftover DNS from declined chapter wikis causing language Wikipedia to resolve incorrectly on a *.wikimedia.org - https://phabricator.wikimedia.org/T382730#10454546 (Dylsss) Thanks for actioning!
[17:26:26] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2210.codfw.wmnet with OS bookworm
[17:28:34] <wikibugs>	 (CR) Dzahn: [C:+2] certificates: add wiki[m|p]edia.ro to ncredir Letsencrypt cert 7 [puppet] - https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) (owner: Dzahn)
[17:28:53] <wikibugs>	 ops-codfw, SRE, DC-Ops, Kubernetes: hw troubleshooting: Comm Error: backplane 0 for wikikube-worker2192.codfw.wmnet - https://phabricator.wikimedia.org/T383339#10454562 (Jhancock.wm) @Jelto reseated all the cables on the backplane. give it another go and let me know if it needs another look.
[17:29:14] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, MediaWiki-Platform-Team, MediaWiki-User-login-and-signup: Cannot log in or perform any actions on Beta Cluster wikis - https://phabricator.wikimedia.org/T383513#10454563 (matmarex)
[17:30:02] <wikibugs>	 (PS7) JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984)
[17:31:41] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, MediaWiki-Platform-Team, MediaWiki-User-login-and-signup: Cannot log in or perform any actions on Beta Cluster wikis - https://phabricator.wikimedia.org/T383513#10454584 (matmarex) Log for an example request: https://beta-logs.wmcloud.org/goto/a404432dceca139889a...
[17:32:56] <wikibugs>	 (CR) Dzahn: [C:+2] "ran puppet on acmechief2002 and it looked fine. it added to /etc/acme-chief/config.yaml and refreshed acme-chief service" [puppet] - https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) (owner: Dzahn)
[17:33:19] <vgutierrez>	 mutante: acme-chief runs on 1002 :)
[17:33:34] <vgutierrez>	 I triggered a puppet run there, and the certificate has been issued
[17:33:50] <mutante>	 vgutierrez: ack, I picked a random one from the output of "cumin acme*", just wanted to see one puppet run work after the merge
[17:34:00] <mutante>	 vgutierrez: thanks, ok:)
[17:34:49] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-5 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[17:34:55] <vgutierrez>	 uh?
[17:35:00] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[17:35:06] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-6 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[17:35:10] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-3 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[17:35:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-4 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[17:35:12] <icinga-wm>	 PROBLEM - HTTPS non-canonical-redirect-2 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[17:35:13] <sukhe>	 hmm
[17:35:22] <mutante>	 awwwr
[17:35:30] <vgutierrez>	 nginx crashed on ncredir1001
[17:35:34] <sukhe>	 Jan 13 17:32:23 ncredir1001 nginx[2048770]: 2025/01/13 17:32:23 [warn] 2048770#2048770: could not build optimal map_hash, you should increase either map_hash_max_size: 2048 or map_hash_bucket_size: 64; ignoring map_hash_bucket_size
[17:35:55] <vgutierrez>	 Jan 13 17:32:23 ncredir1001 nginx[2048770]: 2025/01/13 17:32:23 [emerg] 2048770#2048770: BIO_new_file("/etc/acmecerts/non-canonical-redirect-7/live/ec-prime256v1.ocsp") failed (SSL: error:80000002:system library::No such file or directory:calling fopen(/etc/acmecerts/non-canonical-redirect-7/live/ec-prime256v1.ocsp, rb)
[17:35:57] <mutante>	 soo. a new section was added
[17:36:02] <mutante>	 section 7
[17:36:04] <vgutierrez>	 bad timing?
[17:36:27] <vgutierrez>	 let me trigger a puppet run on ncredir1001
[17:36:40] <mutante>	 it can't handle more than 8 certs?
[17:36:45] <mutante>	 0 to 7 or something
[17:37:48] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-5 on ncredir1001 is OK: SSL OK - OCSP staple validity for wikimedia.is has 289931 seconds left:Certificate wikimedia.is valid until 2025-04-06 06:57:02 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/Ncredir
[17:37:54] <mutante>	 also looking at puppet, but ncredir1002
[17:37:55] <sukhe>	 vgutierrez: ^ puppet run?
[17:37:58] <vgutierrez>	 yes sukhe 
[17:38:00] <mutante>	 looks good
[17:38:02] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-1 on ncredir1001 is OK: SSL OK - OCSP staple validity for wikipedia.com has 237178 seconds left:Certificate wikipedia.com valid until 2025-03-30 22:53:54 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/Ncredir
[17:38:06] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-6 on ncredir1001 is OK: SSL OK - OCSP staple validity for wikipedia.fi has 554813 seconds left:Certificate wikipedia.fi valid until 2025-02-27 05:38:56 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/Ncredir
[17:38:10] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-3 on ncredir1001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 560869 seconds left:Certificate *.wikipedia.bg valid until 2025-02-07 03:20:41 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/Ncredir
[17:38:12] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-4 on ncredir1001 is OK: SSL OK - OCSP staple validity for www.wikispecies.net has 227807 seconds left:Certificate *.wikispecies.net valid until 2025-03-21 05:49:19 +0000 (expires in 66 days) https://wikitech.wikimedia.org/wiki/Ncredir
[17:38:12] <icinga-wm>	 RECOVERY - HTTPS non-canonical-redirect-2 on ncredir1001 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 456107 seconds left:Certificate *.wikimania.com valid until 2025-03-21 07:49:44 +0000 (expires in 66 days) https://wikitech.wikimedia.org/wiki/Ncredir
[17:38:17] <mutante>	 phew:)
[17:38:19] <sukhe>	 vgutierrez: nice thanks
[17:38:22] <vgutierrez>	 so nginx tried to configure non-canonical-redirect-7 before acme-chief deployed it there
[17:38:24] <vgutierrez>	 :]
[17:38:30] <sukhe>	 aaa
[17:38:38] <mutante>	 aah
[17:38:48] <vgutierrez>	 https://www.irccloud.com/pastebin/gr2fWV0O/
[17:38:53] <vgutierrez>	 pretty bad race condition
[17:38:56] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2123.codfw.wmnet - https://phabricator.wikimedia.org/T383388#10454634 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[17:39:15] <vgutierrez>	 it looks like puppet ran on ncredir1001 during the non-canonical-redirect-7 issuance process
[17:39:23] <vgutierrez>	 so it got the snakeoil cert rather than the good one
[17:39:25] <mutante>	 You know, on Friday afternoon I looked at this and was like "yea, no, don't merge on Fridays"
[17:39:30] <mutante>	 glad you were here too
[17:39:49] <mutante>	 but puppet run fixing it is cool of course
[17:40:36] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2211.codfw.wmnet with OS bookworm
[17:40:49] <mutante>	 does it have certs for wikipedia.ro and wikimedia.ro now
[17:41:30] <mutante>	 on ncredir1002 it did not crash on puppet run, *nod*
[17:42:49] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2126.codfw.wmnet - https://phabricator.wikimedia.org/T383395#10454666 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[17:44:16] <jelto>	 !log homer 'lsw1-c3-codfw*' commit 'T377877'
[17:44:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:26] <stashbot>	 T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877
[17:45:13] <jelto>	 !log sudo homer 'cr*codfw*' commit 'T377877'
[17:45:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:27] <mutante>	 vgutierrez: I checked with openssl too and I see the .ro names on the "live" file. all good :) ttyl
[17:46:24] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 128, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:46:48] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2208-2211].codfw.wmnet
[17:46:52] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2208-2211].codfw.wmnet
[17:47:37] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383595 (10Jelto) 03NEW
[17:48:15] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2135.codfw.wmnet - https://phabricator.wikimedia.org/T383426#10454740 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[17:51:42] <icinga-wm>	 RECOVERY - Disk space on analytics1075 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1075&var-datasource=eqiad+prometheus/ops
[17:52:44] <wikibugs>	 (03PS1) 10Jelto: Rename mw241[2-5] to wikikube-worker22[12-15] [puppet] - 10https://gerrit.wikimedia.org/r/1110822 (https://phabricator.wikimedia.org/T377877)
[17:52:52] <wikibugs>	 (03PS8) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984)
[17:53:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:53:56] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4788/co" [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1800)
[18:00:05] <jouncebot>	 ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1800).
[18:04:40] <wikibugs>	 (03CR) 10Btullis: [C:03+1] logstash - remove legacy eventlogging related input and filters [puppet] - 10https://gerrit.wikimedia.org/r/1110807 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[18:05:38] <wikibugs>	 (03PS1) 10Majavah: hieradata: Update striker-toolsbeta to 2025-01-13-165415-production [puppet] - 10https://gerrit.wikimedia.org/r/1110823
[18:06:18] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] Rename mw241[2-5] to wikikube-worker22[12-15] [puppet] - 10https://gerrit.wikimedia.org/r/1110822 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto)
[18:06:51] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Update striker-toolsbeta to 2025-01-13-165415-production [puppet] - 10https://gerrit.wikimedia.org/r/1110823 (owner: 10Majavah)
[18:07:21] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:07:57] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:08:55] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53369 bytes in 7.722 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:09:11] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:09:40] <wikibugs>	 (03CR) 10Andrea Denisse: "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1109680 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi)
[18:09:53] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] prometheus: add initial lv size to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1109680 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi)
[18:10:04] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: wdqs - wdqs-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[18:10:22] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10454957 (10Jhancock.wm) i updated the ticket with that info. it might be related. still working with Dell.
[18:10:23] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10454956 (10kamila)
[18:13:33] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] kubernetes: reclaim eqiad videoscaler hosts [puppet] - 10https://gerrit.wikimedia.org/r/1109469 (https://phabricator.wikimedia.org/T354791) (owner: 10Kamila Součková)
[18:14:43] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1445 to wikikube-worker1096
[18:14:49] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi)
[18:15:43] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[18:15:45] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1110798 (https://phabricator.wikimedia.org/T383570) (owner: 10Filippo Giunchedi)
[18:15:59] <wikibugs>	 (03PS1) 10CDanis: haproxy: bwlim-by-path: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799)
[18:16:01] <wikibugs>	 (03CR) 10Scott French: "Thanks in advance for the review, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110818 (owner: 10Scott French)
[18:16:26] <wikibugs>	 (03PS1) 10Dzahn: planet: remove smash.ro from Romanian feeds [puppet] - 10https://gerrit.wikimedia.org/r/1110827 (https://phabricator.wikimedia.org/T383580)
[18:16:30] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis)
[18:17:03] <wikibugs>	 06SRE, 06serviceops, 10WMF-General-or-Unknown: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078#10455000 (10andrea.denisse)
[18:17:14] <wikibugs>	 06SRE, 06serviceops, 10WMF-General-or-Unknown: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078#10455002 (10andrea.denisse)
[18:17:51] <wikibugs>	 (03PS2) 10CDanis: haproxy: bwlim-by-path: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799)
[18:17:55] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis)
[18:17:56] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "domain is clearly for sale - and update service crashed trying to parse the feed" [puppet] - 10https://gerrit.wikimedia.org/r/1110827 (https://phabricator.wikimedia.org/T383580) (owner: 10Dzahn)
[18:19:01] <mutante>	 kamila_: we have a merge conflict. my side is harmless. yours might be more tricky, host renames. I leave it to you when to merge both at once.
[18:19:36] <wikibugs>	 (03PS3) 10CDanis: haproxy: bwlim-by-path: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799)
[18:19:43] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis)
[18:19:45] <kamila_>	 mutante: doing it right now
[18:19:51] <mutante>	 kamila_: ack, thanks:)
[18:21:04] <kamila_>	 done
[18:21:18] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1446 to wikikube-worker1097
[18:21:27] <wikibugs>	 06SRE, 06serviceops, 10WMF-General-or-Unknown: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078#10455034 (10andrea.denisse) a:03Clement_Goubert Thanks Claime, I'm removing the o11y tag and assigning this to you as you currently have...
[18:21:38] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[18:21:56] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1445 to wikikube-worker1096 - kamila@cumin1002"
[18:22:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1445 to wikikube-worker1096 - kamila@cumin1002"
[18:22:16] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:22:16] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1096
[18:23:25] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1096
[18:23:34] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1445 to wikikube-worker1096
[18:23:54] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455039 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from mw1445 to wikikube-worker1096 completed: - mw1445 (**PASS**)   - ✔️ Downt...
[18:25:18] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1446 to wikikube-worker1097 - kamila@cumin1002"
[18:25:23] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1446 to wikikube-worker1097 - kamila@cumin1002"
[18:25:23] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:25:23] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1097
[18:26:31] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1097
[18:26:40] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1446 to wikikube-worker1097
[18:26:55] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1096.eqiad.wmnet wikikube-worker1097.eqiad.wmnet on all recursors
[18:26:59] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1096.eqiad.wmnet wikikube-worker1097.eqiad.wmnet on all recursors
[18:26:59] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455061 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from mw1446 to wikikube-worker1097 completed: - mw1446 (**PASS**)   - ✔️ Downt...
[18:27:48] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1096.eqiad.wmnet with OS bookworm
[18:27:52] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1096
[18:27:52] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1096
[18:27:59] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1097.eqiad.wmnet with OS bookworm
[18:28:02] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1097
[18:28:02] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1097
[18:28:03] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455070 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-worker1096.eqiad.wmnet with OS bookworm
[18:28:04] <wikibugs>	 (03PS4) 10CDanis: haproxy: bwlim-by-path: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799)
[18:28:14] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455071 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-worker1097.eqiad.wmnet with OS bookworm
[18:28:24] <wikibugs>	 (03PS5) 10CDanis: haproxy: bwlim-by-path: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799)
[18:28:26] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis)
[18:28:27] <wikibugs>	 06SRE, 10Domains, 06Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080#10455072 (10Dzahn) New acme-chief config has been deployed and ncredir* hosts now have a TLS cert for wikimedia.ro and wikipedia.ro.
[18:29:28] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis)
[18:30:52] <wikibugs>	 (03PS6) 10CDanis: haproxy: bwlim-by-path: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799)
[18:30:56] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis)
[18:33:15] <jinxer-wm>	 FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[18:41:16] <wikibugs>	 (03PS1) 10DCausse: search: update WDQS update lag SLI/SLO queries [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1110833
[18:42:49] <wikibugs>	 (03CR) 10DCausse: "Categories are now reporting their lag in prometheus and seems to leak into the series used by this SLI/SLO." [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1110833 (owner: 10DCausse)
[18:43:13] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1096.eqiad.wmnet with reason: host reimage
[18:43:20] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1097.eqiad.wmnet with reason: host reimage
[18:43:21] <wikibugs>	 (03PS3) 10Scott French: shellbox-syntaxhighlight: 1 eqiad replica on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087579 (https://phabricator.wikimedia.org/T377038)
[18:43:22] <wikibugs>	 (03PS3) 10Scott French: shellbox-syntaxhighlight: all eqiad replicas on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087580 (https://phabricator.wikimedia.org/T377038)
[18:43:24] <wikibugs>	 (03PS3) 10Scott French: shellbox-syntaxhighlight: 1 codfw replica on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087581 (https://phabricator.wikimedia.org/T377038)
[18:43:30] <wikibugs>	 (03PS3) 10Scott French: shellbox-syntaxhighlight: all replicas on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087582 (https://phabricator.wikimedia.org/T377038)
[18:44:59] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:45:23] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:47:06] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1096.eqiad.wmnet with reason: host reimage
[18:47:59] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53369 bytes in 5.907 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:48:14] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:49:05] <wikibugs>	 (03CR) 10CDanis: "pcc lgtm: https://puppet-compiler.wmflabs.org/output/1110826/5239/" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis)
[18:50:28] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1097.eqiad.wmnet with reason: host reimage
[18:52:35] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] codesearch: Remove obsolete apt pinning code for buster [puppet] - 10https://gerrit.wikimedia.org/r/1110767 (https://phabricator.wikimedia.org/T367479) (owner: 10Muehlenhoff)
[18:53:15] <jinxer-wm>	 RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[19:01:38] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "noop on codesearch9.codesearch" [puppet] - 10https://gerrit.wikimedia.org/r/1110767 (https://phabricator.wikimedia.org/T367479) (owner: 10Muehlenhoff)
[19:04:33] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1096.eqiad.wmnet with OS bookworm
[19:04:53] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455197 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-worker1096.eqiad.wmnet with OS bookworm completed: - wikiku...
[19:08:33] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Thanks for rolling this out everywhere!" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis)
[19:08:55] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1097.eqiad.wmnet with OS bookworm
[19:09:09] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455203 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-worker1097.eqiad.wmnet with OS bookworm completed: - wikiku...
[19:13:24] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:27:05] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] "Can deploy later today" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1110833 (owner: 10DCausse)
[19:28:24] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:44:13] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485#10455426 (10VRiley-WMF) @Marostegui I have replaced the cable, could you check this?
[19:46:23] <wikibugs>	 06SRE, 10Beta-Cluster-Infrastructure, 06MediaWiki-Platform-Team, 10MediaWiki-User-login-and-signup: Cannot log in or perform any actions on Beta Cluster wikis - https://phabricator.wikimedia.org/T383513#10455430 (10Tgr) >>! In T383513#10453786, @matmarex wrote: > Beta cluster Logstash data says that object...
[19:56:41] <cdanis>	 jouncebot: nowandnext
[19:56:42] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 3 minute(s)
[19:56:42] <jouncebot>	 In 1 hour(s) and 3 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T2100)
[19:59:18] <wikibugs>	 (03CR) 10CDanis: [C:03+2] haproxy: bwlim-by-path: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis)
[20:00:23] <wikibugs>	 06SRE, 10Beta-Cluster-Infrastructure, 06MediaWiki-Platform-Team, 10MediaWiki-User-login-and-signup: Cannot log in or perform any actions on Beta Cluster wikis - https://phabricator.wikimedia.org/T383513#10455480 (10Tgr) systemctl says ` Jan 11 08:09:58 deployment-sessionstore06 systemd[1]: cassandra.servic...
[20:02:48] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874#10455483 (10VRiley-WMF) 05In progress→03Resolved a:03VRiley-WMF I had to take the server down in order to replace the drive. I will move forward with closing the ticket.
[20:02:49] <kamila_>	 !log homer cr*eqiad* commit T354791
[20:05:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1096-1097].eqiad.wmnet
[20:05:35] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1096-1097].eqiad.wmnet
[20:05:52] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455510 (10ops-monitoring-bot) pool host wikikube-worker[1096-1097].eqiad.wmnet by kamila@cumin1002 with reason: None
[20:05:56] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455511 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by kamila@cumin1002 pool for host wikikube-worker[1096-1097].eqiad.wmnet completed: - wiki...
[20:07:33] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620 (10kamila) 03NEW
[20:10:02] <wikibugs>	 (03CR) 10Mforns: "I do not fully understand what each line does, but I get this is related to the addition of the file and filerevision tables that we talke" [puppet] - 10https://gerrit.wikimedia.org/r/1110046 (https://phabricator.wikimedia.org/T383491) (owner: 10Ladsgroup)
[20:10:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T383076#10455533 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Duplicate
[20:13:13] <wikibugs>	 06SRE, 10Observability-Metrics, 10superset.wikimedia.org, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): statsd and gunicorn metrics for superset - https://phabricator.wikimedia.org/T293761#10455539 (10Gehel)
[20:17:08] <wikibugs>	 06SRE, 10Beta-Cluster-Infrastructure, 06MediaWiki-Platform-Team, 10MediaWiki-User-login-and-signup: Cannot log in or perform any actions on Beta Cluster wikis - https://phabricator.wikimedia.org/T383513#10455563 (10Tgr) 05Open→03Resolved a:03Tgr Optimistically closing, maybe Cassandra just needs...
[20:18:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10455570 (10VRiley-WMF)
[20:18:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T383475#10455572 (10VRiley-WMF) →14Duplicate dup:03T382984
[20:20:13] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] "No new traffic since Jan 7:  https://grafana.wikimedia.org/goto/QUy5p-DHR?orgId=1" [puppet] - 10https://gerrit.wikimedia.org/r/1110807 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[20:24:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10455622 (10phaultfinder)
[20:29:03] <wikibugs>	 (03PS1) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843
[20:30:14] <wikibugs>	 (03PS1) 10Ottomata: logstash - remove absented input [puppet] - 10https://gerrit.wikimedia.org/r/1110844 (https://phabricator.wikimedia.org/T238230)
[20:30:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins)
[20:31:41] <wikibugs>	 (03PS1) 10Ottomata: admin - ensure unused eventlogging groups are absent [puppet] - 10https://gerrit.wikimedia.org/r/1110845 (https://phabricator.wikimedia.org/T238230)
[20:32:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin - ensure unused eventlogging groups are absent [puppet] - 10https://gerrit.wikimedia.org/r/1110845 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[20:32:29] <wikibugs>	 (03CR) 10Ottomata: "There are a couple of places where e.g. eventlogging-admins is included (webperf)." [puppet] - 10https://gerrit.wikimedia.org/r/1110845 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[20:39:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10455755 (10phaultfinder)
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T2100). nyaa~
[21:00:05] <jouncebot>	 kemayo and Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:09] <Pppery>	 here
[21:00:14] <Kemayo>	 Also here
[21:05:15] <Kemayo>	 cdanis: Anyone around to do the deployment?
[21:07:53] <cdanis>	 Kemayo: technically not an SRE responsibility 😅 but I'll help out
[21:08:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110814 (https://phabricator.wikimedia.org/T378834) (owner: 10DLynch)
[21:08:54] <Kemayo>	 I'll admit that I might tend to lump SRE and Releng into the same bucket in my head. >_>
[21:09:16] <wikibugs>	 (03Merged) 10jenkins-bot: Set Flow to read-only on phase 2a wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110814 (https://phabricator.wikimedia.org/T378834) (owner: 10DLynch)
[21:09:36] <logmsgbot>	 !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1110814|Set Flow to read-only on phase 2a wikis (T378834)]]
[21:09:40] <stashbot>	 T378834: [Config] Set Flow to read-only at all *Phase 2a* wikis - https://phabricator.wikimedia.org/T378834
[21:11:33] <cdanis>	 Kemayo: would having deploy access be useful to you, btw?
[21:12:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:12:27] <cdanis>	 also Kemayo your patch is on k8s-mwdebug
[21:12:41] <Kemayo>	 cdanis: It might occasionally. I'm the most common person to be doing backports for Editing, but historically it's gone okay just fitting them into the existing windows.
[21:13:02] <Kemayo>	 cdanis: Looks good, go ahead and continue.
[21:14:30] <logmsgbot>	 !log cdanis@deploy2002 kemayo, cdanis: Backport for [[gerrit:1110814|Set Flow to read-only on phase 2a wikis (T378834)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:14:33] <logmsgbot>	 !log cdanis@deploy2002 kemayo, cdanis: Continuing with sync
[21:15:29] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4789/co" [puppet] - 10https://gerrit.wikimedia.org/r/1110857 (https://phabricator.wikimedia.org/T383599) (owner: 10BCornwall)
[21:16:40] <Pppery>	 Feel free to skip "update the interwiki cache" when you get to my entries if you don't feel comfortable doing it - the process is a bit complicated and it will happen by itself in a week or two anyway - just figured I could save some trouble since I was attending a backport window anyway
[21:17:17] <cdanis>	 Pppery: thanks, I'm short on time and pinch-hitting so I'll skip that :)
[21:17:30] <wikibugs>	 (03CR) 10CDanis: [C:03+2] Configure new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104732 (https://phabricator.wikimedia.org/T381379) (owner: 10Pppery)
[21:17:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518#10455925 (10VRiley-WMF) 05Open→03In progress
[21:17:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518#10455927 (10VRiley-WMF) Rebooting Now
[21:18:49] <wikibugs>	 (03Merged) 10jenkins-bot: Configure new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104732 (https://phabricator.wikimedia.org/T381379) (owner: 10Pppery)
[21:21:33] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1110857 (https://phabricator.wikimedia.org/T383599) (owner: 10BCornwall)
[21:22:18] <cdanis>	 Kemayo: could these maintenance script errors have anything to do with your patch? https://logstash.wikimedia.org/goto/063ead7eb8f71773424aa37d54c6840f
[21:23:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518#10455949 (10VRiley-WMF) This has been rebooted  @cmooney would you be able to check this when you have a chance?
[21:23:25] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:23:39] <logmsgbot>	 !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1110814|Set Flow to read-only on phase 2a wikis (T378834)]] (duration: 14m 02s)
[21:23:42] <stashbot>	 T378834: [Config] Set Flow to read-only at all *Phase 2a* wikis - https://phabricator.wikimedia.org/T378834
[21:23:57] <logmsgbot>	 !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1104732|Configure new wikis (T381379 T381080 T378463)]]
[21:24:06] <stashbot>	 T381379: Post-creation work for tigwiki - https://phabricator.wikimedia.org/T381379
[21:24:06] <stashbot>	 T381080: Post-creation work for idwikivoyage - https://phabricator.wikimedia.org/T381080
[21:24:06] <stashbot>	 T378463: Post-creation work for tcywiktionary - https://phabricator.wikimedia.org/T378463
[21:24:34] <Kemayo>	 cdanis: I don't see how they possibly could.
[21:24:43] <cdanis>	 cool
[21:25:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:25:36] <cdanis>	 Pppery: is your patch one that it makes sense to check on the testservers?
[21:25:40] <Pppery>	 yes
[21:26:37] <cdanis>	 Pppery: okay, your patch should ~now be live on k8s-mwdebug
[21:26:41] <Pppery>	 looking
[21:28:52] <Pppery>	 Well, I missed one of the settings I was supposed to change, but the patch doesn't break anything, so it's still safe to sync
[21:28:57] <logmsgbot>	 !log cdanis@deploy2002 cdanis, pppery: Backport for [[gerrit:1104732|Configure new wikis (T381379 T381080 T378463)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:28:57] <cdanis>	 👍
[21:29:04] <logmsgbot>	 !log cdanis@deploy2002 cdanis, pppery: Continuing with sync
[21:30:29] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:31:39] <wikibugs>	 (03PS1) 10Pppery: Add missing parsoid settings for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110860 (https://phabricator.wikimedia.org/T381379)
[21:32:38] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2069-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:34:21] <wikibugs>	 (03CR) 10CDanis: [C:03+2] Add missing parsoid settings for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110860 (https://phabricator.wikimedia.org/T381379) (owner: 10Pppery)
[21:35:03] <wikibugs>	 (03Merged) 10jenkins-bot: Add missing parsoid settings for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110860 (https://phabricator.wikimedia.org/T381379) (owner: 10Pppery)
[21:35:05] <Pppery>	 Thanks for deploying the follow-up too!
[21:37:21] <logmsgbot>	 !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1104732|Configure new wikis (T381379 T381080 T378463)]] (duration: 13m 23s)
[21:37:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110860 (https://phabricator.wikimedia.org/T381379) (owner: 10Pppery)
[21:37:26] <stashbot>	 T381379: Post-creation work for tigwiki - https://phabricator.wikimedia.org/T381379
[21:37:26] <stashbot>	 T381080: Post-creation work for idwikivoyage - https://phabricator.wikimedia.org/T381080
[21:37:27] <stashbot>	 T378463: Post-creation work for tcywiktionary - https://phabricator.wikimedia.org/T378463
[21:37:40] <logmsgbot>	 !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1110860|Add missing parsoid settings for new wikis (T381379 T381080 T378463)]]
[21:37:42] <wikibugs>	 (03PS1) 10Bking: cloudelastic: remove cloudelastic100[56] from conftool, add 101[12] [puppet] - 10https://gerrit.wikimedia.org/r/1110862 (https://phabricator.wikimedia.org/T378368)
[21:38:07] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110862 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking)
[21:40:41] <cdanis>	 Pppery: k8s testservers ready :)
[21:41:43] <Pppery>	 Looks good
[21:42:46] <logmsgbot>	 !log cdanis@deploy2002 cdanis, pppery: Backport for [[gerrit:1110860|Add missing parsoid settings for new wikis (T381379 T381080 T378463)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:42:49] <logmsgbot>	 !log cdanis@deploy2002 cdanis, pppery: Continuing with sync
[21:42:52] <stashbot>	 T381379: Post-creation work for tigwiki - https://phabricator.wikimedia.org/T381379
[21:42:52] <stashbot>	 T381080: Post-creation work for idwikivoyage - https://phabricator.wikimedia.org/T381080
[21:42:53] <stashbot>	 T378463: Post-creation work for tcywiktionary - https://phabricator.wikimedia.org/T378463
[21:49:43] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash - remove absented input [puppet] - 10https://gerrit.wikimedia.org/r/1110844 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[21:50:53] <logmsgbot>	 !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1110860|Add missing parsoid settings for new wikis (T381379 T381080 T378463)]] (duration: 13m 12s)
[21:50:59] <stashbot>	 T381379: Post-creation work for tigwiki - https://phabricator.wikimedia.org/T381379
[21:51:00] <stashbot>	 T381080: Post-creation work for idwikivoyage - https://phabricator.wikimedia.org/T381080
[21:51:00] <stashbot>	 T378463: Post-creation work for tcywiktionary - https://phabricator.wikimedia.org/T378463
[21:51:06] <Pppery>	 Thanks, and sorry for causing you so much trouble.
[21:53:00] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=no; selector: service=(cloudelastic-chi-ssl|cloudelastic-psi-ssl|cloudelastic-omega-ssl|cloudelastic-chi-ssl-public|cloudelastic-psi-ssl-public|cloudelastic-omega-ssl-public),name=cloudelastic1005.eqiad.wmnet
[21:53:32] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=no; selector: service=(cloudelastic-chi-ssl|cloudelastic-psi-ssl|cloudelastic-omega-ssl|cloudelastic-chi-ssl-public|cloudelastic-psi-ssl-public|cloudelastic-omega-ssl-public),name=cloudelastic1006.eqiad.wmnet
[21:59:10] <wikibugs>	 06SRE, 10observability, 10Observability-Logging, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q2): ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710#10456171 (10andrea.denisse) p:05Triage→03Medium
[21:59:34] <wikibugs>	 (03PS1) 10Pppery: Add simplewiki to mobile-anon-talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110866 (https://phabricator.wikimedia.org/T383161)
[22:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: That opportune time for a Weekly Security deployment window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T2200).
[22:02:44] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] "Patch looks good and has followed site request process! Go ahead and deploy as needed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110866 (https://phabricator.wikimedia.org/T383161) (owner: 10Pppery)
[22:07:28] <wikibugs>	 (03PS1) 10CDanis: add kemayo to deployers [puppet] - 10https://gerrit.wikimedia.org/r/1110867
[22:09:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10456188 (10phaultfinder)
[22:10:04] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: wdqs - wdqs-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[22:21:11] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] ncredir: Reload instead of restart [puppet] - 10https://gerrit.wikimedia.org/r/1110857 (https://phabricator.wikimedia.org/T383599) (owner: 10BCornwall)
[22:22:58] <wikibugs>	 (03PS2) 10Bking: cloudelastic: remove cloudelastic100[56] from conftool, add 101[12] [puppet] - 10https://gerrit.wikimedia.org/r/1110862 (https://phabricator.wikimedia.org/T380937)
[22:23:25] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:24:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10456229 (10phaultfinder)
[22:25:29] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:26:14] <logmsgbot>	 !log cwhite@deploy2002 Started deploy [statsv/statsv@42a4331]: T382729
[22:26:18] <stashbot>	 T382729: statsv: track metric types handled - https://phabricator.wikimedia.org/T382729
[22:26:23] <logmsgbot>	 !log cwhite@deploy2002 Finished deploy [statsv/statsv@42a4331]: T382729 (duration: 00m 08s)
[22:26:42] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, 10Observability-Alerting: Alertmanager rule for network interface errors? - https://phabricator.wikimedia.org/T335350#10456238 (10andrea.denisse) Hi @cmooney,  I noticed that patch 915489 has been merged. Do you know if there’s any remaining...
[22:36:30] <wikibugs>	 (03PS1) 10JHathaway: postfix: increase message size limit from 10MiB to 50MiB [puppet] - 10https://gerrit.wikimedia.org/r/1110873 (https://phabricator.wikimedia.org/T383271)
[22:36:46] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110873 (https://phabricator.wikimedia.org/T383271) (owner: 10JHathaway)
[22:37:50] <wikibugs>	 (03PS2) 10Andrea Denisse: profile::mediawiki::common: Remove obsolete DSH group check [puppet] - 10https://gerrit.wikimedia.org/r/1110872 (https://phabricator.wikimedia.org/T370527)
[22:50:37] <wikibugs>	 06SRE, 10Domains, 06Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080#10456325 (10BCornwall) 05In progress→03Resolved This is all done now. Thanks all!
[22:51:05] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:51:55] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:55:59] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:27:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10456415 (10phaultfinder)
[23:41:17] <wikibugs>	 (03PS1) 10Scott French: P:conftool: allow the parsercache section flavor [puppet] - 10https://gerrit.wikimedia.org/r/1110880 (https://phabricator.wikimedia.org/T383324)
[23:42:17] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review, Chris!" [puppet] - 10https://gerrit.wikimedia.org/r/1110880 (https://phabricator.wikimedia.org/T383324) (owner: 10Scott French)
[23:46:43] <wikibugs>	 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T383638 (10phaultfinder) 03NEW
[23:46:59] <wikibugs>	 (03PS1) 10Btullis: airflow: Allow specific task pods to access the kube-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110883 (https://phabricator.wikimedia.org/T383430)