[00:05:18] <jinxer-wm>	 FIRING: ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#doc1003.eqiad.wmnet:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:24:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10431216 (10phaultfinder)
[00:25:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:19] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1108204
[00:38:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1108204 (owner: 10TrainBranchBot)
[00:47:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:55:25] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1108204 (owner: 10TrainBranchBot)
[01:08:30] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1108205
[01:08:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1108205 (owner: 10TrainBranchBot)
[01:10:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10431221 (10phaultfinder)
[01:28:51] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1108205 (owner: 10TrainBranchBot)
[01:39:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10431225 (10phaultfinder)
[01:46:14] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/16c5b546293da1ed2c2ef67102132ed14beb602f4822313386be962e884bf289/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:06:14] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:11:43] <jinxer-wm>	 FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10431249 (10phaultfinder)
[03:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:43] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[03:09:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[03:11:27] <wikibugs>	 10SRE-swift-storage, 10MW-on-K8s, 06serviceops, 10Shellbox, and 3 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322#10431257 (10tstarling) I confirmed that this is working on testwiki.
[03:13:19] <wikibugs>	 (03PS3) 10Tim Starling: Enable canShellboxGetTempUrl everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101239 (https://phabricator.wikimedia.org/T292322)
[03:14:20] <wikibugs>	 (03CR) 10Tim Starling: [C:03+2] Enable canShellboxGetTempUrl everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101239 (https://phabricator.wikimedia.org/T292322) (owner: 10Tim Starling)
[03:15:02] <wikibugs>	 (03Merged) 10jenkins-bot: Enable canShellboxGetTempUrl everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101239 (https://phabricator.wikimedia.org/T292322) (owner: 10Tim Starling)
[03:17:54] <logmsgbot>	 !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1101239|Enable canShellboxGetTempUrl everywhere (T292322)]]
[03:17:57] <stashbot>	 T292322: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322
[03:28:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:31:01] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1101239|Enable canShellboxGetTempUrl everywhere (T292322)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[03:31:04] <stashbot>	 T292322: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322
[03:31:21] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Continuing with sync
[03:39:15] <jinxer-wm>	 FIRING: HttpdUnreachable: httpd unavailable for deployment mw-wikifunctions at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable
[03:39:47] <logmsgbot>	 !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101239|Enable canShellboxGetTempUrl everywhere (T292322)]] (duration: 21m 53s)
[03:39:49] <stashbot>	 T292322: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322
[03:44:15] <jinxer-wm>	 RESOLVED: HttpdUnreachable: httpd unavailable for deployment mw-wikifunctions at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable
[04:05:18] <jinxer-wm>	 FIRING: ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#doc1003.eqiad.wmnet:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:25:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:47:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:06:30] <wikibugs>	 10SRE-swift-storage, 10MW-on-K8s, 06serviceops, 10Shellbox, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322#10431315 (10tstarling) 05Open→03Resolved I did a new benchmark with a method following T292322#10402444. The time taken from the job start to ffmpeg...
[05:17:35] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107557 (https://phabricator.wikimedia.org/T382777) (owner: 10Anzx)
[05:46:34] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1193: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1108208
[05:47:48] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1193: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1108208 (owner: 10Marostegui)
[05:50:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 10%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P71785 and previous config saved to /var/cache/conftool/dbconfig/20250106-055029-root.json
[05:55:50] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove es2021 [puppet] - 10https://gerrit.wikimedia.org/r/1108209 (https://phabricator.wikimedia.org/T382944)
[05:56:21] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove es2021 [puppet] - 10https://gerrit.wikimedia.org/r/1108209 (https://phabricator.wikimedia.org/T382944) (owner: 10Marostegui)
[05:57:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es2021 from dbctl T382944', diff saved to https://phabricator.wikimedia.org/P71786 and previous config saved to /var/cache/conftool/dbconfig/20250106-055726-marostegui.json
[05:57:30] <stashbot>	 T382944: decommission es2021.codfw.wmnet - https://phabricator.wikimedia.org/T382944
[05:59:29] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Remove es2021 [puppet] - 10https://gerrit.wikimedia.org/r/1108210 (https://phabricator.wikimedia.org/T382944)
[06:00:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es2021.codfw.wmnet
[06:02:35] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Remove es2021 [puppet] - 10https://gerrit.wikimedia.org/r/1108210 (https://phabricator.wikimedia.org/T382944) (owner: 10Marostegui)
[06:04:49] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[06:05:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 25%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P71787 and previous config saved to /var/cache/conftool/dbconfig/20250106-060534-root.json
[06:08:08] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2021.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:08:24] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2021.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:08:24] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:08:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2021.codfw.wmnet
[06:08:48] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2021.codfw.wmnet - https://phabricator.wikimedia.org/T382944#10431338 (10Marostegui)
[06:09:10] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2021.codfw.wmnet - https://phabricator.wikimedia.org/T382944#10431341 (10Marostegui) a:05Marostegui→03None Ready for #dc-ops
[06:11:43] <jinxer-wm>	 FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:12:43] <wikibugs>	 (03PS1) 10Marostegui: backup2002.cnf.erb: Replace es2022 with es2043 [puppet] - 10https://gerrit.wikimedia.org/r/1108309 (https://phabricator.wikimedia.org/T381259)
[06:14:45] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] backup2002.cnf.erb: Replace es2022 with es2043 [puppet] - 10https://gerrit.wikimedia.org/r/1108309 (https://phabricator.wikimedia.org/T381259) (owner: 10Marostegui)
[06:16:50] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove es2020 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1108310 (https://phabricator.wikimedia.org/T382945)
[06:17:31] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove es2020 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1108310 (https://phabricator.wikimedia.org/T382945) (owner: 10Marostegui)
[06:18:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es2020 from dbctl T382945', diff saved to https://phabricator.wikimedia.org/P71789 and previous config saved to /var/cache/conftool/dbconfig/20250106-061832-marostegui.json
[06:18:36] <stashbot>	 T382945: decommission es2020.codfw.wmnet - https://phabricator.wikimedia.org/T382945
[06:19:10] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es2020.codfw.wmnet
[06:20:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 50%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P71790 and previous config saved to /var/cache/conftool/dbconfig/20250106-062040-root.json
[06:20:59] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Remove es2020 [puppet] - 10https://gerrit.wikimedia.org/r/1108311 (https://phabricator.wikimedia.org/T382945)
[06:23:49] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[06:27:13] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2020.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:27:28] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2020.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:27:28] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:27:29] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2020.codfw.wmnet
[06:27:33] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Remove es2020 [puppet] - 10https://gerrit.wikimedia.org/r/1108311 (https://phabricator.wikimedia.org/T382945) (owner: 10Marostegui)
[06:28:14] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2020.codfw.wmnet - https://phabricator.wikimedia.org/T382945#10431362 (10Marostegui) a:05Marostegui→03None
[06:28:35] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2020.codfw.wmnet - https://phabricator.wikimedia.org/T382945#10431367 (10Marostegui) This is ready for #dc-ops
[06:30:41] <wikibugs>	 (03PS1) 10Marostegui: backup2002.cnf.er: Replace es2025 with es2046 [puppet] - 10https://gerrit.wikimedia.org/r/1108312 (https://phabricator.wikimedia.org/T381259)
[06:31:09] <wikibugs>	 (03PS2) 10Marostegui: backup2002.cnf.erb: Replace es2025 with es2046 [puppet] - 10https://gerrit.wikimedia.org/r/1108312 (https://phabricator.wikimedia.org/T381259)
[06:33:33] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] backup2002.cnf.erb: Replace es2025 with es2046 [puppet] - 10https://gerrit.wikimedia.org/r/1108312 (https://phabricator.wikimedia.org/T381259) (owner: 10Marostegui)
[06:35:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 75%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P71791 and previous config saved to /var/cache/conftool/dbconfig/20250106-063545-root.json
[06:37:43] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove es2022 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1108313 (https://phabricator.wikimedia.org/T382946)
[06:38:29] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove es2022 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1108313 (https://phabricator.wikimedia.org/T382946) (owner: 10Marostegui)
[06:39:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es2022 from dbctl T382946', diff saved to https://phabricator.wikimedia.org/P71792 and previous config saved to /var/cache/conftool/dbconfig/20250106-063940-marostegui.json
[06:39:43] <stashbot>	 T382946: decommission es2022.codfw.wmnet - https://phabricator.wikimedia.org/T382946
[06:45:47] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Remove es2022 [puppet] - 10https://gerrit.wikimedia.org/r/1108314 (https://phabricator.wikimedia.org/T382946)
[06:46:27] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es2022.codfw.wmnet
[06:50:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 100%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P71794 and previous config saved to /var/cache/conftool/dbconfig/20250106-065050-root.json
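Editor's note: the db1193 dbctl commits above step the host back into rotation at 10%, 25%, 50%, 75% and 100%. The sketch below only re-reads those five log entries and measures the gap between steps. It is not the repooling tool itself; the trimmed `LOG` text and the helper names `repool_steps` and `intervals_seconds` are made up for illustration.

```python
import re
from datetime import datetime

# Timestamps and percentages copied from the db1193 dbctl commit lines above;
# everything else on those lines is trimmed for brevity.
LOG = """\
[05:50:29] db1193 (re)pooling @ 10%
[06:05:35] db1193 (re)pooling @ 25%
[06:20:40] db1193 (re)pooling @ 50%
[06:35:45] db1193 (re)pooling @ 75%
[06:50:51] db1193 (re)pooling @ 100%
"""

STEP = re.compile(r"\[(\d\d:\d\d:\d\d)\] (\S+) \(re\)pooling @ (\d+)%")


def repool_steps(text):
    """Return (time, host, percentage) tuples in log order."""
    return [
        (datetime.strptime(m.group(1), "%H:%M:%S"), m.group(2), int(m.group(3)))
        for m in STEP.finditer(text)
    ]


def intervals_seconds(steps):
    """Seconds elapsed between each pair of consecutive steps."""
    return [
        int((later[0] - earlier[0]).total_seconds())
        for earlier, later in zip(steps, steps[1:])
    ]


steps = repool_steps(LOG)
print([pct for _, _, pct in steps])
print(intervals_seconds(steps))
```

On these entries the gaps come out at roughly 15 minutes each, which suggests a fixed wait between steps; the log does not say how that wait is configured.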
[06:51:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[06:53:57] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Remove es2022 [puppet] - 10https://gerrit.wikimedia.org/r/1108314 (https://phabricator.wikimedia.org/T382946) (owner: 10Marostegui)
[06:54:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2022.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:54:49] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2022.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:54:49] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:54:50] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2022.codfw.wmnet
[06:55:04] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2022.codfw.wmnet - https://phabricator.wikimedia.org/T382946#10431388 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1002 for hosts: `es2022.codfw.wmnet` - es2022.codfw.wmnet (**PASS**)   - Downtimed...
[06:55:05] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2022.codfw.wmnet - https://phabricator.wikimedia.org/T382946#10431389 (10Marostegui) a:05Marostegui→03None
[06:55:15] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2022.codfw.wmnet - https://phabricator.wikimedia.org/T382946#10431394 (10Marostegui) Ready for #dc-ops
[06:55:24] <wikibugs>	 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2022.codfw.wmnet - https://phabricator.wikimedia.org/T382946#10431395 (10Marostegui)
[07:04:02] <wikibugs>	 (03PS1) 10Marostegui: dbproxy1028: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108315 (https://phabricator.wikimedia.org/T368874)
[07:04:50] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Switchover m3-master [dns] - 10https://gerrit.wikimedia.org/r/1108316 (https://phabricator.wikimedia.org/T368874)
[07:09:25] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] "All green in Icinga." [puppet] - 10https://gerrit.wikimedia.org/r/1108315 (https://phabricator.wikimedia.org/T368874) (owner: 10Marostegui)
[07:09:43] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[07:09:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[07:13:22] <marostegui>	 !log dbmaint Switchover m3 (phabricator) eqiad master dbproxy1020 -> dbproxy1028 T368874
[07:13:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:24] <stashbot>	 T368874: Productionize dbproxy102[89] - https://phabricator.wikimedia.org/T368874
[07:13:38] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] wmnet: Switchover m3-master [dns] - 10https://gerrit.wikimedia.org/r/1108316 (https://phabricator.wikimedia.org/T368874) (owner: 10Marostegui)
[07:28:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:29:02] <moritzm>	 !log installing systemd bugfix updates from Bookworm point release
[07:29:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:32] <wikibugs>	 10SRE-swift-storage, 10MediaViewer: Icon is not visible and returns an error when attempting to view as a PNG - https://phabricator.wikimedia.org/T383023#10431437 (10Aklapper) > * Clicking on the PNG previews displays an error  Hmm, for the small rendered preview at https://upload.wikimedia.org/wikipedia/commo...
[07:44:19] <wikibugs>	 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431453 (10Aklapper)
[07:49:51] <wikibugs>	 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431458 (10DavidEppstein) Discussion at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Image_Preview_Issue for https:/...
[07:50:20] <wikibugs>	 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431463 (10Cyberdog958) {F58132792}  {F58132795} It looks like not just SVG files are affected as others are having the same problem wit...
[07:55:44] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2053-2054].codfw.wmnet
[07:58:37] <awight>	 I'd be happy to deploy, if needed!
[07:59:30] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2053-2054].codfw.wmnet
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250106T0800).
[08:00:05] <jouncebot>	 hubaishan, DreamRimmer, and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:32] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2053.codfw.wmnet with OS bookworm
[08:00:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2054.codfw.wmnet with OS bookworm
[08:00:52] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2053
[08:00:52] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2053
[08:00:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2054
[08:00:53] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2054
[08:02:13] <awight>	 deploy2002 needs an SSH fingerprint published: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/deploy2002.eqiad.wmnet
[08:02:30] <awight>	 I got SHA256:meS3gCKwHzJWtflhVLOotPQVkYEpexjddK6hna5/t/0 , hopefully this is right?
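Editor's note: a `SHA256:` fingerprint like the one awight quotes is, in OpenSSH's format, the unpadded base64 of the SHA-256 digest of the decoded public key blob. The sketch below shows that calculation so a value can be cross-checked against a published host key. The key line used here is a made-up placeholder, not deploy2002's key, and `ssh_sha256_fingerprint` is a hypothetical helper name.

```python
import base64
import hashlib


def ssh_sha256_fingerprint(public_key_line: str) -> str:
    """Fingerprint an OpenSSH public key line of the form 'type base64blob [comment]'."""
    blob = base64.b64decode(public_key_line.split()[1])
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest as base64 with the trailing '=' padding removed.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


# Placeholder blob for illustration only; a real key line would come from
# the host's /etc/ssh/ssh_host_*_key.pub file.
placeholder_blob = base64.b64encode(b"not-a-real-host-key").decode("ascii")
fingerprint = ssh_sha256_fingerprint(f"ssh-ed25519 {placeholder_blob} example@host")
print(fingerprint)
```

On a host with OpenSSH installed, `ssh-keygen -lf` on the public key file prints the same style of fingerprint.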
[08:03:49] <wikibugs>	 06SRE, 06Commons: Backend fetch failed - https://phabricator.wikimedia.org/T383013#10431501 (10Aklapper)
[08:03:49] <DreamRimmer>	 o/
[08:04:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2023 T383026', diff saved to https://phabricator.wikimedia.org/P71795 and previous config saved to /var/cache/conftool/dbconfig/20250106-080405-marostegui.json
[08:04:08] <stashbot>	 T383026: decommission es2023.codfw.wmnet - https://phabricator.wikimedia.org/T383026
[08:04:20] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:05:18] <jinxer-wm>	 FIRING: ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#doc1003.eqiad.wmnet:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:05:35] <wikibugs>	 (03CR) 10Awight: "Permissions look similar to other wikis.  Could add `oathauth-enable` as well, if desired?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106940 (https://phabricator.wikimedia.org/T382784) (owner: 10Hubaishan)
[08:05:38] <wikibugs>	 (03PS1) 10Marostegui: es2023: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1108386 (https://phabricator.wikimedia.org/T383026)
[08:06:06] <awight>	 hubaishan: I'll begin deployment now :-)
[08:06:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2024 T381848', diff saved to https://phabricator.wikimedia.org/P71796 and previous config saved to /var/cache/conftool/dbconfig/20250106-080609-marostegui.json
[08:06:12] <stashbot>	 T381848: Decommission es202[0-5] - https://phabricator.wikimedia.org/T381848
[08:06:32] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] es2023: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1108386 (https://phabricator.wikimedia.org/T383026) (owner: 10Marostegui)
[08:06:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106940 (https://phabricator.wikimedia.org/T382784) (owner: 10Hubaishan)
[08:07:24] <hashar>	 awight: are you running the backports this morning?
[08:07:27] <wikibugs>	 (03Merged) 10jenkins-bot: [arwiki] Add templateeditor user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106940 (https://phabricator.wikimedia.org/T382784) (owner: 10Hubaishan)
[08:07:50] <awight>	 hashar: yes :-)
[08:07:50] <logmsgbot>	 !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1106940|[arwiki] Add templateeditor user group (T382784)]]
[08:07:50] <hashar>	 looks like :-]
[08:07:53] <stashbot>	 T382784: arwiki: create "templateeditor" user group and protection level - https://phabricator.wikimedia.org/T382784
[08:07:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es2023 from dbctl and promote es2046 to es5 master T381848 T383026', diff saved to https://phabricator.wikimedia.org/P71797 and previous config saved to /var/cache/conftool/dbconfig/20250106-080755-marostegui.json
[08:08:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2024 and es2025 T381848', diff saved to https://phabricator.wikimedia.org/P71798 and previous config saved to /var/cache/conftool/dbconfig/20250106-080845-marostegui.json
[08:10:02] <wikibugs>	 (03PS1) 10Marostegui: es2023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108388 (https://phabricator.wikimedia.org/T383026)
[08:10:11] <awight>	 hashar: let me know if the process has changed lately, though?  I'm randomly jumping in on a quiet morning...
[08:10:32] <hashar>	 I am not aware of any change
[08:10:48] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] es2023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108388 (https://phabricator.wikimedia.org/T383026) (owner: 10Marostegui)
[08:10:50] <hashar>	 I am asking cause I woke up only 20 minutes ago (I have a bad cough :(  )
[08:11:32] <wikibugs>	 06SRE, 10Wikidata, 06Wikidata Dev Team, 07Performance Issue: Frequent 500 Errors and Timeouts When Adding Statements to New Properties - https://phabricator.wikimedia.org/T374230#10431535 (10Bugreporter)
[08:11:56] <awight>	 hashar: sorry to hear it!  My morning was rough as well, vacation was long but not particularly relaxing ;-)
[08:13:32] <moritzm>	 !log installing fastnetmon security updates
[08:13:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:58] <logmsgbot>	 !log awight@deploy2002 awight, hubaishan: Backport for [[gerrit:1106940|[arwiki] Add templateeditor user group (T382784)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:14:01] <stashbot>	 T382784: arwiki: create "templateeditor" user group and protection level - https://phabricator.wikimedia.org/T382784
[08:14:26] <awight>	 hubaishan: DreamRimmer: please test the templateeditor patch on mwdebug servers
[08:15:42] <dcausse>	 !log restarting blazegraph on wdqs1014 (BlazegraphFreeAllocatorsDecreasingRapidly)
[08:15:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:05] <hubaishan>	 It is OK.
[08:17:19] <awight>	 ty!
[08:17:22] <logmsgbot>	 !log awight@deploy2002 awight, hubaishan: Continuing with sync
[08:18:23] <dcausse>	 !log restarting blazegraph on wdqs1012 (stuck with high thread count)
[08:18:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:27] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2054.codfw.wmnet with reason: host reimage
[08:18:45] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2053.codfw.wmnet with reason: host reimage
[08:19:06] <awight>	 hashar: hmm, "sync-testservers-k8s" takes 4 minutes, and sync-masters 7 seconds.  Should we look into the test server slowness or is this a known / intentional thing?
[08:21:11] <hashar>	 4 minutes sounds normal?
[08:21:28] <hashar>	 cause that is a helm deployment
[08:21:30] <wikibugs>	 (PS1) Marostegui: wmnet: Update es5-master CNAME [dns] - https://gerrit.wikimedia.org/r/1108389 (https://phabricator.wikimedia.org/T381848)
[08:21:43] <hashar>	 the sync-masters is fast cause that is syncing the baremetal spare deployment server
[08:21:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:22:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2054.codfw.wmnet with reason: host reimage
[08:22:26] <hashar>	 and there is nothing new to sync besides the config file(s) affected by your change
[08:22:29] <awight>	 hashar: kk thanks it makes sense
[08:22:30] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:22:38] <logmsgbot>	 !log awight@deploy2002 Finished scap sync-world: Backport for [[gerrit:1106940|[arwiki] Add templateeditor user group (T382784)]] (duration: 14m 48s)
[08:22:41] <stashbot>	 T382784: arwiki: create "templateeditor" user group and protection level - https://phabricator.wikimedia.org/T382784
[08:22:53] <hashar>	 I think the first sync on monday might take a while if we had some l10n updates received
[08:23:14] <wikibugs>	 (PS3) Awight: Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1106018 (https://phabricator.wikimedia.org/T382649) (owner: Dreamrimmer)
[08:23:17] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update es5-master CNAME [dns] - https://gerrit.wikimedia.org/r/1108389 (https://phabricator.wikimedia.org/T381848) (owner: Marostegui)
[08:23:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1106018 (https://phabricator.wikimedia.org/T382649) (owner: Dreamrimmer)
[08:24:15] <wikibugs>	 (Merged) jenkins-bot: Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1106018 (https://phabricator.wikimedia.org/T382649) (owner: Dreamrimmer)
[08:24:33] <logmsgbot>	 !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1106018|Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 (T382649)]]
[08:24:36] <stashbot>	 T382649: Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 - https://phabricator.wikimedia.org/T382649
[08:24:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[08:25:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:25:58] <jinxer-wm>	 RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:25:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2053.codfw.wmnet with reason: host reimage
[08:29:12] <logmsgbot>	 !log awight@deploy2002 awight, dreamrimmer: Backport for [[gerrit:1106018|Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 (T382649)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:30:36] <awight>	 DreamRimmer: Please check the ptwikinews license change.  (I'm not seeing where this config is surfaced in the UI, fwiw)
[08:30:59] <DreamRimmer>	 checking
[08:33:29] <DreamRimmer>	 looks good to me
[08:35:26] <wikibugs>	 (CR) Awight: "Looks right, but I can't find anywhere that this text will be surfaced.  The Collection extension uses the config, but interestingly I'm f" [mediawiki-config] - https://gerrit.wikimedia.org/r/1106018 (https://phabricator.wikimedia.org/T382649) (owner: Dreamrimmer)
[08:36:30] <logmsgbot>	 !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1106018|Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 (T382649)]]
[08:36:33] <stashbot>	 T382649: Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 - https://phabricator.wikimedia.org/T382649
[08:36:40] <awight>	 DreamRimmer: thanks--unfortunately I need to restart that deployment.
[08:37:02] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1240-1244].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[08:37:20] <DreamRimmer>	 no problem
[08:37:54] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1245-1249].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[08:38:46] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1240.eqiad.wmnet with OS bookworm
[08:39:36] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1245.eqiad.wmnet with OS bookworm
[08:41:04] <logmsgbot>	 !log awight@deploy2002 dreamrimmer, awight: Backport for [[gerrit:1106018|Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 (T382649)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:41:12] <logmsgbot>	 !log awight@deploy2002 dreamrimmer, awight: Continuing with sync
[08:41:46] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2054.codfw.wmnet with OS bookworm
[08:42:45] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2053.codfw.wmnet with OS bookworm
[08:43:07] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2053-2054].codfw.wmnet
[08:43:08] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[2053-2054].codfw.wmnet
[08:44:20] <wikibugs>	 (CR) Awight: "This one seems odd because it's redundant with the default.  Are you sure we need it?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: Dreamrimmer)
[08:44:26] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2053-2054].codfw.wmnet
[08:44:27] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[2053-2054].codfw.wmnet
[08:44:30] <awight>	 DreamRimmer: left a question for you on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1106911
[08:45:26] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:45:59] <logmsgbot>	 !log awight@deploy2002 Finished scap sync-world: Backport for [[gerrit:1106018|Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 (T382649)]] (duration: 09m 28s)
[08:46:01] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2054.codfw.wmnet
[08:46:02] <stashbot>	 T382649: Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 - https://phabricator.wikimedia.org/T382649
[08:46:03] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2054.codfw.wmnet
[08:46:12] <awight>	 DreamRimmer: I'll leave that aside for a moment...
[08:46:16] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2053.codfw.wmnet
[08:46:18] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2053.codfw.wmnet
[08:47:25] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1107557 (https://phabricator.wikimedia.org/T382777) (owner: Anzx)
[08:47:48] <icinga-wm>	 PROBLEM - Disk space on rpki2003 is CRITICAL: DISK CRITICAL - free space: /var/lib/routinator/repository 163MiB (3% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=rpki2003&var-datasource=codfw+prometheus/ops
[08:50:20] <wikibugs>	 (Merged) jenkins-bot: bjnwikiquote: add wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1107557 (https://phabricator.wikimedia.org/T382777) (owner: Anzx)
[08:50:39] <logmsgbot>	 !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1107557|bjnwikiquote: add wordmark (T382777)]]
[08:50:42] <stashbot>	 T382777: Request for Implementation of the Wikiquote Banjar wordmark for bjn.wikiquote.org - https://phabricator.wikimedia.org/T382777
[08:55:27] <logmsgbot>	 !log awight@deploy2002 awight, anzx: Backport for [[gerrit:1107557|bjnwikiquote: add wordmark (T382777)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:55:35] <anzx>	 awight: already see logo change, good to sync
[08:57:04] <awight>	 anzx: thanks!
[08:57:07] <logmsgbot>	 !log awight@deploy2002 awight, anzx: Continuing with sync
[08:57:49] <wikibugs>	 (PS2) Dreamrimmer: Update French wikinews license to CC-BY-SA 4.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946)
[08:58:59] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1240.eqiad.wmnet with reason: host reimage
[08:59:52] <wikibugs>	 (CR) Awight: "PS 2 includes an unrelated change..." [mediawiki-config] - https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: Dreamrimmer)
[09:00:07] <anzx>	 awight: please purge wordmark post sync https://www.irccloud.com/pastebin/r3NQH9OL
[09:00:11] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1245.eqiad.wmnet with reason: host reimage
[09:01:53] <wikibugs>	 (CR) Dreamrimmer: "I will fix it and do it in the next backport." [mediawiki-config] - https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: Dreamrimmer)
[09:02:22] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1240.eqiad.wmnet with reason: host reimage
[09:02:39] <logmsgbot>	 !log awight@deploy2002 Finished scap sync-world: Backport for [[gerrit:1107557|bjnwikiquote: add wordmark (T382777)]] (duration: 11m 59s)
[09:02:42] <stashbot>	 T382777: Request for Implementation of the Wikiquote Banjar wordmark for bjn.wikiquote.org - https://phabricator.wikimedia.org/T382777
[09:03:21] <awight>	 !log UTC morning deployment finished
[09:03:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:14] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2051-2052].codfw.wmnet
[09:04:29] <anzx>	 awight: thank you for deploying
[09:04:57] <wikibugs>	 (CR) Brouberol: [C:+1] "LG!" [puppet] - https://gerrit.wikimedia.org/r/1108089 (https://phabricator.wikimedia.org/T382490) (owner: Btullis)
[09:05:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2051-2052].codfw.wmnet
[09:05:42] <awight>	 gladly!
[09:06:03] <wikibugs>	 (CR) Brouberol: [C:+1] "LG!" [deployment-charts] - https://gerrit.wikimedia.org/r/1108090 (https://phabricator.wikimedia.org/T382490) (owner: Btullis)
[09:06:03] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1245.eqiad.wmnet with reason: host reimage
[09:06:14] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2051.codfw.wmnet with OS bookworm
[09:06:15] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2052.codfw.wmnet with OS bookworm
[09:06:34] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2051
[09:06:34] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2051
[09:06:34] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2052
[09:06:35] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2052
[09:10:19] <icinga-wm>	 PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:10:31] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:18:37] <wikibugs>	 (CR) Muehlenhoff: [C:+2] planet_sync: Cleanup time handling [puppet] - https://gerrit.wikimedia.org/r/1105875 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:21:17] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1240.eqiad.wmnet with OS bookworm
[09:23:05] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1241.eqiad.wmnet with OS bookworm
[09:23:35] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2051.codfw.wmnet with reason: host reimage
[09:24:11] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2052.codfw.wmnet with reason: host reimage
[09:25:10] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1245.eqiad.wmnet with OS bookworm
[09:25:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db2160.codfw.wmnet with reason: upgrade kernel
[09:25:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2160.codfw.wmnet with reason: upgrade kernel
[09:26:37] <marostegui>	 !log Reboot db2160 for kernel upgrade T376905
[09:26:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:55] <dcausse>	 !log depooling wdqs1012 (high lag, forgot to keep it depooled after restarting blazegraph)
[09:26:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:19] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2051.codfw.wmnet with reason: host reimage
[09:29:30] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1246.eqiad.wmnet with OS bookworm
[09:29:58] <wikibugs>	 (CR) Muehlenhoff: [C:+2] planet_sync: Remove obsolete options [puppet] - https://gerrit.wikimedia.org/r/1105876 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:30:08] <wikibugs>	 (PS2) Muehlenhoff: planet_sync: Remove obsolete options [puppet] - https://gerrit.wikimedia.org/r/1105876 (https://phabricator.wikimedia.org/T381565)
[09:31:30] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2052.codfw.wmnet with reason: host reimage
[09:31:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[09:33:42] <wikibugs>	 (CR) Muehlenhoff: [C:+2] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105876 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:35:59] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] planet_sync: Remove obsolete options [puppet] - https://gerrit.wikimedia.org/r/1105876 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[09:37:35] <wikibugs>	 (PS5) Muehlenhoff: Remove obsolete puppetmaster::certmanager class [puppet] - https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798)
[09:39:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:41:10] <dcausse>	 !log repooling wdqs1012
[09:41:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:38] <wikibugs>	 (Abandoned) Dreamrimmer: Update French wikinews license to CC-BY-SA 4.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: Dreamrimmer)
[09:43:34] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1241.eqiad.wmnet with reason: host reimage
[09:46:23] <icinga-wm>	 RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:46:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2051.codfw.wmnet with OS bookworm
[09:47:27] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1241.eqiad.wmnet with reason: host reimage
[09:48:08] <wikibugs>	 SRE-swift-storage, Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10431764 (Aklapper) Cannot reproduce from Central Europe; works as expected here.  What's the exact output (except for your IP) if you try to directly access the thum...
[09:50:04] <wikibugs>	 SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431768 (MatthewVernon) I went looking at swift, and e.g. the Buick thumbnail is correct (and identical) in both clusters.
[09:50:13] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1246.eqiad.wmnet with reason: host reimage
[09:50:43] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:51:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2052.codfw.wmnet with OS bookworm
[09:52:13] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2051.codfw.wmnet
[09:52:15] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2051.codfw.wmnet
[09:52:27] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2052.codfw.wmnet
[09:52:29] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2052.codfw.wmnet
[09:53:28] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2049-2050].codfw.wmnet
[09:54:05] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1246.eqiad.wmnet with reason: host reimage
[09:54:41] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2049-2050].codfw.wmnet
[09:55:16] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2049.codfw.wmnet with OS bookworm
[09:55:16] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2050.codfw.wmnet with OS bookworm
[09:55:28] <wikibugs>	 SRE-swift-storage, Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10431790 (MatthewVernon) FWIW, this thumb is in both swift clusters: ` root@ms-fe2009:/home/mvernon# swift stat wikipedia-commons-local-thumb.f8 'f/f8/Apostolic_Nunci...
[09:55:35] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2049
[09:55:35] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2049
[09:55:36] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2050
[09:55:36] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2050
[09:56:46] <wikibugs>	 (CR) Brouberol: [C:+1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1103313 (owner: Muehlenhoff)
[09:57:39] <wikibugs>	 SRE-swift-storage, Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10431815 (Aklapper) See also {T383023} which is a bit similar.
[10:02:12] <wikibugs>	 SRE-swift-storage, Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10431827 (MatthewVernon) Yeah, I've seen that (I see all the swift-tagged tickets, lucky me), I'll comment there as well.
[10:03:09] <wikibugs>	 (CR) Marostegui: ParserCache: Set connect and recieve timeouts (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) (owner: Ladsgroup)
[10:05:52] <icinga-wm>	 PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:06:13] <wikibugs>	 SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431831 (MatthewVernon) Showing my working, again these are not new thumbs: ` root@ms-fe2009:/home/mvernon# swift stat wikipedia-commo...
[10:06:19] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1241.eqiad.wmnet with OS bookworm
[10:06:29] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[10:06:42] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[10:06:44] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[10:06:50] <claime>	 !log Deploying admin_ng external services changes on all kubernetes clusters
[10:06:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:00] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[10:07:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T371742)', diff saved to https://phabricator.wikimedia.org/P71801 and previous config saved to /var/cache/conftool/dbconfig/20250106-100706-ladsgroup.json
[10:07:09] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[10:07:11] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[10:07:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[10:08:04] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[10:08:05] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1242.eqiad.wmnet with OS bookworm
[10:08:48] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[10:09:03] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[10:09:33] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[10:09:49] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[10:10:36] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:10:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[10:11:39] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[10:11:55] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[10:12:22] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[10:12:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[10:13:05] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2050.codfw.wmnet with reason: host reimage
[10:13:17] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[10:13:25] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[10:13:36] <wikibugs>	 (Restored) Dreamrimmer: Update French wikinews license to CC-BY-SA 4.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: Dreamrimmer)
[10:13:39] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[10:13:45] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1246.eqiad.wmnet with OS bookworm
[10:15:10] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2049.codfw.wmnet with reason: host reimage
[10:15:27] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1247.eqiad.wmnet with OS bookworm
[10:16:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2050.codfw.wmnet with reason: host reimage
[10:16:56] <wikibugs>	 SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10431868 (Ladsgroup) The script is done with 0f: ` root@ms-fe2009:~# swift stat -v --lh wikipedia-commons-local-thumb.0f                    URL: http://ms-fe.svc.codfw.wmnet/v1/AUTH...
[10:16:59] <wikibugs>	 (Abandoned) Hashar: dockerpkg-builder: add to docker group [puppet] - https://gerrit.wikimedia.org/r/1105449 (https://phabricator.wikimedia.org/T382285) (owner: Brennen Bearnes)
[10:18:57] <wikibugs>	 SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431872 (Cyberdog958) The discussion was a little hard to understand, but it looks like they were talking about this thumbnail: https:...
[10:19:48] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2049.codfw.wmnet with reason: host reimage
[10:21:32] <wikibugs>	 (CR) Brouberol: [C:+1] "LGTM thanks!" [puppet] - https://gerrit.wikimedia.org/r/1108041 (owner: Muehlenhoff)
[10:21:50] <wikibugs>	 (PS1) Marostegui: dbproxy1020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1108405 (https://phabricator.wikimedia.org/T383025)
[10:22:17] <wikibugs>	 (CR) Marostegui: [C:+2] dbproxy1020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1108405 (https://phabricator.wikimedia.org/T383025) (owner: Marostegui)
[10:22:32] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:22:55] <wikibugs>	 SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431894 (MatthewVernon) So that's https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Buick_Regal_2_--_10-30-2009.jpg/280px-Buic...
[10:24:09] <wikibugs>	 (CR) Ladsgroup: ParserCache: Set connect and recieve timeouts (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) (owner: Ladsgroup)
[10:24:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10431897 (phaultfinder)
[10:25:28] <wikibugs>	 SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431900 (Cyberdog958) {F58133216} No I get the same error on both my main computer and my phone.
[10:25:32] <wikibugs>	 (PS3) Dreamrimmer: Update French wikinews license to CC-BY-SA 4.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946)
[10:26:13] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "No mentions of the flag in wmf.8 code, which is fully deployed by now:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1100083 (https://phabricator.wikimedia.org/T333667) (owner: Arthur taylor)
[10:26:36] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1100083 (https://phabricator.wikimedia.org/T333667) (owner: Arthur taylor)
[10:28:36] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1242.eqiad.wmnet with reason: host reimage
[10:32:45] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1242.eqiad.wmnet with reason: host reimage
[10:35:33] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:35:34] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1247.eqiad.wmnet with reason: host reimage
[10:36:11] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2050.codfw.wmnet with OS bookworm
[10:37:32] <wikibugs>	 (CR) Marostegui: "Yeah, let's go for 5 now and then we can see if we need further adjustments." [mediawiki-config] - https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) (owner: Ladsgroup)
[10:39:01] <icinga-wm>	 RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:39:09] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2049.codfw.wmnet with OS bookworm
[10:39:12] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1247.eqiad.wmnet with reason: host reimage
[10:39:38] <wikibugs>	 (PS2) Ladsgroup: ParserCache: Set connect and recieve timeouts [mediawiki-config] - https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076)
[10:39:51] <wikibugs>	 (CR) Ladsgroup: ParserCache: Set connect and recieve timeouts (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) (owner: Ladsgroup)
[10:40:10] <wikibugs>	 (CR) Marostegui: [C:+1] "Thank you!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) (owner: Ladsgroup)
[10:41:03] <wikibugs>	 SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431941 (MatthewVernon) Can you get your browser's developer tools option to dump the request, please? That should give us the HTTP st...
[10:42:23] <wikibugs>	 SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431944 (MatthewVernon) (if you're using chrome, I think [[ https://stackoverflow.com/questions/4423061/how-can-i-view-http-headers-in...
[10:49:01] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2049.codfw.wmnet
[10:49:03] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2049.codfw.wmnet
[10:49:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2050.codfw.wmnet
[10:49:36] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2050.codfw.wmnet
[10:50:29] <wikibugs>	 SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10432010 (Cyberdog958) HTTP/2 401  content-type: text/html; charset=UTF-8 content-length: 131 www-authenticate: Swift realm="AUTH_mw" d...
[10:51:07] <wikibugs>	 SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10432011 (Cyberdog958) I'm using firefox but that's what it spit out.
[10:51:12] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1242.eqiad.wmnet with OS bookworm
[10:51:36] <wikibugs>	 SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10432013 (MatthewVernon) Hm, this isn't correct ` root@ms-fe2009:/home/mvernon# swift stat wikipedia-commons-local-thumb.f8 Container '...
[10:52:50] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] Add caps to allow ceph-csi-cephfs to work with the dumps filesystem [puppet] - https://gerrit.wikimedia.org/r/1108089 (https://phabricator.wikimedia.org/T382490) (owner: Btullis)
[10:52:58] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1243.eqiad.wmnet with OS bookworm
[10:54:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10432018 (phaultfinder)
[10:55:43] <wikibugs>	 (CR) Btullis: [C:+2] Add a storageclass for the dumps file system (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1108090 (https://phabricator.wikimedia.org/T382490) (owner: Btullis)
[10:57:08] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[10:58:26] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1247.eqiad.wmnet with OS bookworm
[10:59:51] <wikibugs>	 (Merged) jenkins-bot: Add a storageclass for the dumps file system [deployment-charts] - https://gerrit.wikimedia.org/r/1108090 (https://phabricator.wikimedia.org/T382490) (owner: Btullis)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250106T1100)
[11:00:16] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1248.eqiad.wmnet with OS bookworm
[11:06:43] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2046,2048].codfw.wmnet
[11:07:53] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2046,2048].codfw.wmnet
[11:08:35] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2048.codfw.wmnet with OS bookworm
[11:08:36] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2046.codfw.wmnet with OS bookworm
[11:08:54] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2048
[11:08:55] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2046
[11:08:55] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2046
[11:08:56] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2048
[11:09:43] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[11:09:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[11:12:32] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:12:48] <icinga-wm>	 PROBLEM - BGP status on lsw1-a8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:16:13] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:16:22] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:18:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10432068 (phaultfinder)
[11:20:33] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1248.eqiad.wmnet with reason: host reimage
[11:24:19] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1248.eqiad.wmnet with reason: host reimage
[11:26:36] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2048.codfw.wmnet with reason: host reimage
[11:28:20] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2046.codfw.wmnet with reason: host reimage
[11:28:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:31:01] <wikibugs>	 SRE-tools, Infrastructure-Foundations: debmonitor: show OS release name in the host view - https://phabricator.wikimedia.org/T240193#10432093 (hashar) Invalid→Resolved a:elukey
[11:32:10] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2048.codfw.wmnet with reason: host reimage
[11:33:19] <wikibugs>	 (PS1) Btullis: Enable cephfs volumes for mediawiki-dumps-legacy namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1108411 (https://phabricator.wikimedia.org/T382490)
[11:35:51] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2046.codfw.wmnet with reason: host reimage
[11:45:29] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1248.eqiad.wmnet with OS bookworm
[11:47:15] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1249.eqiad.wmnet with OS bookworm
[11:47:29] <icinga-wm>	 RECOVERY - Host doc2002 is UP: PING OK - Packet loss = 0%, RTA = 30.64 ms
[11:47:54] <moritzm>	 !log fix /etc/network/interfaces on doc2002 T382610
[11:47:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:57] <stashbot>	 T382610: Low disk space: doc1003 / doc2002 - https://phabricator.wikimedia.org/T382610
[11:48:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T371742)', diff saved to https://phabricator.wikimedia.org/P71803 and previous config saved to /var/cache/conftool/dbconfig/20250106-114844-ladsgroup.json
[11:48:47] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[11:50:18] <jinxer-wm>	 RESOLVED: ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#doc1003.eqiad.wmnet:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:52:33] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:52:36] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2048.codfw.wmnet with OS bookworm
[11:54:57] <icinga-wm>	 RECOVERY - BGP status on lsw1-a8-codfw.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:55:38] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2046.codfw.wmnet with OS bookworm
[11:56:03] <wikibugs>	 (PS1) Marostegui: mariadb: Remove es2023 [puppet] - https://gerrit.wikimedia.org/r/1108412 (https://phabricator.wikimedia.org/T383026)
[11:56:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es2023.codfw.wmnet
[11:57:38] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Remove es2023 [puppet] - https://gerrit.wikimedia.org/r/1108412 (https://phabricator.wikimedia.org/T383026) (owner: Marostegui)
[12:01:19] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[12:03:28] <wikibugs>	 SRE-swift-storage, Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10432151 (Yiming) >>! In T383034#10431764, @Aklapper wrote: > Cannot reproduce from Central Europe; works as expected here. >  > What's the exact output (except for your IP...
[12:03:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P71804 and previous config saved to /var/cache/conftool/dbconfig/20250106-120351-ladsgroup.json
[12:04:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2023.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[12:04:55] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2023.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[12:04:55] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:04:56] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2023.codfw.wmnet
[12:06:02] <wikibugs>	 ops-codfw, DBA, DC-Ops, decommission-hardware: decommission es2023.codfw.wmnet - https://phabricator.wikimedia.org/T383026#10432154 (Marostegui) a:Marostegui→None
[12:06:07] <wikibugs>	 ops-codfw, DBA, DC-Ops, decommission-hardware: decommission es2023.codfw.wmnet - https://phabricator.wikimedia.org/T383026#10432159 (Marostegui) This is ready for #dc-ops
[12:07:25] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1249.eqiad.wmnet with reason: host reimage
[12:11:01] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1249.eqiad.wmnet with reason: host reimage
[12:14:06] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1243.eqiad.wmnet with OS bookworm
[12:15:46] <wikibugs>	 SRE-swift-storage, Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10432175 (ZhaoFJx) Yiming discussed with me, and I just want to say that image can be opened in North America for me
[12:18:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P71805 and previous config saved to /var/cache/conftool/dbconfig/20250106-121858-ladsgroup.json
[12:25:10] <wikibugs>	 (PS1) Jon Harald Søby: Add bfw, gju-arab, gju-deva, hoc and kgg to wmgExtraLanguageNames [mediawiki-config] - https://gerrit.wikimedia.org/r/1108403 (https://phabricator.wikimedia.org/T381934)
[12:25:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:25:31] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108403 (https://phabricator.wikimedia.org/T381934) (owner: Jon Harald Søby)
[12:27:06] <wikibugs>	 (CR) Brouberol: [C:+1] Enable cephfs volumes for mediawiki-dumps-legacy namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1108411 (https://phabricator.wikimedia.org/T382490) (owner: Btullis)
[12:27:29] <Emperor>	 !log swift post wikipedia-commons-local-thumb.f8 --read-acl 'mw:thumbor,mw:media,.r:*' --write-acl 'mw:thumbor,mw:media' ms-fe2009 per T383034
[12:27:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:32] <stashbot>	 T383034: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034
[12:28:39] <wikibugs>	 SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10432181 (MatthewVernon) @Cyberdog958 I think this should be resolved now. Can you try again, please?
[12:30:21] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1249.eqiad.wmnet with OS bookworm
[12:30:24] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1245-1249].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[12:34:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T371742)', diff saved to https://phabricator.wikimedia.org/P71806 and previous config saved to /var/cache/conftool/dbconfig/20250106-123405-ladsgroup.json
[12:34:07] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance
[12:34:08] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[12:34:10] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance
[12:34:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1193 (T371742)', diff saved to https://phabricator.wikimedia.org/P71807 and previous config saved to /var/cache/conftool/dbconfig/20250106-123416-ladsgroup.json
[12:34:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10432189 (phaultfinder)
[12:35:42] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete puppetmaster::standalone role [puppet] - https://gerrit.wikimedia.org/r/1108428 (https://phabricator.wikimedia.org/T351452)
[12:37:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db1217.eqiad.wmnet with reason: upgrade kernel
[12:37:44] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1217.eqiad.wmnet with reason: upgrade kernel
[12:38:32] <Amir1>	 jouncebot: nowandnext
[12:38:33] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 21 minute(s)
[12:38:33] <jouncebot>	 In 1 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250106T1400)
[12:39:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) (owner: Ladsgroup)
[12:40:37] <wikibugs>	 (Merged) jenkins-bot: ParserCache: Set connect and recieve timeouts [mediawiki-config] - https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) (owner: Ladsgroup)
[12:40:55] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1108141|ParserCache: Set connect and recieve timeouts (T378076 T373037)]]
[12:40:59] <stashbot>	 T378076: Parsercache issues in codfw causing large-scale outage - https://phabricator.wikimedia.org/T378076
[12:40:59] <stashbot>	 T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[12:41:01] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:41:01] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:41:17] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:41:17] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:41:31] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:41:33] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:41:45] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1028 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:41:47] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:41:55] <marostegui>	 ^ expected
[12:43:01] <claime>	 ack
[12:43:17] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:43:17] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:43:31] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:43:33] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:43:45] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1028 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:43:47] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:44:01] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:44:01] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:46:34] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1108141|ParserCache: Set connect and recieve timeouts (T378076 T373037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:46:38] <stashbot>	 T378076: Parsercache issues in codfw causing large-scale outage - https://phabricator.wikimedia.org/T378076
[12:46:38] <stashbot>	 T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[12:48:26] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[12:51:28] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2048.codfw.wmnet
[12:51:31] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2048.codfw.wmnet
[12:51:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2046.codfw.wmnet
[12:51:40] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2046.codfw.wmnet
[12:52:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2044-2045].codfw.wmnet
[12:53:46] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2044-2045].codfw.wmnet
[12:54:34] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1108141|ParserCache: Set connect and recieve timeouts (T378076 T373037)]] (duration: 13m 39s)
[12:54:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2045.codfw.wmnet with OS bookworm
[12:54:38] <stashbot>	 T378076: Parsercache issues in codfw causing large-scale outage - https://phabricator.wikimedia.org/T378076
[12:54:38] <stashbot>	 T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[12:54:44] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2044.codfw.wmnet with OS bookworm
[12:54:57] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2045
[12:54:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2045
[12:55:02] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2044
[12:55:03] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2044
[12:57:30] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108428 (https://phabricator.wikimedia.org/T351452) (owner: Muehlenhoff)
[12:58:39] <icinga-wm>	 PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:58:39] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:59:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10432238 (phaultfinder)
[13:04:30] <wikibugs>	 SRE-swift-storage, Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10432256 (Yiming) Update:  I also found that https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Zh-wikipedia-200611121821.png/104px-Zh-wikipedia-200611121821.p...
[13:07:11] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete WMCS Puppet 5 master classes no longer used/needed [puppet] - https://gerrit.wikimedia.org/r/1108430 (https://phabricator.wikimedia.org/T365798)
[13:08:09] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108430 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[13:11:57] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2044.codfw.wmnet with reason: host reimage
[13:12:51] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2045.codfw.wmnet with reason: host reimage
[13:13:15] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1243.eqiad.wmnet with OS bookworm
[13:14:13] <wikibugs>	 (PS1) Muehlenhoff: Remove one additional obsolete Puppet 5 for Cloud VPS class [puppet] - https://gerrit.wikimedia.org/r/1108431 (https://phabricator.wikimedia.org/T365798)
[13:15:13] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2044.codfw.wmnet with reason: host reimage
[13:17:59] <wikibugs>	 SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10432320 (Cyberdog958) Yes it is now working on all my devices. Thanks for the fix.
[13:18:42] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2045.codfw.wmnet with reason: host reimage
[13:20:13] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108431 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[13:25:50] <wikibugs>	 ops-eqiad, collaboration-services, DC-Ops, Prod-Kubernetes, and 2 others: Comm Error: backplane 0 when reimaging wikikube-worker1243 - https://phabricator.wikimedia.org/T383051 (JMeybohm) NEW
[13:29:01] <wikibugs>	 ops-eqiad, collaboration-services, DC-Ops, Prod-Kubernetes, and 2 others: Comm Error: backplane 0 when reimaging wikikube-worker1243 - https://phabricator.wikimedia.org/T383051#10432351 (JMeybohm) The following commands have to be executed when the host is back (just noting it down so I don't for...
[13:34:38] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2044.codfw.wmnet with OS bookworm
[13:34:39] <icinga-wm>	 RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:37:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:38:02] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for reimaging wikikube-worker1243 - https://phabricator.wikimedia.org/T383051#10432376 (JMeybohm)
[13:38:10] <wikibugs>	 (CR) Btullis: [C:+2] Enable cephfs volumes for mediawiki-dumps-legacy namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1108411 (https://phabricator.wikimedia.org/T382490) (owner: Btullis)
[13:38:41] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:39:20] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:39:32] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2045.codfw.wmnet with OS bookworm
[13:39:45] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for reimaging wikikube-worker1243 - https://phabricator.wikimedia.org/T383051#10432382 (10JMeybohm) a:03Jclark-ctr
[13:40:15] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243 - https://phabricator.wikimedia.org/T383051#10432384 (10JMeybohm)
[13:40:20] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2045.codfw.wmnet
[13:40:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2045.codfw.wmnet
[13:40:34] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2044.codfw.wmnet
[13:40:36] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2044.codfw.wmnet
[13:41:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1057 - https://phabricator.wikimedia.org/T381676#10432390 (10JMeybohm) a:03Jclark-ctr
[13:41:27] <wikibugs>	 (03Merged) 10jenkins-bot: Enable cephfs volumes for mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108411 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis)
[13:42:08] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T383051#10432395 (10JMeybohm)
[13:42:11] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (DIFF 2 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4731/" [puppet] - https://gerrit.wikimedia.org/r/1108112 (https://phabricator.wikimedia.org/T382964) (owner: AOkoth)
[13:42:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1057.eqiad.wmnet - https://phabricator.wikimedia.org/T381676#10432397 (10JMeybohm)
[13:43:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T381770#10432399 (10JMeybohm) a:03Jclark-ctr
[13:44:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10432405 (10JMeybohm)
[13:44:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10432407 (10JMeybohm) a:03Jclark-ctr
[13:44:55] <wikibugs>	 (CR) Jelto: [C:-1] "this will create two blackbox checks, one in eqiad and one in codfw both probing `doc.wikimedia.org`. The blackbox check should be gated b" [puppet] - https://gerrit.wikimedia.org/r/1108112 (https://phabricator.wikimedia.org/T382964) (owner: AOkoth)
[13:44:56] <wikibugs>	 10SRE-swift-storage, 06Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10432408 (10MatthewVernon) @Yiming no, that's a different problem - you're getting throttled because of repeated thumbnail generation failures for that file. Which is b...
[13:45:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1081.eqiad.wmnet - https://phabricator.wikimedia.org/T381878#10432410 (10JMeybohm)
[13:46:07] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2042-2043].codfw.wmnet
[13:47:16] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2042-2043].codfw.wmnet
[13:47:29] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1250-1252].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[13:47:39] <wikibugs>	 06SRE, 10ChangeProp, 10EventStreams, 10Recommendation-API, and 2 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750#10432418 (10Jdforrester-WMF)
[13:49:23] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2043.codfw.wmnet with OS bookworm
[13:49:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2042.codfw.wmnet with OS bookworm
[13:49:42] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2043
[13:49:43] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2043
[13:49:44] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2042
[13:49:44] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2042
[13:51:45] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1250.eqiad.wmnet with OS bookworm
[13:53:09] <icinga-wm>	 PROBLEM - BGP status on lsw1-a8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:53:41] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:56:00] <wikibugs>	 (03PS6) 10Jforrester: ExtensionDistributor: Mark 1.43 as stable; remove 1.41 as EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106038 (https://phabricator.wikimedia.org/T372331) (owner: 10MacFan4000)
[13:56:47] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106038 (https://phabricator.wikimedia.org/T372331) (owner: 10MacFan4000)
[13:58:32] <wikibugs>	 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw - https://phabricator.wikimedia.org/T383053 (10MatthewVernon) 03NEW
[13:58:38] <wikibugs>	 (CR) Jforrester: Update French wikinews license to CC-BY-SA 4.0 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: Dreamrimmer)
[13:59:10] <wikibugs>	 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw - https://phabricator.wikimedia.org/T383053#10432447 (10MatthewVernon)
[13:59:11] <wikibugs>	 10SRE-swift-storage, 06Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10432448 (10MatthewVernon)
[13:59:12] <wikibugs>	 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10432449 (10MatthewVernon)
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250106T1400).
[14:00:05] <jouncebot>	 DreamRimmer, Lucas_WMDE, Jhs, and James_F: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:08] <wikibugs>	 SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10432451 (MatthewVernon) Open→Resolved a:MatthewVernon [closing this task, leaving the parent for looking at the underl...
[14:00:38] <wikibugs>	 SRE-swift-storage, Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10432455 (MatthewVernon) Open→Resolved a:MatthewVernon The presenting issue is fixed, there's a parent task for the underlying issue.
[14:01:11] <Jhs>	 o/ present
[14:01:22] <DreamRimmer>	 hello
[14:01:25] <James_F>	 Is anyone else around to deploy?
[14:01:49] <Lucas_WMDE>	 o/
[14:01:57] <James_F>	 Eh, OK, I'll do it.
[14:02:00] <Lucas_WMDE>	 I can deploy
[14:02:16] <James_F>	 Oh, awesome, over to Lucas_WMDE.
[14:03:04] <Lucas_WMDE>	 so many changes
[14:03:05] * Lucas_WMDE looks
[14:04:15] <marostegui>	 !log Deploy schema change on x1 dbmaint eqiad T383052
[14:04:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:17] <stashbot>	 T383052: Full table scan query on wikishared - https://phabricator.wikimedia.org/T383052
[14:05:32] <Lucas_WMDE>	 any thoughts on https://phabricator.wikimedia.org/T382879#10432491 ? (about the 2FA change)
[14:05:49] <Lucas_WMDE>	 let’s go ahead with the first two changes by DreamRimmer for now
[14:06:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107028 (https://phabricator.wikimedia.org/T382785) (owner: 10Dreamrimmer)
[14:06:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108171 (https://phabricator.wikimedia.org/T382887) (owner: 10Dreamrimmer)
[14:07:12] <wikibugs>	 (03Merged) 10jenkins-bot: Add mergehistory to import and transwiki on en.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107028 (https://phabricator.wikimedia.org/T382785) (owner: 10Dreamrimmer)
[14:07:14] <wikibugs>	 (03Merged) 10jenkins-bot: Add suppressredirect and delete-redirect to en.wikinews reviewers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108171 (https://phabricator.wikimedia.org/T382887) (owner: 10Dreamrimmer)
[14:07:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1107028|Add mergehistory to import and transwiki on en.wikibooks (T382785)]], [[gerrit:1108171|Add suppressredirect and delete-redirect to en.wikinews reviewers (T382887)]]
[14:07:36] <stashbot>	 T382785: Add mergehistory to importers on en.wikibooks - https://phabricator.wikimedia.org/T382785
[14:07:36] <stashbot>	 T382887: Add suppressredirect and delete-redirect to en.wikinews reviewers - https://phabricator.wikimedia.org/T382887
[14:07:49] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2043.codfw.wmnet with reason: host reimage
[14:08:17] <wikibugs>	 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw - https://phabricator.wikimedia.org/T383053#10432510 (10MatthewVernon)
[14:08:49] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2042.codfw.wmnet with reason: host reimage
[14:10:09] <marostegui>	 !log Deploy schema change on x1 dbmaint codfw T383052
[14:10:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:12] <stashbot>	 T383052: Full table scan query on wikishared - https://phabricator.wikimedia.org/T383052
[14:10:54] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2043.codfw.wmnet with reason: host reimage
[14:12:14] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1250.eqiad.wmnet with reason: host reimage
[14:12:52] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dreamrimmer: Backport for [[gerrit:1107028|Add mergehistory to import and transwiki on en.wikibooks (T382785)]], [[gerrit:1108171|Add suppressredirect and delete-redirect to en.wikinews reviewers (T382887)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:12:56] <stashbot>	 T382785: Add mergehistory to importers on en.wikibooks - https://phabricator.wikimedia.org/T382785
[14:12:57] <stashbot>	 T382887: Add suppressredirect and delete-redirect to en.wikinews reviewers - https://phabricator.wikimedia.org/T382887
[14:13:08] <DreamRimmer>	 checking
[14:13:23] <wikibugs>	 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw - https://phabricator.wikimedia.org/T383053#10432515 (10MatthewVernon) Narrow the time window down thus:  ` sudo cumin "A:codfw and P{O:swift::proxy}" "zgrep -F 'wikipedia-commons-local-thumb.f8' /var/log/swift/proxy-a...
[14:13:24] <Lucas_WMDE>	 thanks
[14:14:19] <Lucas_WMDE>	 changes on enwikibooks and enwikinews look good to me fwiw
[14:14:21] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2042.codfw.wmnet with reason: host reimage
[14:15:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T371742)', diff saved to https://phabricator.wikimedia.org/P71808 and previous config saved to /var/cache/conftool/dbconfig/20250106-141520-ladsgroup.json
[14:15:23] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[14:15:24] <DreamRimmer>	 both look good to me
[14:15:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dreamrimmer: Continuing with sync
[14:15:28] <DreamRimmer>	 https://en.wikibooks.org/w/api.php?action=query&format=json&meta=siteinfo&formatversion=2&siprop=usergroups
[14:15:50] <Lucas_WMDE>	 and after that I’d actually go out-of-order and prioritize James_F, the ExtensionDistributor update sounds more important to me than my config cleanup or Jhs’ extra language codes
[14:18:11] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1250.eqiad.wmnet with reason: host reimage
[14:22:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1107028|Add mergehistory to import and transwiki on en.wikibooks (T382785)]], [[gerrit:1108171|Add suppressredirect and delete-redirect to en.wikinews reviewers (T382887)]] (duration: 15m 02s)
[14:22:38] <stashbot>	 T382785: Add mergehistory to importers on en.wikibooks - https://phabricator.wikimedia.org/T382785
[14:22:38] <stashbot>	 T382887: Add suppressredirect and delete-redirect to en.wikinews reviewers - https://phabricator.wikimedia.org/T382887
[14:22:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106038 (https://phabricator.wikimedia.org/T372331) (owner: 10MacFan4000)
[14:23:31] <wikibugs>	 (03Merged) 10jenkins-bot: ExtensionDistributor: Mark 1.43 as stable; remove 1.41 as EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106038 (https://phabricator.wikimedia.org/T372331) (owner: 10MacFan4000)
[14:23:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1106038|ExtensionDistributor: Mark 1.43 as stable; remove 1.41 as EOL (T372331 T376550)]]
[14:23:53] <stashbot>	 T372331: Mark REL1_43 in ExtensionDistributor as a stable release - https://phabricator.wikimedia.org/T372331
[14:23:53] <stashbot>	 T376550: Formally EOL MW 1.41 - https://phabricator.wikimedia.org/T376550
[14:24:36] <James_F>	 Whee.
[14:25:15] <Lucas_WMDE>	 oh, I should’ve asked if you wanted to self-service I guess ^^
[14:25:56] <DreamRimmer>	 thanks, Lucas
[14:26:35] <James_F>	 Lucas_WMDE: It's more than fine. Thank you! :-)
[14:26:54] <Lucas_WMDE>	 ok :)
[14:27:51] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 macfan4000, lucaswerkmeister-wmde: Backport for [[gerrit:1106038|ExtensionDistributor: Mark 1.43 as stable; remove 1.41 as EOL (T372331 T376550)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:28:44] <moritzm>	 !log installing libvirt bugfix updates
[14:28:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:15] <Lucas_WMDE>	 https://www.mediawiki.org/wiki/Special:ExtensionDistributor/Wikibase looks good to me with WikimediaDebug
[14:29:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10432623 (10phaultfinder)
[14:30:20] <James_F>	 Lucas_WMDE: Yeah, all good to deploy.
[14:30:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P71809 and previous config saved to /var/cache/conftool/dbconfig/20250106-143027-ladsgroup.json
[14:30:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 macfan4000, lucaswerkmeister-wmde: Continuing with sync
[14:30:49] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2043.codfw.wmnet with OS bookworm
[14:30:51] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:32:33] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10432639 (10MoritzMuehlenhoff)
[14:33:28] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1243.eqiad.wmnet with OS bookworm
[14:33:47] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2042.codfw.wmnet with OS bookworm
[14:34:11] <icinga-wm>	 RECOVERY - BGP status on lsw1-a8-codfw.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:34:22] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2043.codfw.wmnet
[14:34:24] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2043.codfw.wmnet
[14:34:30] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2042.codfw.wmnet
[14:34:32] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2042.codfw.wmnet
[14:34:39] <wikibugs>	 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10Phabricator: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056 (10WMDE-leszek) 03NEW
[14:35:20] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2040-2041].codfw.wmnet
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:04] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1106038|ExtensionDistributor: Mark 1.43 as stable; remove 1.41 as EOL (T372331 T376550)]] (duration: 14m 14s)
[14:38:08] <stashbot>	 T372331: Mark REL1_43 in ExtensionDistributor as a stable release - https://phabricator.wikimedia.org/T372331
[14:38:08] <stashbot>	 T376550: Formally EOL MW 1.41 - https://phabricator.wikimedia.org/T376550
[14:38:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108403 (https://phabricator.wikimedia.org/T381934) (owner: 10Jon Harald Søby)
[14:38:41] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1250.eqiad.wmnet with OS bookworm
[14:39:03] <wikibugs>	 (03Merged) 10jenkins-bot: Add bfw, gju-arab, gju-deva, hoc and kgg to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108403 (https://phabricator.wikimedia.org/T381934) (owner: 10Jon Harald Søby)
[14:39:08] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2040-2041].codfw.wmnet
[14:39:19] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1108403|Add bfw, gju-arab, gju-deva, hoc and kgg to wmgExtraLanguageNames (T381934)]]
[14:39:22] <stashbot>	 T381934: Add bfw, gju, hoc and kgg to language names - https://phabricator.wikimedia.org/T381934
[14:40:29] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1251.eqiad.wmnet with OS bookworm
[14:40:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2041.codfw.wmnet with OS bookworm
[14:40:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2040.codfw.wmnet with OS bookworm
[14:40:57] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2041
[14:40:57] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2041
[14:40:58] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2040
[14:40:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2040
[14:44:47] <icinga-wm>	 PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:44:51] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:45:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P71810 and previous config saved to /var/cache/conftool/dbconfig/20250106-144534-ladsgroup.json
[14:45:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, jhsoby: Backport for [[gerrit:1108403|Add bfw, gju-arab, gju-deva, hoc and kgg to wmgExtraLanguageNames (T381934)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:46:02] <stashbot>	 T381934: Add bfw, gju, hoc and kgg to language names - https://phabricator.wikimedia.org/T381934
[14:46:04] <Lucas_WMDE>	 Jhs: please test :)
[14:47:50] <Jhs>	 Lucas_WMDE, works as expected 👍 
[14:49:14] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, jhsoby: Continuing with sync
[14:49:16] <Lucas_WMDE>	 \o/
[14:54:03] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker1243.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1243.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:56:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1108403|Add bfw, gju-arab, gju-deva, hoc and kgg to wmgExtraLanguageNames (T381934)]] (duration: 17m 11s)
[14:56:34] <stashbot>	 T381934: Add bfw, gju, hoc and kgg to language names - https://phabricator.wikimedia.org/T381934
[14:57:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100083 (https://phabricator.wikimedia.org/T333667) (owner: 10Arthur taylor)
[14:58:02] <wikibugs>	 (03Merged) 10jenkins-bot: Remove EntitySchema DataType feature flag - is always enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100083 (https://phabricator.wikimedia.org/T333667) (owner: 10Arthur taylor)
[14:58:21] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1100083|Remove EntitySchema DataType feature flag - is always enabled (T333667)]]
[14:58:23] <stashbot>	 T333667: [ES-M5] Remove temporary feature flag for EntitySchema Datatype again - https://phabricator.wikimedia.org/T333667
[14:58:28] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2041.codfw.wmnet with reason: host reimage
[14:58:55] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2040.codfw.wmnet with reason: host reimage
[15:00:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T371742)', diff saved to https://phabricator.wikimedia.org/P71811 and previous config saved to /var/cache/conftool/dbconfig/20250106-150040-ladsgroup.json
[15:00:44] <stashbot>	 T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[15:00:53] <wikibugs>	 sre-alert-triage, Infrastructure-Foundations, Patch-For-Review: Alert in need of triage: PuppetConstantChange (instance ganeti-test2003:9100) - https://phabricator.wikimedia.org/T382870#10432768 (MoritzMuehlenhoff) p:Triage→Medium
[15:00:57] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1251.eqiad.wmnet with reason: host reimage
[15:01:27] <Lucas_WMDE>	 oh dear, the window’s already over?
[15:01:30] <Lucas_WMDE>	 jouncebot: now
[15:01:30] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 28 minute(s)
[15:01:33] <Lucas_WMDE>	 ok phew
[15:01:37] <Lucas_WMDE>	 I’ll just keep deploying my config cleanup then
[15:02:04] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, arthurtaylor: Backport for [[gerrit:1100083|Remove EntitySchema DataType feature flag - is always enabled (T333667)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:02:30] * Lucas_WMDE tests
[15:03:05] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2041.codfw.wmnet with reason: host reimage
[15:04:17] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, arthurtaylor: Continuing with sync
[15:04:21] <Lucas_WMDE>	 works afaict
[15:05:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10432786 (10phaultfinder)
[15:06:39] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1251.eqiad.wmnet with reason: host reimage
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:19] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Looks technically fine but not deployed today per my comments on the task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107936 (https://phabricator.wikimedia.org/T382879) (owner: 10Novem Linguae)
[15:09:34] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2040.codfw.wmnet with reason: host reimage
[15:09:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[15:09:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[15:11:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100083|Remove EntitySchema DataType feature flag - is always enabled (T333667)]] (duration: 13m 25s)
[15:11:49] <stashbot>	 T333667: [ES-M5] Remove temporary feature flag for EntitySchema Datatype again - https://phabricator.wikimedia.org/T333667
[15:12:08] <wikibugs>	 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw - https://phabricator.wikimedia.org/T383053#10432813 (10MatthewVernon) I found nothing on the proxy-servers, but on ms-be2058 (the first node in the ring for this container), I find (`#012` in log line converted to new...
[15:12:40] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[15:12:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:46] <wikibugs>	 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10432815 (10Aklapper)
[15:12:51] <wikibugs>	 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10432816 (10Aklapper)
[15:12:59] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:13:08] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:15:18] <wikibugs>	 (03CR) 10AOkoth: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108112 (https://phabricator.wikimedia.org/T382964) (owner: 10AOkoth)
[15:17:43] <wikibugs>	 (03CR) 10AOkoth: "Ack." [puppet] - 10https://gerrit.wikimedia.org/r/1108112 (https://phabricator.wikimedia.org/T382964) (owner: 10AOkoth)
[15:19:17] <jinxer-wm>	 FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[15:19:35] <cdanis>	 !incidents
[15:19:35] <sirenbot>	 5580 (UNACKED)  NELHigh sre (thanos-rule tcp.timed_out)
[15:19:37] <cdanis>	 !ack 5580
[15:19:38] <sirenbot>	 5580 (ACKED)  NELHigh sre (thanos-rule tcp.timed_out)
[15:20:21] <claime>	 here
[15:22:53] <icinga-wm>	 RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:23:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2041.codfw.wmnet with OS bookworm
[15:23:28] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1244.eqiad.wmnet with OS bookworm
[15:24:17] <jinxer-wm>	 RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[15:24:55] <wikibugs>	 (03CR) 10JMeybohm: create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková)
[15:25:21] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1251.eqiad.wmnet with OS bookworm
[15:27:03] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1252.eqiad.wmnet with OS bookworm
[15:27:40] <wikibugs>	 06SRE, 06Traffic: Reveal IP after click only on Varnish error pages - https://phabricator.wikimedia.org/T383062 (10Diskdance) 03NEW
[15:28:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:29:40] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2040.codfw.wmnet with OS bookworm
[15:29:58] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:31:10] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2040.codfw.wmnet
[15:31:12] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2040.codfw.wmnet
[15:31:25] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2041.codfw.wmnet
[15:31:28] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2041.codfw.wmnet
[15:31:52] <wikibugs>	 (03PS2) 10Ottomata: Disable varnish handling of /beacon/event on cp1100 [puppet] - 10https://gerrit.wikimedia.org/r/1105076 (https://phabricator.wikimedia.org/T353817)
[15:32:00] <wikibugs>	 (03CR) 10Ottomata: "Done." [puppet] - 10https://gerrit.wikimedia.org/r/1105076 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[15:33:12] <wikibugs>	 (03CR) 10Ssingh: Disable varnish handling of /beacon/event on cp1100 [puppet] - 10https://gerrit.wikimedia.org/r/1105076 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[15:35:23] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751#10432882 (10Jhancock.wm) going to replace the disk. two notes the server is out of warranty so it's a repurposed disk. getting an error on DIMM B6. going to replace it as well from decommed stock.    Th...
[15:36:08] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751#10432883 (10Marostegui) Thank you @Jhancock.wm!
[15:38:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2143.codfw.wmnet with reason: onsite maintenance
[15:39:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2143.codfw.wmnet with reason: onsite maintenance
[15:48:06] <wikibugs>	 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw - https://phabricator.wikimedia.org/T383053#10432897 (10MatthewVernon) Similar errors similarly timestamped on the other two storage nodes ms-be2073 and ms-be2074
[15:51:09] <moritzm>	 !log uploaded openjdk-21 21.0.5+11-1~deb12u1 to apt.wikimedia.org component/jdk21
[15:51:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:17] <jinxer-wm>	 FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[15:55:23] <cdanis>	 !incidents
[15:55:23] <sirenbot>	 5581 (UNACKED)  NELHigh sre (thanos-rule tcp.timed_out)
[15:55:23] <sirenbot>	 5580 (RESOLVED)  NELHigh sre (thanos-rule tcp.timed_out)
[15:55:25] <cdanis>	 !ack 5581
[15:55:25] <sirenbot>	 5581 (ACKED)  NELHigh sre (thanos-rule tcp.timed_out)
[15:55:43] <wikibugs>	 (03CR) 10Isabelle Hurbain-Palatin: "if one of you +1s this I'll schedule backport :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100850 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin)
[16:01:03] <wikibugs>	 (03CR) 10Subramanya Sastry: [C:03+1] Reactivate Parsoid+Kartographer on hewiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100850 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin)
[16:04:01] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751#10432947 (10Jhancock.wm) powered up and both alerts have cleared. Does everything look good on your end? @Marostegui
[16:05:17] <jinxer-wm>	 RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[16:05:54] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751#10432956 (10Marostegui) It looks good, the RAID is rebuilding: ` Slot Number: 2 Firmware state: Rebuild `  And the memory errors have vanished. I think we can close this!  Thank you so much
[16:06:32] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T383064 (10ops-monitoring-bot) 03NEW
[16:06:50] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751#10432966 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm np!
[16:08:54] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T383064#10432980 (10Jhancock.wm) gonna decline this one shortly. it popped up as we were fixing T382751
[16:11:05] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T383064#10432998 (10Marostegui) 05Open→03Declined The RAID is correctly rebuilding as part of T382751
[16:15:23] <wikibugs>	 (03PS2) 10Abijeet Patro: Enable Translate message bundle Scribunto library on MetaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099725 (https://phabricator.wikimedia.org/T379892)
[16:25:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:26:13] <wikibugs>	 (03PS21) 10Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857)
[16:26:45] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2021.codfw.wmnet - https://phabricator.wikimedia.org/T382944#10433081 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[16:28:26] <wikibugs>	 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10433098 (10MatthewVernon)
[16:28:52] <wikibugs>	 (03PS1) 10Btullis: Add pool for the dumps cephfs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108447 (https://phabricator.wikimedia.org/T382490)
[16:29:21] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Add pool for the dumps cephfs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108447 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis)
[16:30:05] <jouncebot>	 jan_drewniak: gettimeofday() says it's time for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250106T1630)
[16:33:17] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add pool for the dumps cephfs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108447 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis)
[16:33:26] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2020.codfw.wmnet - https://phabricator.wikimedia.org/T382945#10433113 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[16:37:04] <wikibugs>	 (03Merged) 10jenkins-bot: Add pool for the dumps cephfs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108447 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis)
[16:37:50] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[16:38:01] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[16:39:44] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[16:39:53] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[16:42:06] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance
[16:42:09] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance
[16:42:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1193 (T370903)', diff saved to https://phabricator.wikimedia.org/P71813 and previous config saved to /var/cache/conftool/dbconfig/20250106-164215-ladsgroup.json
[16:42:19] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[16:42:29] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108451 (https://phabricator.wikimedia.org/T128546)
[16:44:04] <wikibugs>	 (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108451 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[16:44:07] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2022.codfw.wmnet - https://phabricator.wikimedia.org/T382946#10433154 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[16:44:43] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108451 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[16:47:20] <wikibugs>	 (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108452 (https://phabricator.wikimedia.org/T382617)
[16:48:29] <wikibugs>	 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10433181 (10MatthewVernon) All three database files have different checksums, but the same failure of integrity check: ` mvernon@ms-be2073:~$ sqlite3 4077d9...
[16:49:18] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2023.codfw.wmnet - https://phabricator.wikimedia.org/T383026#10433184 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[16:49:50] <wikibugs>	 (03PS1) 10Marostegui: es2024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108453 (https://phabricator.wikimedia.org/T383028)
[16:50:29] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] es2024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108453 (https://phabricator.wikimedia.org/T383028) (owner: 10Marostegui)
[16:51:04] <wikibugs>	 (03CR) 10DDesouza: [C:03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108452 (https://phabricator.wikimedia.org/T382617) (owner: 10DDesouza)
[16:52:35] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108452 (https://phabricator.wikimedia.org/T382617) (owner: 10DDesouza)
[16:52:47] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1252.eqiad.wmnet with reason: host reimage
[16:53:38] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[16:54:10] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[16:54:11] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[16:54:52] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[16:54:54] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[16:54:56] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 10decommission-hardware: decommission dbproxy2004.codfw.wmnet - https://phabricator.wikimedia.org/T382877#10433201 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[16:55:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T370903)', diff saved to https://phabricator.wikimedia.org/P71815 and previous config saved to /var/cache/conftool/dbconfig/20250106-165503-ladsgroup.json
[16:55:06] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[16:55:24] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[16:56:24] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1252.eqiad.wmnet with reason: host reimage
[16:58:07] <logmsgbot>	 !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1108451| Bumping portals to master (T128546)]] (duration: 12m 29s)
[16:58:10] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[17:00:04] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#10433228 (10bd808)
[17:00:51] <logmsgbot>	 !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1108451| Bumping portals to master (T128546)]] (duration: 02m 43s)
[17:05:21] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485#10433256 (10Papaul) The interface on this server is showing 100Mb/s it should be 1000Mb/s  ` es1043:~$ sudo ethtool eno8303 | grep  Speed  Speed: 100Mb/s ` on the switch it self the speed is se...
[17:09:59] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 10decommission-hardware: decommission dbproxy2003.codfw.wmnet - https://phabricator.wikimedia.org/T382875#10433261 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[17:10:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P71816 and previous config saved to /var/cache/conftool/dbconfig/20250106-171010-ladsgroup.json
[17:12:13] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485#10433270 (10Marostegui) Those hosts aren't in production and don't have alerting, so you can proceed as needed whenever you want!
[17:13:01] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 10decommission-hardware: decommission dbproxy2001.codfw.wmnet - https://phabricator.wikimedia.org/T382867#10433273 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[17:13:34] <wikibugs>	 (03PS1) 10David Caro: helm-sudo: use the right binary [puppet] - 10https://gerrit.wikimedia.org/r/1108455
[17:15:18] <logmsgbot>	 !log jayme@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1244.eqiad.wmnet with OS bookworm
[17:15:31] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=1) rolling reimage on P{wikikube-worker[1240-1244].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[17:15:52] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1252.eqiad.wmnet with OS bookworm
[17:15:56] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1250-1252].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[17:16:17] <wikibugs>	 (03CR) 10David Caro: [C:03+2] helm-sudo: use the right binary [puppet] - 10https://gerrit.wikimedia.org/r/1108455 (owner: 10David Caro)
[17:16:40] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 10decommission-hardware: decommission dbproxy2002.codfw.wmnet - https://phabricator.wikimedia.org/T382868#10433299 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[17:16:53] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1244.eqiad.wmnet with OS bookworm
[17:22:10] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2116.codfw.wmnet - https://phabricator.wikimedia.org/T362950#10433344 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[17:25:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P71817 and previous config saved to /var/cache/conftool/dbconfig/20250106-172517-ladsgroup.json
[17:27:17] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2115.codfw.wmnet - https://phabricator.wikimedia.org/T362949#10433403 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[17:28:16] <wikibugs>	 (03CR) 10Herron: "Nice!  Couple of nonblocking questions and thoughts for you inline, mostly about how instance overrides will/could work." [puppet] - 10https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi)
[17:28:22] <wikibugs>	 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10433409 (10DavidEppstein) The two images I was having trouble viewing before are now good. Thanks!
[17:32:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10433433 (10Jhancock.wm) @Andrew checking back on this one. anything i can help with?
[17:35:35] <wikibugs>	 06SRE, 06Traffic: Reveal IP after click only on Varnish error pages - https://phabricator.wikimedia.org/T383062#10433442 (10Lucas_Werkmeister_WMDE) > For non-JavaScript fallback, we can just choose to show or hide the IP completely (Cloudflare chooses the latter).  A [<select> element](https://developer.mozill...
[17:36:59] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1244.eqiad.wmnet with reason: host reimage
[17:38:18] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108456
[17:39:51] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1244.eqiad.wmnet with reason: host reimage
[17:40:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T370903)', diff saved to https://phabricator.wikimedia.org/P71818 and previous config saved to /var/cache/conftool/dbconfig/20250106-174024-ladsgroup.json
[17:40:27] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[17:42:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T381635#10433466 (10Papaul) 05Open→03Resolved a:03Papaul @VRiley-WMF there no more errors on this interface for now we can resolve this task for now if we see any other errors we can swich the cable/transceiv...
[17:44:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - fasw2-c1b-eqiad.mgmt.eqiad - https://phabricator.wikimedia.org/T381543#10433473 (10Papaul) 05Open→03Resolved a:03Papaul
[17:47:06] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100850 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin)
[17:47:06] <wikibugs>	 (03CR) 10CDobbins: P:hardware::check: add profile to check HW configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1102860 (owner: 10CDobbins)
[17:48:53] <wikibugs>	 (03PS1) 10ZhaoFJx: Enable AutoModerator on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108459 (https://phabricator.wikimedia.org/T367306)
[17:51:04] <wikibugs>	 (03PS6) 10CDobbins: Update geo-maps file's US section [dns] - 10https://gerrit.wikimedia.org/r/1097521
[17:58:08] <wikibugs>	 (03CR) 10Kgraessle: [C:03+1] Enable AutoModerator on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108459 (https://phabricator.wikimedia.org/T367306) (owner: 10ZhaoFJx)
[17:59:24] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1244.eqiad.wmnet with OS bookworm
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250106T1800)
[18:00:05] <jouncebot>	 ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250106T1800).
[18:01:46] <wikibugs>	 10ops-eqiad, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T383076 (10phaultfinder) 03NEW
[18:19:11] <wikibugs>	 (03CR) 10CDanis: [C:03+1] Update geo-maps file's US section [dns] - 10https://gerrit.wikimedia.org/r/1097521 (owner: 10CDobbins)
[18:23:52] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Chart] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1108130 (https://phabricator.wikimedia.org/T382042) (owner: 10Jdlrobson)
[18:30:22] <wikibugs>	 (03CR) 10CDobbins: [C:03+2] Update geo-maps file's US section [dns] - 10https://gerrit.wikimedia.org/r/1097521 (owner: 10CDobbins)
[18:31:52] <ChrisDobbins901_>	 !log cdobbins@cumin1002 running authdns-update for CR 1097521
[18:31:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:42] <ChrisDobbins901_>	 !log cdobbins@dns1004 running authdns-update for CR 1097521
[18:32:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:27] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1244.eqiad.wmnet
[18:47:28] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1244.eqiad.wmnet
[18:54:03] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker1243.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1243.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[18:54:32] <wikibugs>	 (03PS22) 10Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857)
[18:58:11] <wikibugs>	 (03CR) 10Jdlrobson: [C:04-1] "Needs to be removed from Vector. It does seem to still be in use there. See registerRequirement call after "Feature: Sticky header" commen" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108135 (https://phabricator.wikimedia.org/T332728) (owner: 10Kimberly Sarabia)
[19:00:21] <wikibugs>	 (03CR) 10Kamila Součková: create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková)
[19:09:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1257-1263].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[19:09:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[19:09:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[19:10:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10433786 (10phaultfinder)
[19:11:51] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1257.eqiad.wmnet with OS bookworm
[19:15:19] <wikibugs>	 06SRE, 06Traffic: Reveal IP after click only on Varnish error pages - https://phabricator.wikimedia.org/T383062#10433802 (10ssingh) p:05Triage→03Medium
[19:20:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10433831 (10phaultfinder)
[19:28:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:32:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1257.eqiad.wmnet with reason: host reimage
[19:36:17] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1257.eqiad.wmnet with reason: host reimage
[19:55:02] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1257.eqiad.wmnet with OS bookworm
[19:55:20] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1258.eqiad.wmnet with OS bookworm
[19:59:52] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] "Looks right to me - While there are some existing arbcom-[lang] entries for wikipedia.org still around comments in the phab task are menti" [dns] - 10https://gerrit.wikimedia.org/r/1108142 (https://phabricator.wikimedia.org/T380119) (owner: 10Zabe)
[20:00:07] <wikibugs>	 (03PS2) 10Zabe: create wikipedia-zh-arbcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1108142 (https://phabricator.wikimedia.org/T380119)
[20:15:56] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1258.eqiad.wmnet with reason: host reimage
[20:19:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1258.eqiad.wmnet with reason: host reimage
[20:22:46] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] create wikipedia-zh-arbcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1108142 (https://phabricator.wikimedia.org/T380119) (owner: 10Zabe)
[20:24:13] <brett>	 !log running authdns-update for CR 1108142
[20:24:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:39:12] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1258.eqiad.wmnet with OS bookworm
[20:41:37] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1259.eqiad.wmnet with OS bookworm
[20:50:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10434189 (10phaultfinder)
[20:54:47] <wikibugs>	 06SRE, 07SRE-Unowned, 10Maps: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10434226 (10bd808)
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250106T2100)
[21:00:05] <jouncebot>	 Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:23] <Jdlrobson>	 o/
[21:00:39] <cjming>	 o/
[21:01:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/Chart] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1108130 (https://phabricator.wikimedia.org/T382042) (owner: 10Jdlrobson)
[21:02:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1259.eqiad.wmnet with reason: host reimage
[21:06:01] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1259.eqiad.wmnet with reason: host reimage
[21:10:49] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE: Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10434381 (10VirginiaPoundstone)
[21:12:10] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1108130|Move logic for type infering to server (T382042)]]
[21:12:13] <stashbot>	 T382042: Remove code duplication relating to type inference - https://phabricator.wikimedia.org/T382042
[21:13:15] <wikibugs>	 (03Merged) 10jenkins-bot: Move logic for type infering to server [extensions/Chart] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1108130 (https://phabricator.wikimedia.org/T382042) (owner: 10Jdlrobson)
[21:16:11] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 06Traffic: Add a rolled-up cache_status field to druid webrequest_sampled_128 - https://phabricator.wikimedia.org/T319344#10434497 (10VirginiaPoundstone)
[21:16:28] <cjming>	 Jdlrobson: on test servers if testable
[21:16:51] <logmsgbot>	 !log cjming@deploy2002 cjming, jdlrobson: Backport for [[gerrit:1108130|Move logic for type infering to server (T382042)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:19:25] <wikibugs>	 SRE, Data-Engineering, Data-Engineering-Radar, Data-Platform-SRE, Infrastructure-Foundations: Reduce Kerberos logs produced by Presto - https://phabricator.wikimedia.org/T353802#10434543 (VirginiaPoundstone)
[21:20:07] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1122926264 and 40 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:20:23] <cjming>	 Jdlrobson: ok to sync?
[21:20:33] <wikibugs>	 SRE, Data-Engineering, Data-Engineering-Radar, Data-Platform-SRE, Event-Platform: Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561#10434561 (VirginiaPoundstone)
[21:21:48] <Jdlrobson>	 cjming: checking now
[21:22:07] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:22:41] <Jdlrobson>	 cjming: good to sync!
[21:22:47] <cjming>	 cool
[21:22:50] <logmsgbot>	 !log cjming@deploy2002 cjming, jdlrobson: Continuing with sync
[21:23:02] <wikibugs>	 SRE, Data-Engineering-Icebox, Traffic: Webrequest x_analtics `wprov` value is incorrectly formatted - https://phabricator.wikimedia.org/T339910#10434633 (VirginiaPoundstone)
[21:23:50] <wikibugs>	 SRE, Data-Engineering-Icebox, Traffic, Trust and Safety Product Team, and 2 others: Include User-Agent Client Hints in WebRequest logs - https://phabricator.wikimedia.org/T337947#10434654 (VirginiaPoundstone)
[21:25:04] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1259.eqiad.wmnet with OS bookworm
[21:25:22] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1260.eqiad.wmnet with OS bookworm
[21:27:30] <wikibugs>	 (PS1) Ladsgroup: beta: Set beta cluster file tables migration to write both [mediawiki-config] - https://gerrit.wikimedia.org/r/1108483 (https://phabricator.wikimedia.org/T383093)
[21:31:41] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1108130|Move logic for type infering to server (T382042)]] (duration: 19m 31s)
[21:31:44] <stashbot>	 T382042: Remove code duplication relating to type inference - https://phabricator.wikimedia.org/T382042
[21:32:01] <cjming>	 Jdlrobson: should be live :)
[21:32:49] <wikibugs>	 (CR) Ladsgroup: [C:+2] beta: Set beta cluster file tables migration to write both [mediawiki-config] - https://gerrit.wikimedia.org/r/1108483 (https://phabricator.wikimedia.org/T383093) (owner: Ladsgroup)
[21:34:24] <wikibugs>	 SRE, Data-Engineering-Icebox, Data-Platform-SRE, Infrastructure-Foundations: Investigate crypto KDC deprecations after Bullseye update - https://phabricator.wikimedia.org/T337544#10434831 (VirginiaPoundstone)
[21:34:54] <wikibugs>	 SRE, Data-Engineering-Icebox, serviceops, Trust-and-Safety: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464#10434838 (VirginiaPoundstone)
[21:39:12] <cjming>	 !log end of UTC late backport window
[21:39:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:04] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10435312 (phaultfinder)
[21:45:40] <wikibugs>	 (PS1) BCornwall: varnish: Hide X-Client-IP on error page by default [puppet] - https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062)
[21:45:44] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1260.eqiad.wmnet with reason: host reimage
[21:46:52] <wikibugs>	 SRE, Traffic, Patch-For-Review: Reveal IP after click only on Varnish error pages - https://phabricator.wikimedia.org/T383062#10435370 (BCornwall) My proposed patch seems the simplest method: Using basic HTML/CSS we can simply portion off any sensitive info into an expandable box. I added a red backg...
[21:49:23] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1260.eqiad.wmnet with reason: host reimage
[21:55:33] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: puppetmaster config in deployment-prep may be inadvertently breaking store,logstash reports? - https://phabricator.wikimedia.org/T218175#10435529 (bd808) Open→Resolved a:bd808 At some point in the long life of this bug we moved to Puppet 7 and a new se...
[21:58:25] <Jdlrobson>	 thanks cjming
[22:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250106T2200).
[22:10:51] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1260.eqiad.wmnet with OS bookworm
[22:13:07] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1261.eqiad.wmnet with OS bookworm
[22:14:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10435554 (phaultfinder)
[22:25:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10435600 (phaultfinder)
[22:33:24] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1261.eqiad.wmnet with reason: host reimage
[22:36:51] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1261.eqiad.wmnet with reason: host reimage
[22:52:03] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 392064752 and 136 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:52:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:54:03] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker1243.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1243.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[22:55:57] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1261.eqiad.wmnet with OS bookworm
[22:56:22] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1262.eqiad.wmnet with OS bookworm
[22:56:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10435735 (phaultfinder)
[22:57:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:59:57] <wikibugs>	 (PS1) Cwhite: logstash: add jvm options for openjdk-17 support [puppet] - https://gerrit.wikimedia.org/r/1108502 (https://phabricator.wikimedia.org/T353912)
[23:00:16] <wikibugs>	 (CR) CI reject: [V:-1] logstash: add jvm options for openjdk-17 support [puppet] - https://gerrit.wikimedia.org/r/1108502 (https://phabricator.wikimedia.org/T353912) (owner: Cwhite)
[23:00:50] <wikibugs>	 (PS2) Cwhite: logstash: add jvm options for openjdk-17 support [puppet] - https://gerrit.wikimedia.org/r/1108502 (https://phabricator.wikimedia.org/T353912)
[23:02:59] <wikibugs>	 (PS1) Cwhite: logstash: sync jruby jvm settings from upstream [puppet] - https://gerrit.wikimedia.org/r/1108504
[23:03:24] <wikibugs>	 (CR) Cwhite: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108502 (https://phabricator.wikimedia.org/T353912) (owner: Cwhite)
[23:07:05] <wikibugs>	 (CR) Cwhite: [C:+2] logstash: add jvm options for openjdk-17 support [puppet] - https://gerrit.wikimedia.org/r/1108502 (https://phabricator.wikimedia.org/T353912) (owner: Cwhite)
[23:09:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[23:09:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[23:16:10] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1262.eqiad.wmnet with reason: host reimage
[23:19:45] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1262.eqiad.wmnet with reason: host reimage
[23:27:35] <wikibugs>	 (PS4) Scott French: mediawiki: add mercurius release generation token [deployment-charts] - https://gerrit.wikimedia.org/r/1105821 (https://phabricator.wikimedia.org/T382630)
[23:28:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:34:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10435799 (phaultfinder)
[23:39:58] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1262.eqiad.wmnet with OS bookworm
[23:41:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1263.eqiad.wmnet with OS bookworm