[00:05:18] FIRING: ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#doc1003.eqiad.wmnet:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:24:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10431216 (phaultfinder)
[00:25:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:19] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1108204
[00:38:19] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1108204 (owner: TrainBranchBot)
[00:47:30] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:55:25] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1108204 (owner: TrainBranchBot)
[01:08:30] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1108205
[01:08:30] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1108205 (owner: TrainBranchBot)
[01:10:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10431221 (phaultfinder)
[01:28:51] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1108205 (owner: TrainBranchBot)
[01:39:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10431225 (phaultfinder)
[01:46:14] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/16c5b546293da1ed2c2ef67102132ed14beb602f4822313386be962e884bf289/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:06:14] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:11:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10431249 (phaultfinder)
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[03:09:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[03:11:27] SRE-swift-storage, MW-on-K8s, serviceops, Shellbox, and 3 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322#10431257 (tstarling) I confirmed that this is working on testwiki.
[03:13:19] (PS3) Tim Starling: Enable canShellboxGetTempUrl everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1101239 (https://phabricator.wikimedia.org/T292322)
[03:14:20] (CR) Tim Starling: [C:+2] Enable canShellboxGetTempUrl everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1101239 (https://phabricator.wikimedia.org/T292322) (owner: Tim Starling)
[03:15:02] (Merged) jenkins-bot: Enable canShellboxGetTempUrl everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1101239 (https://phabricator.wikimedia.org/T292322) (owner: Tim Starling)
[03:17:54] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1101239|Enable canShellboxGetTempUrl everywhere (T292322)]]
[03:17:57] T292322: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322
[03:28:28] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:31:01] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1101239|Enable canShellboxGetTempUrl everywhere (T292322)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[03:31:04] T292322: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322
[03:31:21] !log tstarling@deploy2002 tstarling: Continuing with sync
[03:39:15] FIRING: HttpdUnreachable: httpd unavailable for deployment mw-wikifunctions at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable
[03:39:47] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1101239|Enable canShellboxGetTempUrl everywhere (T292322)]] (duration: 21m 53s)
[03:39:49] T292322: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322
[03:44:15] RESOLVED: HttpdUnreachable: httpd unavailable for deployment mw-wikifunctions at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable
[04:05:18] FIRING: ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#doc1003.eqiad.wmnet:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:25:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:47:30] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:06:30] SRE-swift-storage, MW-on-K8s, serviceops, Shellbox, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322#10431315 (tstarling) Open→Resolved I did a new benchmark with a method following T292322#10402444. The time taken from the job start to ffmpeg...
[05:17:35] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 06 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1107557 (https://phabricator.wikimedia.org/T382777) (owner: Anzx)
[05:46:34] (PS1) Marostegui: Revert "db1193: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1108208
[05:47:48] (CR) Marostegui: [C:+2] Revert "db1193: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1108208 (owner: Marostegui)
[05:50:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 10%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P71785 and previous config saved to /var/cache/conftool/dbconfig/20250106-055029-root.json
[05:55:50] (PS1) Marostegui: instances.yaml: Remove es2021 [puppet] - https://gerrit.wikimedia.org/r/1108209 (https://phabricator.wikimedia.org/T382944)
[05:56:21] (CR) Marostegui: [C:+2] instances.yaml: Remove es2021 [puppet] - https://gerrit.wikimedia.org/r/1108209 (https://phabricator.wikimedia.org/T382944) (owner: Marostegui)
[05:57:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es2021 from dbctl T382944', diff saved to https://phabricator.wikimedia.org/P71786 and previous config saved to /var/cache/conftool/dbconfig/20250106-055726-marostegui.json
[05:57:30] T382944: decommission es2021.codfw.wmnet - https://phabricator.wikimedia.org/T382944
[05:59:29] (PS1) Marostegui: mariadb: Remove es2021 [puppet] - https://gerrit.wikimedia.org/r/1108210 (https://phabricator.wikimedia.org/T382944)
[06:00:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es2021.codfw.wmnet
[06:02:35] (CR) Marostegui: [C:+2] mariadb: Remove es2021 [puppet] - https://gerrit.wikimedia.org/r/1108210 (https://phabricator.wikimedia.org/T382944) (owner: Marostegui)
[06:04:49] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[06:05:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 25%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P71787 and previous config saved to /var/cache/conftool/dbconfig/20250106-060534-root.json
[06:08:08] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2021.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:08:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2021.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:08:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:08:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2021.codfw.wmnet
[06:08:48] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission es2021.codfw.wmnet - https://phabricator.wikimedia.org/T382944#10431338 (Marostegui)
[06:09:10] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission es2021.codfw.wmnet - https://phabricator.wikimedia.org/T382944#10431341 (Marostegui) a:Marostegui→None Ready for #dc-ops
[06:11:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:12:43] (PS1) Marostegui: backup2002.cnf.erb: Replace es2022 with es2043 [puppet] - https://gerrit.wikimedia.org/r/1108309 (https://phabricator.wikimedia.org/T381259)
[06:14:45] (CR) Marostegui: [C:+2] backup2002.cnf.erb: Replace es2022 with es2043 [puppet] - https://gerrit.wikimedia.org/r/1108309 (https://phabricator.wikimedia.org/T381259) (owner: Marostegui)
[06:16:50] (PS1) Marostegui: instances.yaml: Remove es2020 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1108310 (https://phabricator.wikimedia.org/T382945)
[06:17:31] (CR) Marostegui: [C:+2] instances.yaml: Remove es2020 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1108310 (https://phabricator.wikimedia.org/T382945) (owner: Marostegui)
[06:18:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es2020 from dbctl T382945', diff saved to https://phabricator.wikimedia.org/P71789 and previous config saved to /var/cache/conftool/dbconfig/20250106-061832-marostegui.json
[06:18:36] T382945: decommission es2020.codfw.wmnet - https://phabricator.wikimedia.org/T382945
[06:19:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es2020.codfw.wmnet
[06:20:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 50%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P71790 and previous config saved to /var/cache/conftool/dbconfig/20250106-062040-root.json
[06:20:59] (PS1) Marostegui: mariadb: Remove es2020 [puppet] - https://gerrit.wikimedia.org/r/1108311 (https://phabricator.wikimedia.org/T382945)
[06:23:49] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[06:27:13] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2020.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:27:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2020.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:27:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:27:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2020.codfw.wmnet
[06:27:33] (CR) Marostegui: [C:+2] mariadb: Remove es2020 [puppet] - https://gerrit.wikimedia.org/r/1108311 (https://phabricator.wikimedia.org/T382945) (owner: Marostegui)
[06:28:14] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission es2020.codfw.wmnet - https://phabricator.wikimedia.org/T382945#10431362 (Marostegui) a:Marostegui→None
[06:28:35] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission es2020.codfw.wmnet - https://phabricator.wikimedia.org/T382945#10431367 (Marostegui) This is ready for #dc-ops
[06:30:41] (PS1) Marostegui: backup2002.cnf.er: Replace es2025 with es2046 [puppet] - https://gerrit.wikimedia.org/r/1108312 (https://phabricator.wikimedia.org/T381259)
[06:31:09] (PS2) Marostegui: backup2002.cnf.erb: Replace es2025 with es2046 [puppet] - https://gerrit.wikimedia.org/r/1108312 (https://phabricator.wikimedia.org/T381259)
[06:33:33] (CR) Marostegui: [C:+2] backup2002.cnf.erb: Replace es2025 with es2046 [puppet] - https://gerrit.wikimedia.org/r/1108312 (https://phabricator.wikimedia.org/T381259) (owner: Marostegui)
[06:35:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 75%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P71791 and previous config saved to /var/cache/conftool/dbconfig/20250106-063545-root.json
[06:37:43] (PS1) Marostegui: instances.yaml: Remove es2022 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1108313 (https://phabricator.wikimedia.org/T382946)
[06:38:29] (CR) Marostegui: [C:+2] instances.yaml: Remove es2022 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1108313 (https://phabricator.wikimedia.org/T382946) (owner: Marostegui)
[06:39:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es2022 from dbctl T382946', diff saved to https://phabricator.wikimedia.org/P71792 and previous config saved to /var/cache/conftool/dbconfig/20250106-063940-marostegui.json
[06:39:43] T382946: decommission es2022.codfw.wmnet - https://phabricator.wikimedia.org/T382946
[06:45:47] (PS1) Marostegui: mariadb: Remove es2022 [puppet] - https://gerrit.wikimedia.org/r/1108314 (https://phabricator.wikimedia.org/T382946)
[06:46:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es2022.codfw.wmnet
[06:50:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 100%: Repooling after schema change', diff saved to https://phabricator.wikimedia.org/P71794 and previous config saved to /var/cache/conftool/dbconfig/20250106-065050-root.json
[06:51:07] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[06:53:57] (CR) Marostegui: [C:+2] mariadb: Remove es2022 [puppet] - https://gerrit.wikimedia.org/r/1108314 (https://phabricator.wikimedia.org/T382946) (owner: Marostegui)
[06:54:28] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2022.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:54:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2022.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[06:54:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:54:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2022.codfw.wmnet
[06:55:04] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission es2022.codfw.wmnet - https://phabricator.wikimedia.org/T382946#10431388 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1002 for hosts: `es2022.codfw.wmnet` - es2022.codfw.wmnet (**PASS**) - Downtimed...
[06:55:05] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission es2022.codfw.wmnet - https://phabricator.wikimedia.org/T382946#10431389 (Marostegui) a:Marostegui→None
[06:55:15] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission es2022.codfw.wmnet - https://phabricator.wikimedia.org/T382946#10431394 (Marostegui) Ready for #dc-ops
[06:55:24] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission es2022.codfw.wmnet - https://phabricator.wikimedia.org/T382946#10431395 (Marostegui)
[07:04:02] (PS1) Marostegui: dbproxy1028: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1108315 (https://phabricator.wikimedia.org/T368874)
[07:04:50] (PS1) Marostegui: wmnet: Switchover m3-master [dns] - https://gerrit.wikimedia.org/r/1108316 (https://phabricator.wikimedia.org/T368874)
[07:09:25] (CR) Marostegui: [C:+2] "All green in Icinga." [puppet] - https://gerrit.wikimedia.org/r/1108315 (https://phabricator.wikimedia.org/T368874) (owner: Marostegui)
[07:09:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[07:09:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[07:13:22] !log dbmaint Switchover m3 (phabricator) eqiad master dbproxy1020 -> dbproxy1028 T368874
[07:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:24] T368874: Productionize dbproxy102[89] - https://phabricator.wikimedia.org/T368874
[07:13:38] (CR) Marostegui: [C:+2] wmnet: Switchover m3-master [dns] - https://gerrit.wikimedia.org/r/1108316 (https://phabricator.wikimedia.org/T368874) (owner: Marostegui)
[07:28:28] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:29:02] !log installing systemd bugfix updates from Bookworm point release
[07:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:32] SRE-swift-storage, MediaViewer: Icon is not visible and returns an error when attempting to view as a PNG - https://phabricator.wikimedia.org/T383023#10431437 (Aklapper) > * Clicking on the PNG previews displays an error Hmm, for the small rendered preview at https://upload.wikimedia.org/wikipedia/commo...
[07:44:19] SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431453 (Aklapper)
[07:49:51] SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431458 (DavidEppstein) Discussion at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Image_Preview_Issue for https:/...
[07:50:20] SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431463 (Cyberdog958) {F58132792} {F58132795} It looks like not just SVG files are affected as others are having the same problem wit...
[07:55:44] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2053-2054].codfw.wmnet
[07:58:37] I'd be happy to deploy, if needed!
[07:59:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2053-2054].codfw.wmnet
[08:00:05] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250106T0800).
[08:00:05] hubaishan, DreamRimmer, and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:32] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2053.codfw.wmnet with OS bookworm
[08:00:33] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2054.codfw.wmnet with OS bookworm
[08:00:52] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2053
[08:00:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2053
[08:00:53] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2054
[08:00:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2054
[08:02:13] deploy2002 needs an SSH fingerprint published: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/deploy2002.eqiad.wmnet
[08:02:30] I got SHA256:meS3gCKwHzJWtflhVLOotPQVkYEpexjddK6hna5/t/0 , hopefully this is right?
[08:03:49] SRE, Commons: Backend fetch failed - https://phabricator.wikimedia.org/T383013#10431501 (Aklapper)
[08:03:49] o/
[08:04:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2023 T383026', diff saved to https://phabricator.wikimedia.org/P71795 and previous config saved to /var/cache/conftool/dbconfig/20250106-080405-marostegui.json
[08:04:08] T383026: decommission es2023.codfw.wmnet - https://phabricator.wikimedia.org/T383026
[08:04:20] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:05:18] FIRING: ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#doc1003.eqiad.wmnet:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:05:35] (CR) Awight: "Permissions look similar to other wikis. Could add `oathauth-enable` as well, if desired?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1106940 (https://phabricator.wikimedia.org/T382784) (owner: Hubaishan)
[08:05:38] (PS1) Marostegui: es2023: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/1108386 (https://phabricator.wikimedia.org/T383026)
[08:06:06] hubaishan: I'll begin deployment now :-)
[08:06:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2024 T381848', diff saved to https://phabricator.wikimedia.org/P71796 and previous config saved to /var/cache/conftool/dbconfig/20250106-080609-marostegui.json
[08:06:12] T381848: Decommission es202[0-5] - https://phabricator.wikimedia.org/T381848
[08:06:32] (CR) Marostegui: [C:+2] es2023: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/1108386 (https://phabricator.wikimedia.org/T383026) (owner: Marostegui)
[08:06:45] (CR) TrainBranchBot: [C:+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1106940 (https://phabricator.wikimedia.org/T382784) (owner: Hubaishan)
[08:07:24] awight: are you running the backports this morning?
[08:07:27] (Merged) jenkins-bot: [arwiki] Add templateeditor user group [mediawiki-config] - https://gerrit.wikimedia.org/r/1106940 (https://phabricator.wikimedia.org/T382784) (owner: Hubaishan)
[08:07:50] hashar: yes :-)
[08:07:50] !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1106940|[arwiki] Add templateeditor user group (T382784)]]
[08:07:50] looks like :-]
[08:07:53] T382784: arwiki: create "templateeditor" user group and protection level - https://phabricator.wikimedia.org/T382784
[08:07:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es2023 from dbctl and promote es2046 to es5 master T381848 T383026', diff saved to https://phabricator.wikimedia.org/P71797 and previous config saved to /var/cache/conftool/dbconfig/20250106-080755-marostegui.json
[08:08:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2024 and es2025 T381848', diff saved to https://phabricator.wikimedia.org/P71798 and previous config saved to /var/cache/conftool/dbconfig/20250106-080845-marostegui.json
[08:10:02] (PS1) Marostegui: es2023: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1108388 (https://phabricator.wikimedia.org/T383026)
[08:10:11] hashar: let me know if the process has changed lately, though? I'm randomly jumping in on a quiet morning...
[08:10:32] I am not aware of any change
[08:10:48] (CR) Marostegui: [C:+2] es2023: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1108388 (https://phabricator.wikimedia.org/T383026) (owner: Marostegui)
[08:10:50] I am asking cause I woke up only 20 minutes ago (I have a bad cough :( )
[08:11:32] SRE, Wikidata, Wikidata Dev Team, Performance Issue: Frequent 500 Errors and Timeouts When Adding Statements to New Properties - https://phabricator.wikimedia.org/T374230#10431535 (Bugreporter)
[08:11:56] hashar: sorry to hear it! My morning was rough as well, vacation was long but not particularly relaxing ;-)
[08:13:32] !log installing fastnetmon security updates
[08:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:58] !log awight@deploy2002 awight, hubaishan: Backport for [[gerrit:1106940|[arwiki] Add templateeditor user group (T382784)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:14:01] T382784: arwiki: create "templateeditor" user group and protection level - https://phabricator.wikimedia.org/T382784
[08:14:26] hubaishan: DreamRimmer: please test the templateeditor patch on mwdebug servers
[08:15:42] !log restarting blazegraph on wdqs1014 (BlazegraphFreeAllocatorsDecreasingRapidly)
[08:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:05] It is OK.
[08:17:19] ty!
[08:17:22] !log awight@deploy2002 awight, hubaishan: Continuing with sync
[08:18:23] !log restarting blazegraph on wdqs1012 (stuck with high thread count)
[08:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:27] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2054.codfw.wmnet with reason: host reimage
[08:18:45] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2053.codfw.wmnet with reason: host reimage
[08:19:06] hashar: hmm, "sync-testservers-k8s" takes 4 minutes, and sync-masters 7 seconds. Should we look into the test server slowness or is this a known / intentional thing?
[08:21:11] 4 minutes sounds normal?
[08:21:28] cause that is a helm deployment
[08:21:30] (PS1) Marostegui: wmnet: Update es5-master CNAME [dns] - https://gerrit.wikimedia.org/r/1108389 (https://phabricator.wikimedia.org/T381848)
[08:21:43] the sync-masters is fast cause that is syncing the baremetal spare deployment server
[08:21:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:22:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2054.codfw.wmnet with reason: host reimage
[08:22:26] and there is nothing new to sync beside the config file(s) affected by your change
[08:22:29] hashar: kk thanks it makes sense
[08:22:30] RESOLVED: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:22:38] !log awight@deploy2002 Finished scap sync-world: Backport for [[gerrit:1106940|[arwiki] Add templateeditor user group (T382784)]] (duration: 14m 48s)
[08:22:41] T382784: arwiki: create "templateeditor" user group and protection level - https://phabricator.wikimedia.org/T382784
[08:22:53] I think the first sync on monday might take a while if we had some l10n updates received
[08:23:14] (PS3) Awight: Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1106018 (https://phabricator.wikimedia.org/T382649) (owner: Dreamrimmer)
[08:23:17] (CR) Marostegui: [C:+2] wmnet: Update es5-master CNAME [dns] - https://gerrit.wikimedia.org/r/1108389 (https://phabricator.wikimedia.org/T381848) (owner: Marostegui)
[08:23:33] (CR) TrainBranchBot: [C:+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1106018 (https://phabricator.wikimedia.org/T382649) (owner: Dreamrimmer)
[08:24:15] (Merged) jenkins-bot: Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1106018 (https://phabricator.wikimedia.org/T382649) (owner: Dreamrimmer)
[08:24:33] !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1106018|Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 (T382649)]]
[08:24:36] T382649: Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 - https://phabricator.wikimedia.org/T382649
[08:24:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[08:25:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:25:58] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1014:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:25:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2053.codfw.wmnet with reason: host reimage
[08:29:12] !log awight@deploy2002 awight, dreamrimmer: Backport for [[gerrit:1106018|Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 (T382649)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:30:36] DreamRimmer: Please check the ptwikinews license change. (I'm not seeing where this config is surfaced in the UI, fwiw)
[08:30:59] checking
[08:33:29] looks good to me
[08:35:26] (CR) Awight: "Looks right, but I can't find anywhere that this text will be surfaced. The Collection extension uses the config, but interestingly I'm f" [mediawiki-config] - https://gerrit.wikimedia.org/r/1106018 (https://phabricator.wikimedia.org/T382649) (owner: Dreamrimmer)
[08:36:30] !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1106018|Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 (T382649)]]
[08:36:33] T382649: Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 - https://phabricator.wikimedia.org/T382649
[08:36:40] DreamRimmer: thanks--unfortunately I need to restart that deployment.
[08:37:02] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1240-1244].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [08:37:20] no problem [08:37:54] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1245-1249].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [08:38:46] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1240.eqiad.wmnet with OS bookworm [08:39:36] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1245.eqiad.wmnet with OS bookworm [08:41:04] !log awight@deploy2002 dreamrimmer, awight: Backport for [[gerrit:1106018|Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 (T382649)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:41:12] !log awight@deploy2002 dreamrimmer, awight: Continuing with sync [08:41:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2054.codfw.wmnet with OS bookworm [08:42:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2053.codfw.wmnet with OS bookworm [08:43:07] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2053-2054].codfw.wmnet [08:43:08] !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[2053-2054].codfw.wmnet [08:44:20] (03CR) 10Awight: "This one seems odd because it's redundant with the default. Are you sure we need it?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: 10Dreamrimmer) [08:44:26] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2053-2054].codfw.wmnet [08:44:27] !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[2053-2054].codfw.wmnet [08:44:30] DreamRimmer: left a question for you on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1106911 [08:45:26] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:45:59] !log awight@deploy2002 Finished scap sync-world: Backport for [[gerrit:1106018|Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 (T382649)]] (duration: 09m 28s) [08:46:01] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2054.codfw.wmnet [08:46:02] T382649: Change license on ptwikinews, nlwikinews and rowikinews to cc-by-4.0 - https://phabricator.wikimedia.org/T382649 [08:46:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2054.codfw.wmnet [08:46:12] DreamRimmer: I'll leave that aside for a moment... 
[08:46:16] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2053.codfw.wmnet [08:46:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2053.codfw.wmnet [08:47:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107557 (https://phabricator.wikimedia.org/T382777) (owner: 10Anzx) [08:47:48] PROBLEM - Disk space on rpki2003 is CRITICAL: DISK CRITICAL - free space: /var/lib/routinator/repository 163MiB (3% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=rpki2003&var-datasource=codfw+prometheus/ops [08:50:20] (03Merged) 10jenkins-bot: bjnwikiquote: add wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107557 (https://phabricator.wikimedia.org/T382777) (owner: 10Anzx) [08:50:39] !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1107557|bjnwikiquote: add wordmark (T382777)]] [08:50:42] T382777: Request for Implementation of the Wikiquote Banjar wordmark for bjn.wikiquote.org - https://phabricator.wikimedia.org/T382777 [08:55:27] !log awight@deploy2002 awight, anzx: Backport for [[gerrit:1107557|bjnwikiquote: add wordmark (T382777)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:55:35] awight: already see logo change, good to sync [08:57:04] anzx: thanks! [08:57:07] !log awight@deploy2002 awight, anzx: Continuing with sync [08:57:49] (03PS2) 10Dreamrimmer: Update French wikinews license to CC-BY-SA 4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) [08:58:59] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1240.eqiad.wmnet with reason: host reimage [08:59:52] (03CR) 10Awight: "PS 2 includes an unrelated change..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: 10Dreamrimmer) [09:00:07] awight: please purge wordmark post sync https://www.irccloud.com/pastebin/r3NQH9OL [09:00:11] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1245.eqiad.wmnet with reason: host reimage [09:01:53] (03CR) 10Dreamrimmer: "I will fix it and do it in the next backport." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: 10Dreamrimmer) [09:02:22] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1240.eqiad.wmnet with reason: host reimage [09:02:39] !log awight@deploy2002 Finished scap sync-world: Backport for [[gerrit:1107557|bjnwikiquote: add wordmark (T382777)]] (duration: 11m 59s) [09:02:42] T382777: Request for Implementation of the Wikiquote Banjar wordmark for bjn.wikiquote.org - https://phabricator.wikimedia.org/T382777 [09:03:21] !log UTC morning deployment finished [09:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:14] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2051-2052].codfw.wmnet [09:04:29] awight: thank you for deploying [09:04:57] (03CR) 10Brouberol: [C:03+1] "LG!" [puppet] - 10https://gerrit.wikimedia.org/r/1108089 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis) [09:05:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2051-2052].codfw.wmnet [09:05:42] gladly! [09:06:03] (03CR) 10Brouberol: [C:03+1] "LG!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1108090 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis) [09:06:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1245.eqiad.wmnet with reason: host reimage [09:06:14] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2051.codfw.wmnet with OS bookworm [09:06:15] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2052.codfw.wmnet with OS bookworm [09:06:34] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2051 [09:06:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2051 [09:06:34] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2052 [09:06:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2052 [09:10:19] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:10:31] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:18:37] (03CR) 10Muehlenhoff: [C:03+2] planet_sync: Cleanup time handling [puppet] - 10https://gerrit.wikimedia.org/r/1105875 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:21:17] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1240.eqiad.wmnet with OS bookworm [09:23:05] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1241.eqiad.wmnet with OS bookworm [09:23:35] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
wikikube-worker2051.codfw.wmnet with reason: host reimage [09:24:11] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2052.codfw.wmnet with reason: host reimage [09:25:10] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1245.eqiad.wmnet with OS bookworm [09:25:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db2160.codfw.wmnet with reason: upgrade kernel [09:25:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2160.codfw.wmnet with reason: upgrade kernel [09:26:37] !log Reboot db2160 for kernel upgrade T376905 [09:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:55] !log depooling wdqs1012 (high lag, forgot to keep it depooled after restarting blazegraph) [09:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2051.codfw.wmnet with reason: host reimage [09:29:30] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1246.eqiad.wmnet with OS bookworm [09:29:58] (03CR) 10Muehlenhoff: [C:03+2] planet_sync: Remove obsolete options [puppet] - 10https://gerrit.wikimedia.org/r/1105876 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:30:08] (03PS2) 10Muehlenhoff: planet_sync: Remove obsolete options [puppet] - 10https://gerrit.wikimedia.org/r/1105876 (https://phabricator.wikimedia.org/T381565) [09:31:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2052.codfw.wmnet with reason: host reimage [09:31:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:33:42] (03CR) 10Muehlenhoff: [C:03+2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105876 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:35:59] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] planet_sync: Remove obsolete options [puppet] - 10https://gerrit.wikimedia.org/r/1105876 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:37:35] (03PS5) 10Muehlenhoff: Remove obsolete puppetmaster::certmanager class [puppet] - 10https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798) [09:39:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:41:10] !log repooling wdqs1012 [09:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:38] (03Abandoned) 10Dreamrimmer: Update French wikinews license to CC-BY-SA 4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: 10Dreamrimmer) [09:43:34] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1241.eqiad.wmnet with reason: host reimage [09:46:23] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:46:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2051.codfw.wmnet with OS bookworm [09:47:27] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1241.eqiad.wmnet with reason: host reimage [09:48:08] 10SRE-swift-storage, 06Commons: 
Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10431764 (10Aklapper) Cannot reproduce from Central Europe; works as expected here. What's the exact output (except for your IP) if you try to directly access the thum... [09:50:04] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431768 (10MatthewVernon) I went looking at swift, and e.g. the Buick thumbnail is correct (and identical) in both clusters. [09:50:13] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1246.eqiad.wmnet with reason: host reimage [09:50:43] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:51:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2052.codfw.wmnet with OS bookworm [09:52:13] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2051.codfw.wmnet [09:52:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2051.codfw.wmnet [09:52:27] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2052.codfw.wmnet [09:52:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2052.codfw.wmnet [09:53:28] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2049-2050].codfw.wmnet [09:54:05] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1246.eqiad.wmnet with reason: host reimage [09:54:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host 
wikikube-worker[2049-2050].codfw.wmnet [09:55:16] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2049.codfw.wmnet with OS bookworm [09:55:16] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2050.codfw.wmnet with OS bookworm [09:55:28] 10SRE-swift-storage, 06Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10431790 (10MatthewVernon) FWIW, this thumb is in both swift clusters: ` root@ms-fe2009:/home/mvernon# swift stat wikipedia-commons-local-thumb.f8 'f/f8/Apostolic_Nunci... [09:55:35] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2049 [09:55:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2049 [09:55:36] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2050 [09:55:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2050 [09:56:46] (03CR) 10Brouberol: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1103313 (owner: 10Muehlenhoff) [09:57:39] 10SRE-swift-storage, 06Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10431815 (10Aklapper) See also {T383023} which is a bit similar. [10:02:12] 10SRE-swift-storage, 06Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10431827 (10MatthewVernon) Yeah, I've seen that (I see all the swift-tagged tickets, lucky me), I'll comment there as well. 
[10:03:09] (03CR) 10Marostegui: ParserCache: Set connect and recieve timeouts (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) (owner: 10Ladsgroup) [10:05:52] PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:06:13] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431831 (10MatthewVernon) Showing my working, again these are not new thumbs: ` root@ms-fe2009:/home/mvernon# swift stat wikipedia-commo... [10:06:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1241.eqiad.wmnet with OS bookworm [10:06:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance [10:06:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance [10:06:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:06:50] !log Deploying admin_ng external services changes on all kubernetes clusters [10:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:07:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T371742)', diff saved to https://phabricator.wikimedia.org/P71801 and previous config saved to 
/var/cache/conftool/dbconfig/20250106-100706-ladsgroup.json [10:07:09] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [10:07:11] !log cgoubert@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:07:41] !log cgoubert@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [10:08:04] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [10:08:05] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1242.eqiad.wmnet with OS bookworm [10:08:48] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [10:09:03] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:09:33] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:09:49] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [10:10:36] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:10:53] !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:11:39] !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:11:55] !log cgoubert@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [10:12:22] !log cgoubert@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:12:41] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:13:05] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2050.codfw.wmnet with reason: host reimage [10:13:17] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:13:25] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[10:13:36] (03Restored) 10Dreamrimmer: Update French wikinews license to CC-BY-SA 4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: 10Dreamrimmer) [10:13:39] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:13:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1246.eqiad.wmnet with OS bookworm [10:15:10] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2049.codfw.wmnet with reason: host reimage [10:15:27] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1247.eqiad.wmnet with OS bookworm [10:16:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2050.codfw.wmnet with reason: host reimage [10:16:56] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10431868 (10Ladsgroup) The script is done with 0f: ` root@ms-fe2009:~# swift stat -v --lh wikipedia-commons-local-thumb.0f URL: http://ms-fe.svc.codfw.wmnet/v1/AUTH... [10:16:59] (03Abandoned) 10Hashar: dockerpkg-builder: add to docker group [puppet] - 10https://gerrit.wikimedia.org/r/1105449 (https://phabricator.wikimedia.org/T382285) (owner: 10Brennen Bearnes) [10:18:57] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431872 (10Cyberdog958) The discussion was a little hard to understand, but it looks like they were talking about this thumbnail: https:... [10:19:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2049.codfw.wmnet with reason: host reimage [10:21:32] (03CR) 10Brouberol: [C:03+1] "LGTM thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1108041 (owner: 10Muehlenhoff) [10:21:50] (03PS1) 10Marostegui: dbproxy1020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108405 (https://phabricator.wikimedia.org/T383025) [10:22:17] (03CR) 10Marostegui: [C:03+2] dbproxy1020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108405 (https://phabricator.wikimedia.org/T383025) (owner: 10Marostegui) [10:22:32] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:22:55] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431894 (10MatthewVernon) So that's https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Buick_Regal_2_--_10-30-2009.jpg/280px-Buic... [10:24:09] (03CR) 10Ladsgroup: ParserCache: Set connect and recieve timeouts (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) (owner: 10Ladsgroup) [10:24:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10431897 (10phaultfinder) [10:25:28] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431900 (10Cyberdog958) {F58133216} No I get the same error on both my main computer and my phone. 
[10:25:32] (03PS3) 10Dreamrimmer: Update French wikinews license to CC-BY-SA 4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) [10:26:13] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "No mentions of the flag in wmf.8 code, which is fully deployed by now:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100083 (https://phabricator.wikimedia.org/T333667) (owner: 10Arthur taylor) [10:26:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100083 (https://phabricator.wikimedia.org/T333667) (owner: 10Arthur taylor) [10:28:36] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1242.eqiad.wmnet with reason: host reimage [10:32:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1242.eqiad.wmnet with reason: host reimage [10:35:33] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:35:34] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1247.eqiad.wmnet with reason: host reimage [10:36:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2050.codfw.wmnet with OS bookworm [10:37:32] (03CR) 10Marostegui: "Yeah, let's go for 5 now and then we can see if we need further adjustments." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) (owner: 10Ladsgroup) [10:39:01] RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:39:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2049.codfw.wmnet with OS bookworm [10:39:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1247.eqiad.wmnet with reason: host reimage [10:39:38] (03PS2) 10Ladsgroup: ParserCache: Set connect and recieve timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) [10:39:51] (03CR) 10Ladsgroup: ParserCache: Set connect and recieve timeouts (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) (owner: 10Ladsgroup) [10:40:10] (03CR) 10Marostegui: [C:03+1] "Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) (owner: 10Ladsgroup) [10:41:03] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431941 (10MatthewVernon) Can you get your browser's developer tools option to dump the request, please? That should give us the HTTP st... [10:42:23] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10431944 (10MatthewVernon) (if you're using chrome, I think [[ https://stackoverflow.com/questions/4423061/how-can-i-view-http-headers-in... 
[10:49:01] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2049.codfw.wmnet [10:49:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2049.codfw.wmnet [10:49:33] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2050.codfw.wmnet [10:49:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2050.codfw.wmnet [10:50:29] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10432010 (10Cyberdog958) HTTP/2 401 content-type: text/html; charset=UTF-8 content-length: 131 www-authenticate: Swift realm="AUTH_mw" d... [10:51:07] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10432011 (10Cyberdog958) I'm using firefox but that's what it spit out. [10:51:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1242.eqiad.wmnet with OS bookworm [10:51:36] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10432013 (10MatthewVernon) Hm, this isn't correct ` root@ms-fe2009:/home/mvernon# swift stat wikipedia-commons-local-thumb.f8 Container '... 
[10:52:50] (03CR) 10Btullis: [V:03+1 C:03+2] Add caps to allow ceph-csi-cephfs to work with the dumps filesystem [puppet] - 10https://gerrit.wikimedia.org/r/1108089 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis) [10:52:58] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1243.eqiad.wmnet with OS bookworm [10:54:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10432018 (10phaultfinder) [10:55:43] (03CR) 10Btullis: [C:03+2] Add a storageclass for the dumps file system (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108090 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis) [10:57:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:58:26] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1247.eqiad.wmnet with OS bookworm [10:59:51] (03Merged) 10jenkins-bot: Add a storageclass for the dumps file system [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108090 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250106T1100) [11:00:16] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1248.eqiad.wmnet with OS bookworm [11:06:43] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2046,2048].codfw.wmnet [11:07:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2046,2048].codfw.wmnet [11:08:35] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2048.codfw.wmnet with OS bookworm [11:08:36] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker2046.codfw.wmnet with OS bookworm [11:08:54] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2048 [11:08:55] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2046 [11:08:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2046 [11:08:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2048 [11:09:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [11:09:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [11:12:32] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:12:48] PROBLEM - BGP status on lsw1-a8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:16:13] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:16:22] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[11:18:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10432068 (10phaultfinder) [11:20:33] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1248.eqiad.wmnet with reason: host reimage [11:24:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1248.eqiad.wmnet with reason: host reimage [11:26:36] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2048.codfw.wmnet with reason: host reimage [11:28:20] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2046.codfw.wmnet with reason: host reimage [11:28:28] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:31:01] 10SRE-tools, 06Infrastructure-Foundations: debmonitor: show OS release name in the host view - https://phabricator.wikimedia.org/T240193#10432093 (10hashar) 05Invalid→03Resolved a:03elukey [11:32:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2048.codfw.wmnet with reason: host reimage [11:33:19] (03PS1) 10Btullis: Enable cephfs volumes for mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108411 (https://phabricator.wikimedia.org/T382490) [11:35:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2046.codfw.wmnet with reason: host reimage [11:45:29] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1248.eqiad.wmnet with OS bookworm [11:47:15] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1249.eqiad.wmnet with OS bookworm [11:47:29] 
RECOVERY - Host doc2002 is UP: PING OK - Packet loss = 0%, RTA = 30.64 ms [11:47:54] !log fix /etc/network/interfaces on doc2002 T382610 [11:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:57] T382610: Low disk space: doc1003 / doc2002 - https://phabricator.wikimedia.org/T382610 [11:48:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T371742)', diff saved to https://phabricator.wikimedia.org/P71803 and previous config saved to /var/cache/conftool/dbconfig/20250106-114844-ladsgroup.json [11:48:47] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:50:18] RESOLVED: ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#doc1003.eqiad.wmnet:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:33] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:52:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2048.codfw.wmnet with OS bookworm [11:54:57] RECOVERY - BGP status on lsw1-a8-codfw.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:55:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2046.codfw.wmnet with OS bookworm [11:56:03] (03PS1) 10Marostegui: mariadb: Remove es2023 [puppet] - 10https://gerrit.wikimedia.org/r/1108412 (https://phabricator.wikimedia.org/T383026) [11:56:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es2023.codfw.wmnet [11:57:38] (03CR) 10Marostegui: [C:03+2] mariadb: Remove es2023 
[puppet] - 10https://gerrit.wikimedia.org/r/1108412 (https://phabricator.wikimedia.org/T383026) (owner: 10Marostegui) [12:01:19] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [12:03:28] 10SRE-swift-storage, 06Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10432151 (10Yiming) >>! In T383034#10431764, @Aklapper wrote: > Cannot reproduce from Central Europe; works as expected here. > > What's the exact output (except for your IP... [12:03:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P71804 and previous config saved to /var/cache/conftool/dbconfig/20250106-120351-ladsgroup.json [12:04:40] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2023.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [12:04:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2023.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [12:04:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:04:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2023.codfw.wmnet [12:06:02] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2023.codfw.wmnet - https://phabricator.wikimedia.org/T383026#10432154 (10Marostegui) a:05Marostegui→03None [12:06:07] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2023.codfw.wmnet - https://phabricator.wikimedia.org/T383026#10432159 (10Marostegui) This is ready for #dc-ops [12:07:25] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1249.eqiad.wmnet with reason:
host reimage [12:11:01] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1249.eqiad.wmnet with reason: host reimage [12:14:06] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1243.eqiad.wmnet with OS bookworm [12:15:46] 10SRE-swift-storage, 06Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10432175 (10ZhaoFJx) Yiming discussed with me, and I just want to say that image can be opened in North America for me [12:18:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P71805 and previous config saved to /var/cache/conftool/dbconfig/20250106-121858-ladsgroup.json [12:25:10] (03PS1) 10Jon Harald Søby: Add bfw, gju-arab, gju-deva, hoc and kgg to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108403 (https://phabricator.wikimedia.org/T381934) [12:25:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:25:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108403 (https://phabricator.wikimedia.org/T381934) (owner: 10Jon Harald Søby) [12:27:06] (03CR) 10Brouberol: [C:03+1] Enable cephfs volumes for mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108411 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis) [12:27:29] !log swift post wikipedia-commons-local-thumb.f8 --read-acl 'mw:thumbor,mw:media,.r:*' --write-acl 
'mw:thumbor,mw:media' ms-fe2009 per T383034 [12:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:32] T383034: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034 [12:28:39] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10432181 (10MatthewVernon) @Cyberdog958 I think this should be resolved now. Can you try again, please? [12:30:21] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1249.eqiad.wmnet with OS bookworm [12:30:24] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1245-1249].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [12:34:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T371742)', diff saved to https://phabricator.wikimedia.org/P71806 and previous config saved to /var/cache/conftool/dbconfig/20250106-123405-ladsgroup.json [12:34:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance [12:34:08] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:34:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance [12:34:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1193 (T371742)', diff saved to https://phabricator.wikimedia.org/P71807 and previous config saved to /var/cache/conftool/dbconfig/20250106-123416-ladsgroup.json [12:34:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10432189 (10phaultfinder) [12:35:42] (03PS1) 10Muehlenhoff: Remove obsolete 
puppetmaster::standalone role [puppet] - 10https://gerrit.wikimedia.org/r/1108428 (https://phabricator.wikimedia.org/T351452) [12:37:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db1217.eqiad.wmnet with reason: upgrade kernel [12:37:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1217.eqiad.wmnet with reason: upgrade kernel [12:38:32] jouncebot: nowandnext [12:38:33] No deployments scheduled for the next 1 hour(s) and 21 minute(s) [12:38:33] In 1 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250106T1400) [12:39:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) (owner: 10Ladsgroup) [12:40:37] (03Merged) 10jenkins-bot: ParserCache: Set connect and recieve timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108141 (https://phabricator.wikimedia.org/T378076) (owner: 10Ladsgroup) [12:40:55] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1108141|ParserCache: Set connect and recieve timeouts (T378076 T373037)]] [12:40:59] T378076: Parsercache issues in codfw causing large-scale outage - https://phabricator.wikimedia.org/T378076 [12:40:59] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [12:41:01] PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:41:01] PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:41:17] PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:41:17] PROBLEM - haproxy failover on 
dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:41:31] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:41:33] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:41:45] PROBLEM - haproxy failover on dbproxy1028 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:41:47] PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:41:55] ^ expected [12:43:01] ack [12:43:17] RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:43:17] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:43:31] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:43:33] RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:43:45] RECOVERY - haproxy failover on dbproxy1028 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:43:47] RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:44:01] RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:44:01] RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:46:34] !log ladsgroup@deploy2002 ladsgroup: Backport for 
[[gerrit:1108141|ParserCache: Set connect and recieve timeouts (T378076 T373037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:46:38] T378076: Parsercache issues in codfw causing large-scale outage - https://phabricator.wikimedia.org/T378076 [12:46:38] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [12:48:26] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:51:28] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2048.codfw.wmnet [12:51:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2048.codfw.wmnet [12:51:38] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2046.codfw.wmnet [12:51:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2046.codfw.wmnet [12:52:33] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2044-2045].codfw.wmnet [12:53:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2044-2045].codfw.wmnet [12:54:34] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1108141|ParserCache: Set connect and recieve timeouts (T378076 T373037)]] (duration: 13m 39s) [12:54:38] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2045.codfw.wmnet with OS bookworm [12:54:38] T378076: Parsercache issues in codfw causing large-scale outage - https://phabricator.wikimedia.org/T378076 [12:54:38] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [12:54:44] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2044.codfw.wmnet with OS bookworm [12:54:57] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2045 [12:54:58] !log 
jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2045 [12:55:02] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2044 [12:55:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2044 [12:57:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108428 (https://phabricator.wikimedia.org/T351452) (owner: 10Muehlenhoff) [12:58:39] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:58:39] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:59:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10432238 (10phaultfinder) [13:04:30] 10SRE-swift-storage, 06Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10432256 (10Yiming) Update: I also found that https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Zh-wikipedia-200611121821.png/104px-Zh-wikipedia-200611121821.p... 
[13:07:11] (03PS1) 10Muehlenhoff: Remove obsolete WMCS Puppet 5 master classes no longer used/needed [puppet] - 10https://gerrit.wikimedia.org/r/1108430 (https://phabricator.wikimedia.org/T365798) [13:08:09] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108430 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [13:11:57] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2044.codfw.wmnet with reason: host reimage [13:12:51] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2045.codfw.wmnet with reason: host reimage [13:13:15] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1243.eqiad.wmnet with OS bookworm [13:14:13] (03PS1) 10Muehlenhoff: Remove one additional obsolete Puppet 5 for Cloud VPS class [puppet] - 10https://gerrit.wikimedia.org/r/1108431 (https://phabricator.wikimedia.org/T365798) [13:15:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2044.codfw.wmnet with reason: host reimage [13:17:59] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10432320 (10Cyberdog958) Yes it is now working on all my devices. Thanks for the fix. 
[13:18:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2045.codfw.wmnet with reason: host reimage [13:20:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108431 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [13:25:50] 10ops-eqiad, 06collaboration-services, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Comm Error: backplane 0 when reimaging wikikube-worker1243 - https://phabricator.wikimedia.org/T383051 (10JMeybohm) 03NEW [13:29:01] 10ops-eqiad, 06collaboration-services, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Comm Error: backplane 0 when reimaging wikikube-worker1243 - https://phabricator.wikimedia.org/T383051#10432351 (10JMeybohm) The following commands have to be executed when the host is back (just noting it down so I don't for... [13:34:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2044.codfw.wmnet with OS bookworm [13:34:39] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:37:15] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:38:02] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for reimaging wikikube-worker1243 - https://phabricator.wikimedia.org/T383051#10432376 (10JMeybohm) [13:38:10] (03CR) 10Btullis: [C:03+2] Enable cephfs volumes for mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108411 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis) 
[13:38:41] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:39:20] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2045.codfw.wmnet with OS bookworm [13:39:45] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for reimaging wikikube-worker1243 - https://phabricator.wikimedia.org/T383051#10432382 (10JMeybohm) a:03Jclark-ctr [13:40:15] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243 - https://phabricator.wikimedia.org/T383051#10432384 (10JMeybohm) [13:40:20] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2045.codfw.wmnet [13:40:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2045.codfw.wmnet [13:40:34] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2044.codfw.wmnet [13:40:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2044.codfw.wmnet [13:41:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1057 - https://phabricator.wikimedia.org/T381676#10432390 (10JMeybohm) a:03Jclark-ctr [13:41:27] (03Merged) 10jenkins-bot: Enable cephfs volumes for mediawiki-dumps-legacy namespace [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1108411 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis) [13:42:08] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T383051#10432395 (10JMeybohm) [13:42:11] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 2 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4731/" [puppet] - 10https://gerrit.wikimedia.org/r/1108112 (https://phabricator.wikimedia.org/T382964) (owner: 10AOkoth) [13:42:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1057.eqiad.wmnet - https://phabricator.wikimedia.org/T381676#10432397 (10JMeybohm) [13:43:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T381770#10432399 (10JMeybohm) a:03Jclark-ctr [13:44:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10432405 (10JMeybohm) [13:44:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10432407 (10JMeybohm) a:03Jclark-ctr [13:44:55] (03CR) 10Jelto: [C:04-1] "this will create two blackbox checks, one in eqiad and one in codfw both probing `doc.wikimedia.org`. 
The blackbox check should be gated b" [puppet] - 10https://gerrit.wikimedia.org/r/1108112 (https://phabricator.wikimedia.org/T382964) (owner: 10AOkoth) [13:44:56] 10SRE-swift-storage, 06Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10432408 (10MatthewVernon) @Yiming no, that's a different problem - you're getting throttled because of repeated thumbnail generation failures for that file. Which is b... [13:45:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1081.eqiad.wmnet - https://phabricator.wikimedia.org/T381878#10432410 (10JMeybohm) [13:46:07] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2042-2043].codfw.wmnet [13:47:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2042-2043].codfw.wmnet [13:47:29] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1250-1252].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [13:47:39] 06SRE, 10ChangeProp, 10EventStreams, 10Recommendation-API, and 2 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750#10432418 (10Jdforrester-WMF) [13:49:23] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2043.codfw.wmnet with OS bookworm [13:49:24] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2042.codfw.wmnet with OS bookworm [13:49:42] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2043 [13:49:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2043 [13:49:44] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2042 [13:49:44] !log jelto@cumin1002 END (PASS) - Cookbook 
sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2042 [13:51:45] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1250.eqiad.wmnet with OS bookworm [13:53:09] PROBLEM - BGP status on lsw1-a8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:53:41] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:56:00] (03PS6) 10Jforrester: ExtensionDistributor: Mark 1.43 as stable; remove 1.41 as EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106038 (https://phabricator.wikimedia.org/T372331) (owner: 10MacFan4000) [13:56:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106038 (https://phabricator.wikimedia.org/T372331) (owner: 10MacFan4000) [13:58:32] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw - https://phabricator.wikimedia.org/T383053 (10MatthewVernon) 03NEW [13:58:38] (03CR) 10Jforrester: Update French wikinews license to CC-BY-SA 4.0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: 10Dreamrimmer) [13:59:10] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw - https://phabricator.wikimedia.org/T383053#10432447 (10MatthewVernon) [13:59:11] 10SRE-swift-storage, 06Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10432448 (10MatthewVernon) [13:59:12] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail 
of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10432449 (10MatthewVernon) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250106T1400). [14:00:05] DreamRimmer, Lucas_WMDE, Jhs, and James_F: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:08] 10SRE-swift-storage, 10MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10432451 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon [closing this task, leaving the parent for looking at the underl... [14:00:38] 10SRE-swift-storage, 06Commons: Preview images from Wikimedia Commons cannot be displayed properly - https://phabricator.wikimedia.org/T383034#10432455 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon The presenting issue is fixed, there's a parent task for the underlying issue. [14:01:11] o/ present [14:01:22] hello [14:01:25] Is anyone else around to deploy? [14:01:49] o/ [14:01:57] Eh, OK, I'll do it. [14:02:00] I can deploy [14:02:16] Oh, awesome, over to Lucas_WMDE. [14:03:04] so many changes [14:03:05] * Lucas_WMDE looks [14:04:15] !log Deploy schema change on x1 dbmaint eqiad T383052 [14:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:17] T383052: Full table scan query on wikishared - https://phabricator.wikimedia.org/T383052 [14:05:32] any thoughts on https://phabricator.wikimedia.org/T382879#10432491 ? 
(about the 2FA change) [14:05:49] let’s go ahead with the first two changes by DreamRimmer for now [14:06:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107028 (https://phabricator.wikimedia.org/T382785) (owner: 10Dreamrimmer) [14:06:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108171 (https://phabricator.wikimedia.org/T382887) (owner: 10Dreamrimmer) [14:07:12] (03Merged) 10jenkins-bot: Add mergehistory to import and transwiki on en.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107028 (https://phabricator.wikimedia.org/T382785) (owner: 10Dreamrimmer) [14:07:14] (03Merged) 10jenkins-bot: Add suppressredirect and delete-redirect to en.wikinews reviewers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108171 (https://phabricator.wikimedia.org/T382887) (owner: 10Dreamrimmer) [14:07:32] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1107028|Add mergehistory to import and transwiki on en.wikibooks (T382785)]], [[gerrit:1108171|Add suppressredirect and delete-redirect to en.wikinews reviewers (T382887)]] [14:07:36] T382785: Add mergehistory to importers on en.wikibooks - https://phabricator.wikimedia.org/T382785 [14:07:36] T382887: Add suppressredirect and delete-redirect to en.wikinews reviewers - https://phabricator.wikimedia.org/T382887 [14:07:49] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2043.codfw.wmnet with reason: host reimage [14:08:17] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw - https://phabricator.wikimedia.org/T383053#10432510 (10MatthewVernon) [14:08:49] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2042.codfw.wmnet with reason: host 
reimage [14:10:09] !log Deploy schema change on x1 dbmaint codfw T383052 [14:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:12] T383052: Full table scan query on wikishared - https://phabricator.wikimedia.org/T383052 [14:10:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2043.codfw.wmnet with reason: host reimage [14:12:14] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1250.eqiad.wmnet with reason: host reimage [14:12:52] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dreamrimmer: Backport for [[gerrit:1107028|Add mergehistory to import and transwiki on en.wikibooks (T382785)]], [[gerrit:1108171|Add suppressredirect and delete-redirect to en.wikinews reviewers (T382887)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:12:56] T382785: Add mergehistory to importers on en.wikibooks - https://phabricator.wikimedia.org/T382785 [14:12:57] T382887: Add suppressredirect and delete-redirect to en.wikinews reviewers - https://phabricator.wikimedia.org/T382887 [14:13:08] checking [14:13:23] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw - https://phabricator.wikimedia.org/T383053#10432515 (10MatthewVernon) Narrow the time window down thus: ` sudo cumin "A:codfw and P{O:swift::proxy}" "zgrep -F 'wikipedia-commons-local-thumb.f8' /var/log/swift/proxy-a... 
[14:13:24] thanks [14:14:19] changes on enwikibooks and enwikinews look good to me fwiw [14:14:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2042.codfw.wmnet with reason: host reimage [14:15:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T371742)', diff saved to https://phabricator.wikimedia.org/P71808 and previous config saved to /var/cache/conftool/dbconfig/20250106-141520-ladsgroup.json [14:15:23] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:15:24] both look good to me [14:15:26] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, dreamrimmer: Continuing with sync [14:15:28] https://en.wikibooks.org/w/api.php?action=query&format=json&meta=siteinfo&formatversion=2&siprop=usergroups [14:15:50] and after that I’d actually go out-of-order and prioritize James_F, the ExtensionDistributor update sounds more important to me than my config cleanup or Jhs’ extra language codes [14:18:11] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1250.eqiad.wmnet with reason: host reimage [14:22:34] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1107028|Add mergehistory to import and transwiki on en.wikibooks (T382785)]], [[gerrit:1108171|Add suppressredirect and delete-redirect to en.wikinews reviewers (T382887)]] (duration: 15m 02s) [14:22:38] T382785: Add mergehistory to importers on en.wikibooks - https://phabricator.wikimedia.org/T382785 [14:22:38] T382887: Add suppressredirect and delete-redirect to en.wikinews reviewers - https://phabricator.wikimedia.org/T382887 [14:22:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106038 (https://phabricator.wikimedia.org/T372331) (owner: 
10MacFan4000) [14:23:31] (03Merged) 10jenkins-bot: ExtensionDistributor: Mark 1.43 as stable; remove 1.41 as EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106038 (https://phabricator.wikimedia.org/T372331) (owner: 10MacFan4000) [14:23:49] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1106038|ExtensionDistributor: Mark 1.43 as stable; remove 1.41 as EOL (T372331 T376550)]] [14:23:53] T372331: Mark REL1_43 in ExtensionDistributor as a stable release - https://phabricator.wikimedia.org/T372331 [14:23:53] T376550: Formally EOL MW 1.41 - https://phabricator.wikimedia.org/T376550 [14:24:36] Whee. [14:25:15] oh, I should’ve asked if you wanted to self-service I guess ^^ [14:25:56] thanks, Lucas [14:26:35] Lucas_WMDE: It's more than fine. Thank you! :-) [14:26:54] ok :) [14:27:51] !log lucaswerkmeister-wmde@deploy2002 macfan4000, lucaswerkmeister-wmde: Backport for [[gerrit:1106038|ExtensionDistributor: Mark 1.43 as stable; remove 1.41 as EOL (T372331 T376550)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:28:44] !log installing libvirt bugfix updates [14:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:15] https://www.mediawiki.org/wiki/Special:ExtensionDistributor/Wikibase looks good to me with WikimediaDebug [14:29:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10432623 (10phaultfinder) [14:30:20] Lucas_WMDE: Yeah, all good to deploy. 
[14:30:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P71809 and previous config saved to /var/cache/conftool/dbconfig/20250106-143027-ladsgroup.json [14:30:46] !log lucaswerkmeister-wmde@deploy2002 macfan4000, lucaswerkmeister-wmde: Continuing with sync [14:30:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2043.codfw.wmnet with OS bookworm [14:30:51] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:32:33] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10432639 (10MoritzMuehlenhoff) [14:33:28] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1243.eqiad.wmnet with OS bookworm [14:33:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2042.codfw.wmnet with OS bookworm [14:34:11] RECOVERY - BGP status on lsw1-a8-codfw.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:34:22] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2043.codfw.wmnet [14:34:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2043.codfw.wmnet [14:34:30] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2042.codfw.wmnet [14:34:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2042.codfw.wmnet [14:34:39] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 10Phabricator: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056 (10WMDE-leszek) 
03NEW [14:35:20] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2040-2041].codfw.wmnet [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:04] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1106038|ExtensionDistributor: Mark 1.43 as stable; remove 1.41 as EOL (T372331 T376550)]] (duration: 14m 14s) [14:38:08] T372331: Mark REL1_43 in ExtensionDistributor as a stable release - https://phabricator.wikimedia.org/T372331 [14:38:08] T376550: Formally EOL MW 1.41 - https://phabricator.wikimedia.org/T376550 [14:38:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108403 (https://phabricator.wikimedia.org/T381934) (owner: 10Jon Harald Søby) [14:38:41] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1250.eqiad.wmnet with OS bookworm [14:39:03] (03Merged) 10jenkins-bot: Add bfw, gju-arab, gju-deva, hoc and kgg to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108403 (https://phabricator.wikimedia.org/T381934) (owner: 10Jon Harald Søby) [14:39:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2040-2041].codfw.wmnet [14:39:19] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1108403|Add bfw, gju-arab, gju-deva, hoc and kgg to wmgExtraLanguageNames (T381934)]] [14:39:22] T381934: Add bfw, gju, hoc and kgg to language names - https://phabricator.wikimedia.org/T381934 [14:40:29] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker1251.eqiad.wmnet with OS bookworm [14:40:38] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2041.codfw.wmnet with OS bookworm [14:40:38] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2040.codfw.wmnet with OS bookworm [14:40:57] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2041 [14:40:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2041 [14:40:58] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2040 [14:40:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2040 [14:44:47] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:44:51] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:45:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P71810 and previous config saved to /var/cache/conftool/dbconfig/20250106-144534-ladsgroup.json [14:45:59] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, jhsoby: Backport for [[gerrit:1108403|Add bfw, gju-arab, gju-deva, hoc and kgg to wmgExtraLanguageNames (T381934)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:46:02] T381934: Add bfw, gju, hoc and kgg to language names - https://phabricator.wikimedia.org/T381934 [14:46:04] Jhs: please test :) [14:47:50] Lucas_WMDE, works as expected 👍 [14:49:14] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, jhsoby: Continuing with sync [14:49:16] \o/ 
[14:54:03] FIRING: KubernetesCalicoDown: wikikube-worker1243.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1243.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:56:31] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1108403|Add bfw, gju-arab, gju-deva, hoc and kgg to wmgExtraLanguageNames (T381934)]] (duration: 17m 11s) [14:56:34] T381934: Add bfw, gju, hoc and kgg to language names - https://phabricator.wikimedia.org/T381934 [14:57:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100083 (https://phabricator.wikimedia.org/T333667) (owner: 10Arthur taylor) [14:58:02] (03Merged) 10jenkins-bot: Remove EntitySchema DataType feature flag - is always enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100083 (https://phabricator.wikimedia.org/T333667) (owner: 10Arthur taylor) [14:58:21] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1100083|Remove EntitySchema DataType feature flag - is always enabled (T333667)]] [14:58:23] T333667: [ES-M5] Remove temporary feature flag for EntitySchema Datatype again - https://phabricator.wikimedia.org/T333667 [14:58:28] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2041.codfw.wmnet with reason: host reimage [14:58:55] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2040.codfw.wmnet with reason: host reimage [15:00:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T371742)', diff saved to https://phabricator.wikimedia.org/P71811 and previous config saved to /var/cache/conftool/dbconfig/20250106-150040-ladsgroup.json [15:00:44] T371742: 
Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [15:00:53] 07sre-alert-triage, 06Infrastructure-Foundations, 13Patch-For-Review: Alert in need of triage: PuppetConstantChange (instance ganeti-test2003:9100) - https://phabricator.wikimedia.org/T382870#10432768 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:00:57] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1251.eqiad.wmnet with reason: host reimage [15:01:27] oh dear, the window’s already over? [15:01:30] jouncebot: now [15:01:30] No deployments scheduled for the next 1 hour(s) and 28 minute(s) [15:01:33] ok phew [15:01:37] I’ll just keep deploying my config cleanup then [15:02:04] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, arthurtaylor: Backport for [[gerrit:1100083|Remove EntitySchema DataType feature flag - is always enabled (T333667)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:02:30] * Lucas_WMDE tests [15:03:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2041.codfw.wmnet with reason: host reimage [15:04:17] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, arthurtaylor: Continuing with sync [15:04:21] works afaict [15:05:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10432786 (10phaultfinder) [15:06:39] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1251.eqiad.wmnet with reason: host reimage [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:19] (03CR) 10Lucas Werkmeister (WMDE): "Looks technically fine but not deployed today per my 
comments on the task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107936 (https://phabricator.wikimedia.org/T382879) (owner: 10Novem Linguae) [15:09:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2040.codfw.wmnet with reason: host reimage [15:09:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [15:09:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [15:11:46] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100083|Remove EntitySchema DataType feature flag - is always enabled (T333667)]] (duration: 13m 25s) [15:11:49] T333667: [ES-M5] Remove temporary feature flag for EntitySchema Datatype again - https://phabricator.wikimedia.org/T333667 [15:12:08] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw - https://phabricator.wikimedia.org/T383053#10432813 (10MatthewVernon) I found nothing on the proxy-servers, but on ms-be2058 (the first node in the ring for this container), I find (`#012` in log line converted to new... 
[15:12:40] !log UTC afternoon backport+config window done [15:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:46] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10432815 (10Aklapper) [15:12:51] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10432816 (10Aklapper) [15:12:59] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:13:08] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:15:18] (03CR) 10AOkoth: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1108112 (https://phabricator.wikimedia.org/T382964) (owner: 10AOkoth) [15:17:43] (03CR) 10AOkoth: "Ack." [puppet] - 10https://gerrit.wikimedia.org/r/1108112 (https://phabricator.wikimedia.org/T382964) (owner: 10AOkoth) [15:19:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [15:19:35] !incidents [15:19:35] 5580 (UNACKED) NELHigh sre (thanos-rule tcp.timed_out) [15:19:37] !ack 5580 [15:19:38] 5580 (ACKED) NELHigh sre (thanos-rule tcp.timed_out) [15:20:21] here [15:22:53] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:23:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2041.codfw.wmnet with OS bookworm [15:23:28] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1244.eqiad.wmnet with OS bookworm [15:24:17] RESOLVED: NELHigh: Elevated Network Error 
Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [15:24:55] (03CR) 10JMeybohm: create sre.k8s.roll-reimage-nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [15:25:21] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1251.eqiad.wmnet with OS bookworm [15:27:03] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1252.eqiad.wmnet with OS bookworm [15:27:40] 06SRE, 06Traffic: Reveal IP after click only on Varnish error pages - https://phabricator.wikimedia.org/T383062 (10Diskdance) 03NEW [15:28:28] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:29:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2040.codfw.wmnet with OS bookworm [15:29:58] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:31:10] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2040.codfw.wmnet [15:31:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2040.codfw.wmnet [15:31:25] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2041.codfw.wmnet [15:31:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2041.codfw.wmnet [15:31:52] (03PS2) 10Ottomata: Disable 
varnish handling of /beacon/event on cp1100 [puppet] - 10https://gerrit.wikimedia.org/r/1105076 (https://phabricator.wikimedia.org/T353817) [15:32:00] (03CR) 10Ottomata: "Done." [puppet] - 10https://gerrit.wikimedia.org/r/1105076 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [15:33:12] (03CR) 10Ssingh: Disable varnish handling of /beacon/event on cp1100 [puppet] - 10https://gerrit.wikimedia.org/r/1105076 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [15:35:23] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751#10432882 (10Jhancock.wm) going to replace the disk. two notes the server is out of warranty so it's a repurposed disk. getting an error on DIMM B6. going to replace it as well from decommed stock. Th... [15:36:08] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751#10432883 (10Marostegui) Thank you @Jhancock.wm! [15:38:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2143.codfw.wmnet with reason: onsite maintenance [15:39:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2143.codfw.wmnet with reason: onsite maintenance [15:48:06] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw - https://phabricator.wikimedia.org/T383053#10432897 (10MatthewVernon) Similar errors similarly timestamped on the other two storage nodes ms-be2073 and ms-be2074 [15:51:09] !log uploaded openjdk-21 21.0.5+11-1~deb12u1 to apt.wikimedia.org component/jdk21 [15:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - 
https://alerts.wikimedia.org/?q=alertname%3DNELHigh [15:55:23] !incidents [15:55:23] 5581 (UNACKED) NELHigh sre (thanos-rule tcp.timed_out) [15:55:23] 5580 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out) [15:55:25] !ack 5581 [15:55:25] 5581 (ACKED) NELHigh sre (thanos-rule tcp.timed_out) [15:55:43] (03CR) 10Isabelle Hurbain-Palatin: "if one of you +1s this I'll schedule backport :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100850 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin) [16:01:03] (03CR) 10Subramanya Sastry: [C:03+1] Reactivate Parsoid+Kartographer on hewiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100850 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin) [16:04:01] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751#10432947 (10Jhancock.wm) powered up and both alerts have cleared. Does everything look good on your end? @Marostegui [16:05:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [16:05:54] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751#10432956 (10Marostegui) It looks good, the RAID is rebuilding: ` Slot Number: 2 Firmware state: Rebuild ` And the memory errors have vanished. I think we can close this! Thank you so much [16:06:32] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T383064 (10ops-monitoring-bot) 03NEW [16:06:50] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751#10432966 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm np! 
[16:08:54] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T383064#10432980 (10Jhancock.wm) gonna decline this one shortly. it popped up as we were fixing T382751 [16:11:05] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T383064#10432998 (10Marostegui) 05Open→03Declined The RAID is correctly rebuilding as part of T382751 [16:15:23] (03PS2) 10Abijeet Patro: Enable Translate message bundle Scribunto library on MetaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099725 (https://phabricator.wikimedia.org/T379892) [16:25:27] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:26:13] (03PS21) 10Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) [16:26:45] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2021.codfw.wmnet - https://phabricator.wikimedia.org/T382944#10433081 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:28:26] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10433098 (10MatthewVernon) [16:28:52] (03PS1) 10Btullis: Add pool for the dumps cephfs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108447 (https://phabricator.wikimedia.org/T382490) [16:29:21] (03CR) 10Brouberol: [C:03+1] Add pool for the dumps cephfs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108447 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis) [16:30:05] jan_drewniak: gettimeofday() says it's time for Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250106T1630) [16:33:17] (03CR) 10Btullis: [C:03+2] Add pool for the dumps cephfs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108447 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis) [16:33:26] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2020.codfw.wmnet - https://phabricator.wikimedia.org/T382945#10433113 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:37:04] (03Merged) 10jenkins-bot: Add pool for the dumps cephfs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108447 (https://phabricator.wikimedia.org/T382490) (owner: 10Btullis) [16:37:50] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:38:01] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:39:44] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:39:53] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[16:42:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance [16:42:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance [16:42:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1193 (T370903)', diff saved to https://phabricator.wikimedia.org/P71813 and previous config saved to /var/cache/conftool/dbconfig/20250106-164215-ladsgroup.json [16:42:19] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:42:29] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108451 (https://phabricator.wikimedia.org/T128546) [16:44:04] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108451 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:44:07] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2022.codfw.wmnet - https://phabricator.wikimedia.org/T382946#10433154 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:44:43] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108451 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:47:20] (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108452 (https://phabricator.wikimedia.org/T382617) [16:48:29] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10433181 (10MatthewVernon) All three database files have different checksums, but the same failure of integrity check: ` mvernon@ms-be2073:~$ sqlite3 4077d9... 
[16:49:18] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es2023.codfw.wmnet - https://phabricator.wikimedia.org/T383026#10433184 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:49:50] (03PS1) 10Marostegui: es2024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108453 (https://phabricator.wikimedia.org/T383028) [16:50:29] (03CR) 10Marostegui: [C:03+2] es2024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1108453 (https://phabricator.wikimedia.org/T383028) (owner: 10Marostegui) [16:51:04] (03CR) 10DDesouza: [C:03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108452 (https://phabricator.wikimedia.org/T382617) (owner: 10DDesouza) [16:52:35] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108452 (https://phabricator.wikimedia.org/T382617) (owner: 10DDesouza) [16:52:47] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1252.eqiad.wmnet with reason: host reimage [16:53:38] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [16:54:10] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:54:11] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:54:52] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [16:54:54] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [16:54:56] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 10decommission-hardware: decommission dbproxy2004.codfw.wmnet - https://phabricator.wikimedia.org/T382877#10433201 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:55:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T370903)', diff saved to https://phabricator.wikimedia.org/P71815 and previous config saved to 
/var/cache/conftool/dbconfig/20250106-165503-ladsgroup.json [16:55:06] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:55:24] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [16:56:24] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1252.eqiad.wmnet with reason: host reimage [16:58:07] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1108451| Bumping portals to master (T128546)]] (duration: 12m 29s) [16:58:10] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [17:00:04] 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#10433228 (10bd808) [17:00:51] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1108451| Bumping portals to master (T128546)]] (duration: 02m 43s) [17:05:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485#10433256 (10Papaul) The interface on this server is showing 100Mb/s it should be 1000Mb/s ` es1043:~$ sudo ethtool eno8303 | grep Speed Speed: 100Mb/s ` on the switch it self the speed is se... 
[17:09:59] ops-codfw, SRE, Data-Persistence, DC-Ops, decommission-hardware: decommission dbproxy2003.codfw.wmnet - https://phabricator.wikimedia.org/T382875#10433261 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[17:10:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P71816 and previous config saved to /var/cache/conftool/dbconfig/20250106-171010-ladsgroup.json
[17:12:13] ops-eqiad, SRE, Data-Persistence, DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485#10433270 (Marostegui) Those hosts aren't in production and don't have alerting, so you can proceed as needed whenever you want!
[17:13:01] ops-codfw, SRE, Data-Persistence, DC-Ops, decommission-hardware: decommission dbproxy2001.codfw.wmnet - https://phabricator.wikimedia.org/T382867#10433273 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[17:13:34] (PS1) David Caro: helm-sudo: use the right binary [puppet] - https://gerrit.wikimedia.org/r/1108455
[17:15:18] !log jayme@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1244.eqiad.wmnet with OS bookworm
[17:15:31] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=1) rolling reimage on P{wikikube-worker[1240-1244].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[17:15:52] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1252.eqiad.wmnet with OS bookworm
[17:15:56] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1250-1252].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[17:16:17] (CR) David Caro: [C:+2] helm-sudo: use the right binary [puppet] - https://gerrit.wikimedia.org/r/1108455 (owner: David Caro)
[17:16:40] ops-codfw, SRE, Data-Persistence, DC-Ops,
decommission-hardware: decommission dbproxy2002.codfw.wmnet - https://phabricator.wikimedia.org/T382868#10433299 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[17:16:53] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1244.eqiad.wmnet with OS bookworm
[17:22:10] ops-codfw, SRE, DBA, DC-Ops, decommission-hardware: decommission db2116.codfw.wmnet - https://phabricator.wikimedia.org/T362950#10433344 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[17:25:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P71817 and previous config saved to /var/cache/conftool/dbconfig/20250106-172517-ladsgroup.json
[17:27:17] ops-codfw, SRE, DBA, DC-Ops, decommission-hardware: decommission db2115.codfw.wmnet - https://phabricator.wikimedia.org/T362949#10433403 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[17:28:16] (CR) Herron: "Nice! Couple of nonblocking questions and thoughts for you inline, mostly about how instance overrides will/could work." [puppet] - https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) (owner: Filippo Giunchedi)
[17:28:22] SRE-swift-storage, MediaViewer: PNG thumbnail of SVG file not displayed; returns "Unauthorized" error when attempting to view - https://phabricator.wikimedia.org/T383023#10433409 (DavidEppstein) The two images I was having trouble viewing before are now good. Thanks!
[17:32:54] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10433433 (Jhancock.wm) @Andrew checking back on this one. anything i can help with?
[17:35:35] SRE, Traffic: Reveal IP after click only on Varnish error pages - https://phabricator.wikimedia.org/T383062#10433442 (Lucas_Werkmeister_WMDE) > For non-JavaScript fallback, we can just choose to show or hide the IP completely (Cloudflare chooses the latter). A [