[00:05:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:05:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10451817 (phaultfinder)
[00:24:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10451822 (phaultfinder)
[00:38:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1110459
[00:38:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1110459 (owner: TrainBranchBot)
[00:49:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10451871 (phaultfinder)
[00:53:52] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383213#10451876 (VRiley-WMF) Open→Resolved
[00:54:46] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1110459 (owner: TrainBranchBot)
[01:08:06] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1110460
[01:08:06] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1110460 (owner: TrainBranchBot)
[01:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:13:33] ACKNOWLEDGEMENT - MD RAID on ms-be2075 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T383530 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:13:39] ops-codfw, SRE, DC-Ops: Degraded RAID on ms-be2075 - https://phabricator.wikimedia.org/T383530 (ops-monitoring-bot) NEW
[01:28:29] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1110460 (owner: TrainBranchBot)
[01:57:16] FIRING: [4x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:58:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1076-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:05:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:19:52] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:21:44] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29349 bytes in 1.138 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:26:20] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:57:16] FIRING: [6x] ProbeDown: Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:17:16] FIRING: [6x] ProbeDown: Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:19:21] FIRING: [6x] ProbeDown: Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:40:20] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:17:20] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:24:43] (PS1) Jelto: Rename kubernetes20[42-44] to wikikube-worker220[3-5] [puppet] - https://gerrit.wikimedia.org/r/1110664 (https://phabricator.wikimedia.org/T377877)
[07:38:49] (CR) Muehlenhoff: [C:+2] postgresql::user: Use wmflib::debian_postgresql_version [puppet] - https://gerrit.wikimedia.org/r/1108741 (owner: Muehlenhoff)
[07:41:44] SRE, Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537 (MoritzMuehlenhoff) NEW
[07:41:46] SRE, Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10452062 (MoritzMuehlenhoff) p:Triage→Medium
[07:58:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1076-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T0800). nyaa~
[08:00:05] MatmaRex and DreamRimmer: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:05:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:25:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:27:46] !log updated netboot image for bookworm to 12.9 T383537
[08:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:50] T383537: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537
[08:28:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:30:28] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:34:12] !log hashar@deploy2002 Started deploy [integration/docroot@a81d82c]: build: Updating mediawiki/mediawiki-phan-config to 0.15.1
[08:34:22] !log hashar@deploy2002 Finished deploy [integration/docroot@a81d82c]: build: Updating mediawiki/mediawiki-phan-config to 0.15.1 (duration: 00m 09s)
[08:43:30] SRE, Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10452120 (MoritzMuehlenhoff)
[08:47:57] (PS1) Marostegui: mariadb: Remove db2135 [puppet] - https://gerrit.wikimedia.org/r/1110718 (https://phabricator.wikimedia.org/T383426)
[08:49:03] (CR) Jelto: [C:+1] "lgtm when comparing with https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/platform/rbac.md#prometheus-rb" [deployment-charts] - https://gerrit.wikimedia.org/r/1109728 (https://phabricator.wikimedia.org/T383413) (owner: JMeybohm)
[08:49:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2135.codfw.wmnet
[08:51:07] (CR) Marostegui: [C:+2] mariadb: Remove db2135 [puppet] - https://gerrit.wikimedia.org/r/1110718 (https://phabricator.wikimedia.org/T383426) (owner: Marostegui)
[08:52:36] (CR) Jelto: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109735 (https://phabricator.wikimedia.org/T383413) (owner: JMeybohm)
[08:53:57] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[08:57:19] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2135.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[08:57:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2135.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[08:57:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:57:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2135.codfw.wmnet
[08:58:17] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2135.codfw.wmnet - https://phabricator.wikimedia.org/T383426#10452155 (Marostegui) a:Marostegui→None
[08:59:44] ops-codfw, DBA, DC-Ops, decommission-hardware: decommission db2135.codfw.wmnet - https://phabricator.wikimedia.org/T383426#10452160 (Marostegui) This is ready for #dc-ops
[09:06:03] SRE, Infrastructure-Foundations, Observability-Logging, Puppet (Puppet 7.0): Switch rsyslog to use the new PKI infrastructure - https://phabricator.wikimedia.org/T347565#10452169 (fgiunchedi) Yes there's pki support though it needs to be enabled fleet wide. I'll update the task description
[09:08:12] SRE, Infrastructure-Foundations, Observability-Logging, Puppet (Puppet 7.0): Switch rsyslog to use the new PKI infrastructure - https://phabricator.wikimedia.org/T347565#10452170 (fgiunchedi)
[09:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:13:14] (PS1) Muehlenhoff: Remove obsolete Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/1110721 (https://phabricator.wikimedia.org/T383276)
[09:17:06] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo and not P{cp4044.ulsfo.wmnet} and A:cp
[09:18:57] (CR) Jelto: [C:+1] "looks good to me, PCC diff has changes for the `tlsCertFile` and `readOnlyPort`" [puppet] - https://gerrit.wikimedia.org/r/1109735 (https://phabricator.wikimedia.org/T383413) (owner: JMeybohm)
[09:23:54] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1109734 (https://phabricator.wikimedia.org/T383413) (owner: JMeybohm)
[09:25:29] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:28:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:29:00] (CR) MVernon: [C:+1] cassandra: rotate target_version 'dev' to '4.x' [puppet] - https://gerrit.wikimedia.org/r/1109767 (https://phabricator.wikimedia.org/T380420) (owner: Eevans)
[09:30:12] (CR) MVernon: [C:+1] cassandra: set target_dev to 4.x (no-op) [puppet] - https://gerrit.wikimedia.org/r/1109768 (https://phabricator.wikimedia.org/T380420) (owner: Eevans)
[09:31:41] (CR) Jelto: [C:+1] "lgtm, PCC looks as expected" [puppet] - https://gerrit.wikimedia.org/r/1109733 (https://phabricator.wikimedia.org/T383413) (owner: JMeybohm)
[09:36:49] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo and not P{cp4044.ulsfo.wmnet} and A:cp
[09:37:49] (PS1) Muehlenhoff: ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724
[09:38:10] (CR) CI reject: [V:-1] ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724 (owner: Muehlenhoff)
[09:38:22] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo and not P{cp4052.ulsfo.wmnet} and A:cp
[09:38:42] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:38:54] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:39:22] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:41:20] !log dbmaint on pc5@eqiad (T382948)
[09:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:25] T382948: ParserCache is not deleting old rows after three months past the expiry in the secondary datacenter - https://phabricator.wikimedia.org/T382948
[09:42:28] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.upgrade for db2212.codfw.wmnet
[09:45:03] (PS2) Muehlenhoff: ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724
[09:45:36] (PS2) Ladsgroup: Add wikitech.wikimedia.org to list of local vhosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1109752 (https://phabricator.wikimedia.org/T376305)
[09:45:42] (CR) Ladsgroup: [C:+2] Add wikitech.wikimedia.org to list of local vhosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1109752 (https://phabricator.wikimedia.org/T376305) (owner: Ladsgroup)
[09:46:26] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1109752 (https://phabricator.wikimedia.org/T376305) (owner: Ladsgroup)
[09:47:01] (Merged) jenkins-bot: Add wikitech.wikimedia.org to list of local vhosts [mediawiki-config] - https://gerrit.wikimedia.org/r/1109752 (https://phabricator.wikimedia.org/T376305) (owner: Ladsgroup)
[09:47:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2212.codfw.wmnet
[09:48:10] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1109752|Add wikitech.wikimedia.org to list of local vhosts (T376305)]]
[09:48:13] T376305: Wikitech notifications failing to load cross-wiki - https://phabricator.wikimedia.org/T376305
[09:48:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Switchover es5 eqiad master dbmaint T382569', diff saved to https://phabricator.wikimedia.org/P71987 and previous config saved to /var/cache/conftool/dbconfig/20250113-094833-marostegui.json
[09:48:36] T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569
[09:48:43] (CR) JMeybohm: [C:+1] Rename kubernetes20[42-44] to wikikube-worker220[3-5] [puppet] - https://gerrit.wikimedia.org/r/1110664 (https://phabricator.wikimedia.org/T377877) (owner: Jelto)
[09:48:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1023', diff saved to https://phabricator.wikimedia.org/P71988 and previous config saved to /var/cache/conftool/dbconfig/20250113-094846-marostegui.json
[09:49:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es1023.eqiad.wmnet with reason: cloning
[09:49:30] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2212.codfw.wmnet with reason: Reboot
[09:49:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es1023.eqiad.wmnet with reason: cloning
[09:49:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2212.codfw.wmnet with reason: Reboot
[09:50:19] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1110724 (owner: Muehlenhoff)
[09:50:55] (PS1) Marostegui: mariadb: Productionize es1044 [puppet] - https://gerrit.wikimedia.org/r/1110725 (https://phabricator.wikimedia.org/T382569)
[09:51:23] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[2042-2044].codfw.wmnet
[09:51:55] (CR) Marostegui: [C:+2] mariadb: Productionize es1044 [puppet] - https://gerrit.wikimedia.org/r/1110725 (https://phabricator.wikimedia.org/T382569) (owner: Marostegui)
[09:55:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[2042-2044].codfw.wmnet
[09:55:57] (CR) Jelto: [C:+2] Rename kubernetes20[42-44] to wikikube-worker220[3-5] [puppet] - https://gerrit.wikimedia.org/r/1110664 (https://phabricator.wikimedia.org/T377877) (owner: Jelto)
[09:56:46] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo and not P{cp4052.ulsfo.wmnet} and A:cp
[09:56:53] (PS1) Marostegui: site.pp: Remove es1044 from insetup [puppet] - https://gerrit.wikimedia.org/r/1110727
[09:57:34] (CR) Marostegui: [C:+2] site.pp: Remove es1044 from insetup [puppet] - https://gerrit.wikimedia.org/r/1110727 (owner: Marostegui)
[09:58:08] (PS3) Muehlenhoff: ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724
[09:59:44] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2042 to wikikube-worker2203
[10:00:03] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1109752|Add wikitech.wikimedia.org to list of local vhosts (T376305)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:00:05] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[10:00:06] T376305: Wikitech notifications failing to load cross-wiki - https://phabricator.wikimedia.org/T376305
[10:00:36] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:00:51] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[10:01:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on pc1015.eqiad.wmnet with reason: cloning
[10:01:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc1015.eqiad.wmnet with reason: cloning
[10:02:10] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:02:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on pc1013.eqiad.wmnet with reason: cloning
[10:02:31] (PS4) Muehlenhoff: ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724
[10:02:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc1013.eqiad.wmnet with reason: cloning
[10:02:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on pc2013.codfw.wmnet with reason: cloning
[10:03:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc2013.codfw.wmnet with reason: cloning
[10:03:32] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2042 to wikikube-worker2203 - jelto@cumin1002"
[10:03:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2042 to wikikube-worker2203 - jelto@cumin1002"
[10:04:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:04:00] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2203
[10:04:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2203
[10:05:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2042 to wikikube-worker2203
[10:05:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc3 T383398', diff saved to https://phabricator.wikimedia.org/P71989 and previous config saved to /var/cache/conftool/dbconfig/20250113-100554-marostegui.json
[10:05:58] T383398: Reorganize and clean existing pc1-pc5 sections - https://phabricator.wikimedia.org/T383398
[10:06:44] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for pc2013.codfw.wmnet
[10:07:20] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1110724 (owner: Muehlenhoff)
[10:07:23] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for pc1015.eqiad.wmnet
[10:07:44] !log Upgrade pc2013 pc1015 pc3 dbmaint eqiad codfw T383398
[10:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:57] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2043 to wikikube-worker2204
[10:08:18] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[10:08:48] (PS1) Marostegui: pc1015: Move to pc5 [puppet] - https://gerrit.wikimedia.org/r/1110729 (https://phabricator.wikimedia.org/T383398)
[10:08:53] (CR) Filippo Giunchedi: [C:+1] "This LGTM, please note that other check types like http and tcp will have to be fixed (here or in a separate review)" [puppet] - https://gerrit.wikimedia.org/r/1100782 (https://phabricator.wikimedia.org/T381561) (owner: Tiziano Fogli)
[10:09:07] (CR) Marostegui: [C:-2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/1110729 (https://phabricator.wikimedia.org/T383398) (owner: Marostegui)
[10:09:40] FIRING: KubernetesRsyslogDown: rsyslog on kubernetes2044:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2044 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:10:04] FIRING: [2x] ErrorBudgetBurn: wdqs - wdqs-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[10:10:38] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109752|Add wikitech.wikimedia.org to list of local vhosts (T376305)]] (duration: 22m 28s)
[10:10:42] T376305: Wikitech notifications failing to load cross-wiki - https://phabricator.wikimedia.org/T376305
[10:11:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Fix weights in pc3', diff saved to https://phabricator.wikimedia.org/P71990 and previous config saved to /var/cache/conftool/dbconfig/20250113-101132-marostegui.json
[10:12:12] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2043 to wikikube-worker2204 - jelto@cumin1002"
[10:12:13] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for pc2013.codfw.wmnet
[10:13:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2043 to wikikube-worker2204 - jelto@cumin1002"
[10:13:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:13:02] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2204
[10:13:22] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for pc1015.eqiad.wmnet
[10:13:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2204
[10:14:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2043 to wikikube-worker2204
[10:14:23] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2044 to wikikube-worker2205
[10:14:43] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[10:15:36] (PS5) Muehlenhoff: ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724 (https://phabricator.wikimedia.org/T309724)
[10:16:14] (CR) STran: [C:+2] ipoid: Bump activeDeadlineSeconds to 1 week [deployment-charts] - https://gerrit.wikimedia.org/r/1109723 (https://phabricator.wikimedia.org/T374414) (owner: STran)
[10:16:30] (CR) Marostegui: pc1015: Move to pc5 [puppet] - https://gerrit.wikimedia.org/r/1110729 (https://phabricator.wikimedia.org/T383398) (owner: Marostegui)
[10:16:34] (CR) Marostegui: [C:+2] pc1015: Move to pc5 [puppet] - https://gerrit.wikimedia.org/r/1110729 (https://phabricator.wikimedia.org/T383398) (owner: Marostegui)
[10:17:56] (Merged) jenkins-bot: ipoid: Bump activeDeadlineSeconds to 1 week [deployment-charts] - https://gerrit.wikimedia.org/r/1109723 (https://phabricator.wikimedia.org/T374414) (owner: STran)
[10:18:07] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2044 to wikikube-worker2205 - jelto@cumin1002"
[10:18:32] (PS1) Marostegui: pc1013: Make it pc3 master [puppet] - https://gerrit.wikimedia.org/r/1110731 (https://phabricator.wikimedia.org/T383398)
[10:18:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2044 to wikikube-worker2205 - jelto@cumin1002"
[10:18:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:18:58] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2205
[10:19:08] (CR) Marostegui: [C:+2] pc1013: Make it pc3 master [puppet] - https://gerrit.wikimedia.org/r/1110731 (https://phabricator.wikimedia.org/T383398) (owner: Marostegui)
[10:20:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove pc1015 from pc3', diff saved to https://phabricator.wikimedia.org/P71991 and previous config saved to /var/cache/conftool/dbconfig/20250113-102047-marostegui.json
[10:21:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Make pc1013 master in pc3 T383398', diff saved to https://phabricator.wikimedia.org/P71992 and previous config saved to /var/cache/conftool/dbconfig/20250113-102152-marostegui.json
[10:21:56] T383398: Reorganize and clean existing pc1-pc5 sections - https://phabricator.wikimedia.org/T383398
[10:23:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc3 T383398', diff saved to https://phabricator.wikimedia.org/P71993 and previous config saved to /var/cache/conftool/dbconfig/20250113-102343-marostegui.json
[10:24:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2205
[10:25:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2044 to wikikube-worker2205
[10:25:37] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[10:25:39] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[10:26:21] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[10:26:24] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[10:27:15] (PS1) Slyngshede: Provide additional information about users [software/bitu] - https://gerrit.wikimedia.org/r/1110732 (https://phabricator.wikimedia.org/T383201)
[10:28:07] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[10:28:22] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[10:33:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on pc1015.eqiad.wmnet with reason: cloning
[10:33:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc1015.eqiad.wmnet with reason: cloning
[10:36:23] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2203.codfw.wmnet wikikube-worker2204.codfw.wmnet wikikube-worker2205.codfw.wmnet on all recursors
[10:36:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2203.codfw.wmnet wikikube-worker2204.codfw.wmnet wikikube-worker2205.codfw.wmnet on all recursors
[10:36:51] (PS1) STran: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1110735 (https://phabricator.wikimedia.org/T374414)
[10:38:06] SRE-swift-storage, UploadWizard, Unstewarded-production-error, Wikimedia-production-error: "Could not store upload in the stash (UploadStashFileException)" for 2.4 GiB TIF file - https://phabricator.wikimedia.org/T285341#10452580 (MatthewVernon) I'm glad this worked on the second attempt. I'v...
[10:39:47] (PS2) Ladsgroup: mediawiki: Add Uncategorizedpages cron for commonswiki [puppet] - https://gerrit.wikimedia.org/r/1109526 (https://phabricator.wikimedia.org/T369024)
[10:41:13] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2203.codfw.wmnet with OS bookworm
[10:41:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P71994 and previous config saved to /var/cache/conftool/dbconfig/20250113-104115-root.json
[10:41:24] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2203
[10:41:30] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[10:41:57] (CR) Clément Goubert: [C:+1] decom wikikube-worker10[08-10,13,14,17,18] [puppet] - https://gerrit.wikimedia.org/r/1109712 (https://phabricator.wikimedia.org/T375842) (owner: Kamila Součková)
[10:42:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P71995 and previous config saved to /var/cache/conftool/dbconfig/20250113-104250-root.json
[10:43:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2240', diff saved to https://phabricator.wikimedia.org/P71996 and previous config saved to /var/cache/conftool/dbconfig/20250113-104310-marostegui.json
[10:43:22] SRE-tools, Infrastructure-Foundations: Outdated cookbooks cleanup - https://phabricator.wikimedia.org/T379259#10452589 (Volans) @BTullis following up from our chat on [[ https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1104950/2/cookbooks/sre/aqs/__init__.py | this CR ]], when you have a chance le...
[10:43:45] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2240.codfw.wmnet
[10:44:51] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2203 - jelto@cumin1002"
[10:44:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2203 - jelto@cumin1002"
[10:44:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:44:56] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2203.codfw.wmnet 165.32.192.10.in-addr.arpa 5.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:44:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2203.codfw.wmnet 165.32.192.10.in-addr.arpa 5.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:44:59] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2203
[10:45:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2203
[10:45:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2203
[10:45:29] (PS1) Marostegui: mariadb: Remove db2123 [puppet] - https://gerrit.wikimedia.org/r/1110737 (https://phabricator.wikimedia.org/T383388)
[10:46:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2123.codfw.wmnet
[10:47:22] (CR) Marostegui: [C:+2] mariadb: Remove db2123 [puppet] - https://gerrit.wikimedia.org/r/1110737 (https://phabricator.wikimedia.org/T383388) (owner: 
10Marostegui) [10:49:34] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2240.codfw.wmnet [10:50:44] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [10:54:54] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2123.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [10:55:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2123.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002" [10:55:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:55:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2123.codfw.wmnet [10:55:40] PROBLEM - SSH on bast2003 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:56:23] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2123.codfw.wmnet - https://phabricator.wikimedia.org/T383388#10452668 (10Marostegui) a:05Marostegui→03None [10:56:26] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2123.codfw.wmnet - https://phabricator.wikimedia.org/T383388#10452673 (10Marostegui) This is ready for #dc-ops [10:56:40] RECOVERY - SSH on bast2003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:57:47] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2204.codfw.wmnet with OS bookworm [10:57:57] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2204 [10:58:07] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:58:15] (03CR) 10Marostegui: [C:03+1] Add new file tables to WMCS views [puppet] - 
10https://gerrit.wikimedia.org/r/1110046 (https://phabricator.wikimedia.org/T383491) (owner: 10Ladsgroup) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1100) [11:00:22] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2003.codfw.wmnet with reason: os upgrade [11:00:37] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2003.codfw.wmnet with reason: os upgrade [11:00:48] 06SRE, 10Observability-Logging, 06serviceops, 10WMF-General-or-Unknown: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078#10452687 (10Clement_Goubert) >>! In T187078#10446147, @andrea.denisse wrote: > I think that having a list of the... [11:01:28] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:01:32] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2204 - jelto@cumin1002" [11:01:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2204 - jelto@cumin1002" [11:01:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:01:37] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2204.codfw.wmnet 164.32.192.10.in-addr.arpa 4.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:01:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2204.codfw.wmnet 164.32.192.10.in-addr.arpa 
4.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:01:40] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2204 [11:01:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2204 [11:01:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2204 [11:02:52] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:03:07] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2205.codfw.wmnet with OS bookworm [11:03:18] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2205 [11:03:31] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [11:03:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2123 from dbctl for decommission', diff saved to https://phabricator.wikimedia.org/P71997 and previous config saved to /var/cache/conftool/dbconfig/20250113-110333-marostegui.json [11:03:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P71998 and previous config saved to /var/cache/conftool/dbconfig/20250113-110336-root.json [11:04:04] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2203.codfw.wmnet with reason: host reimage [11:05:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db[2133,2160].codfw.wmnet with reason: cloning [11:05:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2133,2160].codfw.wmnet with reason: cloning [11:06:28] RECOVERY - Uncommitted dbctl configuration changes- check dbctl 
config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:06:50] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2205 - jelto@cumin1002" [11:06:53] 06SRE, 10Cassandra, 10RESTBase-Cassandra: restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424#10452718 (10fgiunchedi) Untagging o11y, AFAIK we haven't seen a reoccurrence of this. Though please reach out if things change [11:06:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2205 - jelto@cumin1002" [11:06:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:06:55] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2205.codfw.wmnet 230.48.192.10.in-addr.arpa 0.3.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:06:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2205.codfw.wmnet 230.48.192.10.in-addr.arpa 0.3.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:06:58] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2205 [11:07:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2203.codfw.wmnet with reason: host reimage [11:07:41] (03CR) 10FNegri: "AFAIU the current owners of the views definition are the Data Engineering team, so they should +1 this patch, but I'm not sure who exactly" [puppet] - 10https://gerrit.wikimedia.org/r/1110046 (https://phabricator.wikimedia.org/T383491) (owner: 10Ladsgroup) [11:07:51] RECOVERY - Uncommitted dbctl 
configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [11:08:22] 06SRE, 10observability, 10Observability-Logging, 10Wikimedia-Logstash, 13Patch-For-Review: Move iegreview from udp2log to syslog - https://phabricator.wikimedia.org/T215497#10452724 (10fgiunchedi) 05Open→03Invalid iegreview is gone: {T334415} [11:09:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2205 [11:09:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2205 [11:09:56] !log installing pymysql security updates [11:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:31] (03PS1) 10Marostegui: mariadb: Productionize db2233 [puppet] - 10https://gerrit.wikimedia.org/r/1110741 (https://phabricator.wikimedia.org/T373579) [11:12:16] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db2233 [puppet] - 10https://gerrit.wikimedia.org/r/1110741 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [11:14:56] 06SRE, 10Observability-Logging, 07Security: ulog: filter out diffscan from ulog - https://phabricator.wikimedia.org/T265590#10452755 (10fgiunchedi) 05Open→03Declined I'm declining this I don't think it has been a problem in practice [11:18:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72000 and previous config saved to /var/cache/conftool/dbconfig/20250113-111842-root.json [11:18:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160,2232].codfw.wmnet with reason: maintenance [11:18:59] 06SRE, 10Observability-Logging: Develop tooling for quickly parsing 5xx and sampled-1000 logs - https://phabricator.wikimedia.org/T292682#10452789 (10fgiunchedi) 05Open→03Declined Nowadays we have sampled webrequest 
available in superset and related dashboards, 5xx feed is in logstash though we could a... [11:19:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160,2232].codfw.wmnet with reason: maintenance [11:19:42] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:20:19] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2204.codfw.wmnet with reason: host reimage [11:20:21] 06SRE, 10Observability-Logging: Ingest webrequest sampled 1000 into logstash - https://phabricator.wikimedia.org/T301110#10452806 (10fgiunchedi) 05Open→03Declined I'm declining the task since webrequest sampled is available in superset and AFAIK that has been working well for SRE without the need to ac... 
[11:20:21] !log Move db2160:3322 under db2232 in m2 codfw dbmaint T373579 [11:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:24] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [11:21:05] (03CR) 10Máté Szabó: [C:03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110735 (https://phabricator.wikimedia.org/T374414) (owner: 10STran) [11:22:29] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110735 (https://phabricator.wikimedia.org/T374414) (owner: 10STran) [11:24:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2204.codfw.wmnet with reason: host reimage [11:24:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db[2132,2160,2232].codfw.wmnet with reason: maintenance [11:24:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2132,2160,2232].codfw.wmnet with reason: maintenance [11:24:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:25:55] (03PS1) 10Muehlenhoff: Revert "Remove access for aitolkyn" [puppet] - 10https://gerrit.wikimedia.org/r/1110743 [11:25:55] (03PS1) 10Muehlenhoff: Bump access date and update point of contact [puppet] - 10https://gerrit.wikimedia.org/r/1110744 [11:26:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2203.codfw.wmnet with OS bookworm [11:28:04] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2205.codfw.wmnet with reason: host reimage [11:29:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job 
thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:31:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2205.codfw.wmnet with reason: host reimage [11:33:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72002 and previous config saved to /var/cache/conftool/dbconfig/20250113-113347-root.json [11:38:10] (03CR) 10JMeybohm: [C:03+2] admin_ng RBAC: Fix prometheus clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109728 (https://phabricator.wikimedia.org/T383413) (owner: 10JMeybohm) [11:40:33] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109733 (https://phabricator.wikimedia.org/T383413) (owner: 10JMeybohm) [11:41:47] (03Merged) 10jenkins-bot: admin_ng RBAC: Fix prometheus clusterrole [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109728 (https://phabricator.wikimedia.org/T383413) (owner: 10JMeybohm) [11:43:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2204.codfw.wmnet with OS bookworm [11:44:35] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:44:37] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:44:39] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [11:44:44] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:44:45] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:44:49] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:44:51] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[11:44:57] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:44:58] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [11:45:02] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [11:45:03] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [11:45:10] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [11:45:11] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [11:45:18] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [11:45:20] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [11:45:23] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:45:24] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [11:45:28] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:46:49] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [11:48:02] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [11:48:31] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [11:48:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72004 and previous config saved to /var/cache/conftool/dbconfig/20250113-114852-root.json [11:48:59] jouncebot: nowandnext [11:48:59] For the next 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1100) [11:48:59] In 1 hour(s) and 11 minute(s): Create new tables for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1300) [11:49:14] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [11:49:18] (03PS1) 10Reedy: Fix 
exceptions preventing user from continuing past license deeds [extensions/UploadWizard] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1110750 (https://phabricator.wikimedia.org/T383415) [11:49:25] (03CR) 10Reedy: [C:03+2] Fix exceptions preventing user from continuing past license deeds [extensions/UploadWizard] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1110750 (https://phabricator.wikimedia.org/T383415) (owner: 10Reedy) [11:49:37] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [11:50:07] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [11:50:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2205.codfw.wmnet with OS bookworm [11:50:48] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Degraded RAID due to failed sdy on ms-be2075 - https://phabricator.wikimedia.org/T383530#10452947 (10MatthewVernon) p:05Triage→03High [11:50:55] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1110743 (owner: 10Muehlenhoff) [11:51:19] !log disabling puppet on all hosts running kubelet - T383413 [11:51:21] !log homer 'lsw1-c6-codfw*' commit 'T377877' [11:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:22] T383413: Remove the kubelet readOnlyPort - https://phabricator.wikimedia.org/T383413 [11:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:26] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [11:51:36] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10452956 (10MatthewVernon) @Jhancock.wm one of the SSDs in this host looks unhappy now too (T383530), could you get that looked at at the same time, please? 
[11:52:27] (03CR) 10JMeybohm: [C:03+2] kubelet: Use the chained certificate for TLS [puppet] - 10https://gerrit.wikimedia.org/r/1109733 (https://phabricator.wikimedia.org/T383413) (owner: 10JMeybohm) [11:52:51] (03PS2) 10Ladsgroup: Add new file tables to WMCS views [puppet] - 10https://gerrit.wikimedia.org/r/1110046 (https://phabricator.wikimedia.org/T383491) [11:53:06] !log homer 'lsw1-d1-codfw*' commit 'T377877' [11:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:58] !log homer 'cr*codfw*' commit 'T377877' [11:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:31] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 140, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:56:00] (03CR) 10FNegri: Add new file tables to WMCS views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1110046 (https://phabricator.wikimedia.org/T383491) (owner: 10Ladsgroup) [11:57:55] !log re-enabling puppet on all hosts running kubelet - T383413 [11:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:58] T383413: Remove the kubelet readOnlyPort - https://phabricator.wikimedia.org/T383413 [11:58:14] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2203-2205].codfw.wmnet [11:58:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2203-2205].codfw.wmnet [11:58:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1076-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:00:58] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw 
kubernetes nodes - https://phabricator.wikimedia.org/T383341#10452988 (10Jelto) [12:02:49] (03Merged) 10jenkins-bot: Fix exceptions preventing user from continuing past license deeds [extensions/UploadWizard] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1110750 (https://phabricator.wikimedia.org/T383415) (owner: 10Reedy) [12:02:52] (03PS4) 10Hnowlan: rest-gateway: add params to config, rework citoid path matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/973362 (https://phabricator.wikimedia.org/T329049) [12:04:56] (03CR) 10Muehlenhoff: [C:03+2] Revert "Remove access for aitolkyn" [puppet] - 10https://gerrit.wikimedia.org/r/1110743 (owner: 10Muehlenhoff) [12:05:11] (03PS1) 10Jelto: Rename kubernetes20[40-41] to wikikube-worker220[6-7] [puppet] - 10https://gerrit.wikimedia.org/r/1110752 (https://phabricator.wikimedia.org/T377877) [12:10:34] (03PS2) 10Muehlenhoff: Bump access date and update point of contact [puppet] - 10https://gerrit.wikimedia.org/r/1110744 [12:12:19] (03PS1) 10Reedy: Improve error summary [extensions/UploadWizard] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1110757 (https://phabricator.wikimedia.org/T381333) [12:12:31] (03PS2) 10Reedy: Fix UW error summary [extensions/UploadWizard] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1110755 (https://phabricator.wikimedia.org/T383182) [12:13:23] (03PS1) 10Marostegui: dbproxy2005: Change m1 master [puppet] - 10https://gerrit.wikimedia.org/r/1110758 (https://phabricator.wikimedia.org/T373579) [12:13:55] (03CR) 10Muehlenhoff: [C:03+2] Bump access date and update point of contact [puppet] - 10https://gerrit.wikimedia.org/r/1110744 (owner: 10Muehlenhoff) [12:17:45] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host 
wikikube-worker[1008-1010,1013-1014,1017-1018].eqiad.wmnet [12:17:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10453015 (10ops-monitoring-bot) depool host wikikube-worker[1008-1010,1013-1014,1017-1018].eqiad.wmnet by kamila@cumin1002 with reason: Decommissioning nodes [12:18:37] !log reedy@deploy2002 Synchronized php-1.44.0-wmf.11/extensions/UploadWizard/: T383415 (duration: 13m 05s) [12:18:40] T383415: [wmf.11 - regression] Custom tags not working with UploadWizard - https://phabricator.wikimedia.org/T383415 [12:21:44] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1008-1010,1013-1014,1017-1018].eqiad.wmnet [12:21:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10453027 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by kamila@cumin1002 depool for host wikikube-worker[1008-1010,1013-1014,1017-1018]... 
[12:24:02] (03PS1) 10Marostegui: wmnet: Change m2 master [dns] - 10https://gerrit.wikimedia.org/r/1110763 [12:24:29] (03CR) 10Marostegui: "Once the key is verified, this has my +1" [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto) [12:25:25] !log Switch m2-master proxy [12:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:00] (03CR) 10Marostegui: [C:03+2] wmnet: Change m2 master [dns] - 10https://gerrit.wikimedia.org/r/1110763 (owner: 10Marostegui) [12:26:04] !log marostegui@dns1006 START - running authdns-update [12:26:43] (03CR) 10MVernon: [C:03+1] "mvernon@cumin2002:~$ host db2132" [puppet] - 10https://gerrit.wikimedia.org/r/1110758 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [12:27:09] (03CR) 10Marostegui: [C:03+2] dbproxy2005: Change m1 master [puppet] - 10https://gerrit.wikimedia.org/r/1110758 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [12:27:23] moritzm: ok to merge your change? 
[12:27:46] !log marostegui@dns1006 END - running authdns-update [12:27:49] moritzm: It looks very safe to merge, so mergning [12:27:51] merging [12:28:22] marostegui: sorry, yes [12:28:28] moritzm: Merged :( [12:28:30] :) [12:28:33] thx :-) [12:32:18] 06SRE: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557 (10MoritzMuehlenhoff) 03NEW [12:32:25] 06SRE: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#10453048 (10MoritzMuehlenhoff) p:05Triage→03High [12:42:40] (03PS1) 10Muehlenhoff: codesearch: Remove obsolete apt pinning code for buster [puppet] - 10https://gerrit.wikimedia.org/r/1110767 (https://phabricator.wikimedia.org/T367479) [12:43:18] (03CR) 10Kamila Součková: [C:03+2] decom wikikube-worker10[08-10,13,14,17,18] [puppet] - 10https://gerrit.wikimedia.org/r/1109712 (https://phabricator.wikimedia.org/T375842) (owner: 10Kamila Součková) [12:47:55] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv [12:47:55] e - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:47:55] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, 
AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv [12:47:55] e - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:49:57] !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1008-1010].eqiad.wmnet [12:50:47] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts wikikube-worker[1008-1010].eqiad.wmnet [12:53:26] !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1008-1010].eqiad.wmnet [12:57:16] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:57:40] FIRING: [6x] KubernetesRsyslogDown: rsyslog on wikikube-worker1009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:59:21] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:00:04] Daimona: It is that lovely time of the day again! You are hereby commanded to deploy Create new tables for the CampaignEvents extension. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1300). 
[13:03:31] o/ [13:03:37] o/ [13:04:29] I guess we can get started? [13:07:01] yes [13:07:52] (03PS1) 10Muehlenhoff: stat: Don't install go from backports [puppet] - 10https://gerrit.wikimedia.org/r/1110769 (https://phabricator.wikimedia.org/T383557) [13:08:13] (03CR) 10CI reject: [V:04-1] stat: Don't install go from backports [puppet] - 10https://gerrit.wikimedia.org/r/1110769 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [13:10:33] !log Creating new DB tables for the CampaignEvents extension in x1.testwiki, x1.test2wiki, x1.officewiki, and x1.wikishared # T379294 T381424 [13:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:38] T379294: Create new DB table for storing wikis of event - https://phabricator.wikimedia.org/T379294 [13:10:38] T381424: Create DB schema for storing topics of event - https://phabricator.wikimedia.org/T381424 [13:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:13:59] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:14:08] cmelo: I have created all tables everywhere. Can you test on metawiki that there's nothing broken? 
[13:14:20] I'll do testwiki [13:15:32] (03PS2) 10Muehlenhoff: stat: Don't install go from backports [puppet] - 10https://gerrit.wikimedia.org/r/1110769 (https://phabricator.wikimedia.org/T383557) [13:16:51] And it seems fine to me [13:17:23] ok [13:18:12] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1008-1010].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [13:18:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1008-1010].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [13:18:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:18:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-worker[1008-1010].eqiad.wmnet [13:18:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10453142 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: `wikikube-worker[1008-1010].eqiad.wmnet` - wikikube-worker100... [13:19:24] Is meta ok? 
If so, we're done [13:19:53] !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1013-1014,1017-1018].eqiad.wmnet [13:20:49] (03PS6) 10Cathal Mooney: Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) [13:20:55] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:20:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:20:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:21:18] still testing on meta, sorry having password issues [13:21:40] just received a page [13:21:48] checking [13:21:55] arnaudb: ^ [13:21:56] !ack 74593 [13:21:56] Attempt to ack incident 74593 failed. [13:22:03] !incidents [13:22:03] 5588 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [13:22:06] (03CR) 10Volans: [C:03+1] "LGTM, verified key with Federico on a quick call." 
[puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto) [13:22:15] I guess it did work :) [13:22:21] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:22:23] I acked it too from the app, maybe that's why [13:22:30] ah it was the inapp num, mybad haha [13:23:42] FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:24:09] !log bounce thanos-query on titan1* [13:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:50] Ok tested!! [13:25:29] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:25:34] OK great! I'm going to close the tasks then, and see you again here in 35 minutes :) [13:25:46] !log bounce thanos-store on titan1* [13:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:49] thank you!!! 
[13:25:50] (03PS2) 10Kamila Součková: kubernetes: reclaim eqiad videoscaler hosts [puppet] - 10https://gerrit.wikimedia.org/r/1109469 (https://phabricator.wikimedia.org/T354791) [13:27:04] actually I'll depool eqiad from thanos [13:27:24] !log filippo@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=thanos-swift,name=eqiad [13:27:54] !log filippo@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=thanos-swift,name=eqiad [13:28:03] !log filippo@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=thanos-query,name=eqiad [13:28:15] that was my bad, I depooled thanos-swift not thanos-query, now fixed [13:28:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:31:10] (03PS2) 10Jelto: Rename kubernetes20[40-41] to wikikube-worker220[6-7] [puppet] - 10https://gerrit.wikimedia.org/r/1110752 (https://phabricator.wikimedia.org/T377877) [13:31:53] (03CR) 10CI reject: [V:04-1] Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney) [13:31:55] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:31:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:32:39] should recover soon [13:33:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[13:35:00] (03CR) 10AikoChou: [C:03+1] api-gateway: add reference quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109666 (https://phabricator.wikimedia.org/T378495) (owner: 10Ilias Sarantopoulos) [13:35:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:38:09] bblack arnaudb I suspect root cause was a query of death, I'll dig deeper shortly and going back to lunch [13:38:24] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:38:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db[2134,2160,2234].codfw.wmnet with reason: maintenance [13:38:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2134,2160,2234].codfw.wmnet with reason: maintenance [13:39:17] PROBLEM - Host dbprov2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:39:23] (03CR) 10Klausman: [V:03+2 C:03+2] api-gateway: add reference quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109666 (https://phabricator.wikimedia.org/T378495) (owner: 10Ilias Sarantopoulos) [13:40:03] (03CR) 10Jforrester: "Oops. Yes." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109752 (https://phabricator.wikimedia.org/T376305) (owner: 10Ladsgroup) [13:40:42] (03Merged) 10jenkins-bot: api-gateway: add reference quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109666 (https://phabricator.wikimedia.org/T378495) (owner: 10Ilias Sarantopoulos) [13:40:44] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874#10453212 (10MatthewVernon) @VRiley-WMF when I look at the system now, the OS sees the extra disk (since 19:02:51 on 10 Jan, a few minutes after a reboot?). So I'm not sure what you... [13:41:24] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:41:45] RECOVERY - Host dbprov2003 is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms [13:42:21] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:43:24] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:09] (03PS1) 10Marostegui: mariadb: Productionize db2234 [puppet] - 10https://gerrit.wikimedia.org/r/1110772 (https://phabricator.wikimedia.org/T373579) [13:45:07] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db2234 [puppet] - 10https://gerrit.wikimedia.org/r/1110772 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [13:46:13] (03CR) 10JMeybohm: [C:03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1110752 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [13:48:02] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
wikikube-worker[1013-1014,1017-1018].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [13:48:23] (03PS1) 10Slyngshede: Notify managers of closed requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1110773 [13:48:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1013-1014,1017-1018].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [13:48:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:48:46] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-worker[1013-1014,1017-1018].eqiad.wmnet [13:48:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10453228 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: `wikikube-worker[1013-1014,1017-1018].eqiad.wmnet` - wikiku... [13:49:35] (03CR) 10Marostegui: [C:03+1] Add myself (fceratto) to ops [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto) [13:50:12] (03CR) 10Marostegui: [C:03+1] "Amir, as you are the onboarding buddy, can you merge and deploy?" [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto) [13:51:09] (03CR) 10Ladsgroup: "Sure. 
I want to check with Moritz quickly and then check the key oob" [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto) [13:52:10] !log homer cr*eqiad* commit 'wikikube decoms' [13:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72006 and previous config saved to /var/cache/conftool/dbconfig/20250113-135410-root.json [13:54:58] I caught someone's homer change again: something about `2a02:ec80:a000:fe01::1/64`, can I commit? [13:55:22] diff: https://www.irccloud.com/pastebin/9ti08ybs/ [13:58:52] kamila_: o/ for what device? [13:59:26] elukey: cr1-eqiad [13:59:39] I can check commits but last week topranks was adding configs for cloud, I guess it is safe but I can tell you in a sec [13:59:54] thanks a ton! [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1400). Please do the needful. [14:00:05] Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:21] o/ [14:02:05] (03CR) 10JMeybohm: [C:03+2] prometheus::k8s: Move away from kubelet readOnlyPort [puppet] - 10https://gerrit.wikimedia.org/r/1109734 (https://phabricator.wikimedia.org/T383413) (owner: 10JMeybohm) [14:02:42] kamila_: sorry, I added so many addresses in Netbox on Friday I must have forgot that one [14:02:57] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[2040-2041].codfw.wmnet [14:02:57] it's ok to proceed thank you :) [14:03:01] no worries topranks :-) [14:03:07] thanks! [14:03:13] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1110769 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [14:03:21] ok perfect :) [14:04:24] !log filippo@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=thanos-query,name=eqiad [14:04:52] I’m a bit busy rn but I can deploy if nobody else is available [14:04:59] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:05:19] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:05:53] (03CR) 10Muehlenhoff: [C:03+2] stat: Don't install go from backports [puppet] - 10https://gerrit.wikimedia.org/r/1110769 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [14:05:55] o/ [14:06:15] o/ [14:06:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[2040-2041].codfw.wmnet [14:07:25] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts eventlog1003.eqiad.wmnet [14:07:33] !log klausman@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [14:08:06] !log klausman@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [14:09:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72008 and previous config saved to /var/cache/conftool/dbconfig/20250113-140916-root.json [14:09:48] (03CR) 10Jelto: [C:03+2] Rename kubernetes20[40-41] to wikikube-worker220[6-7] [puppet] - 10https://gerrit.wikimedia.org/r/1110752 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [14:10:04] FIRING: [2x] ErrorBudgetBurn: wdqs - wdqs-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:10:40] (03CR) 10Muehlenhoff: "Just to close the loop; good to merge given the onboarding buddy thinks the onboarding has proceeded to the state where 
global root makes " [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto) [14:11:12] (03PS1) 10Btullis: Remove CNAME for eventlogging.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1110775 (https://phabricator.wikimedia.org/T383276) [14:12:23] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/1110721 (https://phabricator.wikimedia.org/T383276) (owner: 10Muehlenhoff) [14:13:40] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2040 to wikikube-worker2206 [14:14:01] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:14:13] (03CR) 10Btullis: [C:03+1] "Late to the party, but thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1110721 (https://phabricator.wikimedia.org/T383276) (owner: 10Muehlenhoff) [14:15:04] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:16:34] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:16:35] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2041 to wikikube-worker2207 [14:17:01] alright, I should be able to deploy now [14:17:04] No other deployers around I assume? [14:17:11] Oh [14:17:22] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2040 to wikikube-worker2206 - jelto@cumin1002" [14:17:41] Lucas_WMDE: thank you! 
I'm sorry that it's always on you [14:17:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2040 to wikikube-worker2206 - jelto@cumin1002" [14:17:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:17:59] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2206 [14:18:16] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:18:20] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2206 [14:18:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109832 (https://phabricator.wikimedia.org/T380078) (owner: 10Daimona Eaytoy) [14:18:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2040 to wikikube-worker2206 [14:19:04] (03Merged) 10jenkins-bot: prod: Enable $wgCampaignEventsEnableEventWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109832 (https://phabricator.wikimedia.org/T380078) (owner: 10Daimona Eaytoy) [14:19:22] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1109832|prod: Enable $wgCampaignEventsEnableEventWikis (T380078)]] [14:19:26] T380078: Enable the event wikis feature in production - https://phabricator.wikimedia.org/T380078 [14:21:25] !log root@cumin1002 START - Cookbook sre.puppet.renew-cert for dbprov2003.codfw.wmnet: Renew puppet certificate - root@cumin1002 [14:21:42] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2041 to wikikube-worker2207 - jelto@cumin1002" [14:21:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera 
data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2041 to wikikube-worker2207 - jelto@cumin1002" [14:21:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:21:46] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2207 [14:22:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2207 [14:22:22] (03PS1) 10Ottomata: Revert "config: remove eventbus instrumentation setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110776 [14:22:42] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [14:22:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2041 to wikikube-worker2207 [14:23:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110776 (owner: 10Ottomata) [14:23:31] !log klausman@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [14:24:02] !log klausman@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [14:24:13] !log root@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for dbprov2003.codfw.wmnet: Renew puppet certificate - root@cumin1002 [14:24:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72009 and previous config saved to /var/cache/conftool/dbconfig/20250113-142421-root.json [14:24:22] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1109832|prod: Enable $wgCampaignEventsEnableEventWikis (T380078)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:24:26] T380078: Enable the event wikis feature in production - 
https://phabricator.wikimedia.org/T380078 [14:24:29] Daimona: please test :) [14:25:02] 06SRE, 10SRE-Access-Requests: Requesting shell access to analytics-privatedata for Katherine Graessle - https://phabricator.wikimedia.org/T383241#10453348 (10Kgraessle) >>! In T383241#10450029, @Dzahn wrote: >> no such identity: /Users/katherinegraessle/.ssh/prod.key: No such file or directory > > It is tryin... [14:25:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:25:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts eventlog1003.eqiad.wmnet [14:25:34] (03PS1) 10Ottomata: Revert^2 "config: remove eventbus instrumentation setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110777 [14:25:35] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/1110775 (https://phabricator.wikimedia.org/T383276) (owner: 10Btullis) [14:25:57] (03Abandoned) 10Ottomata: Revert "config: remove eventbus instrumentation setting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110776 (owner: 10Ottomata) [14:26:00] (03CR) 10Stevemunene: [C:03+1] Remove CNAME for eventlogging.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1110775 (https://phabricator.wikimedia.org/T383276) (owner: 10Btullis) [14:26:06] I can do testwiki again, maybe cmelo you can do meta and HouseOfM you take officewiki? 
[14:26:25] (03CR) 10Jforrester: [C:03+1] CommonSettings: Set 'lang=en' on Wikimedia Foundation entry in $wgFooterIcons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110053 (https://phabricator.wikimedia.org/T383501) (owner: 10Reedy) [14:26:37] Wait [14:26:38] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2206.codfw.wmnet wikikube-worker2207.codfw.wmnet on all recursors [14:26:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2206.codfw.wmnet wikikube-worker2207.codfw.wmnet on all recursors [14:26:52] Look around Ted, you're all alone [14:27:02] ohno [14:27:22] Well I'm going to do testwiki for the time being :D [14:29:27] mwdebug logstash looks clear so far [14:29:47] (03CR) 10Btullis: [C:03+2] Remove CNAME for eventlogging.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1110775 (https://phabricator.wikimedia.org/T383276) (owner: 10Btullis) [14:30:04] !log btullis@dns1004 START - running authdns-update [14:31:47] !log btullis@dns1004 END - running authdns-update [14:33:37] (03PS3) 10Tiziano Fogli: thanos-rule: search for gaps in thanos-rule recording rules [alerts] - 10https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756) [14:34:56] I've done somewhat more extensive tests on testwiki and it looks good [14:35:09] But beta logstash seems broken [14:35:10] (03PS1) 10Volans: enum: remove type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/1110778 [14:36:16] Ah, SNAFU, I see https://phabricator.wikimedia.org/T346402 [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:16] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2206.codfw.wmnet with OS bookworm [14:37:18] !log 
jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2207.codfw.wmnet with OS bookworm [14:37:26] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2206 [14:37:32] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:37:40] wait this isn't beta [14:38:01] nvm, not enough caffeine [14:39:04] I was about to ask how this was relevant ^^ [14:39:21] still nothing in mwdebug logstash [14:39:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72010 and previous config saved to /var/cache/conftool/dbconfig/20250113-143926-root.json [14:39:29] one warning “inconsistent revision ID” and one info about executing pygmentize [14:39:47] hm [14:39:52] but the warning does come from https://test.wikipedia.org/wiki/Event:T380078?action=edit&veswitched=1 [14:39:53] T380078: Enable the event wikis feature in production - https://phabricator.wikimedia.org/T380078 [14:39:58] that’s an awfully suspicious page title isn’t it [14:40:24] * Lucas_WMDE codesearches [14:40:34] !log otto@deploy2002 Started deploy [analytics/refinery@f3945ee] (hadoop-test): gobblin eventlogging_legacy - use EventStreamConfig to pull topics [14:40:41] Yeah sorry, I just opened the right logstash :D Indeed, no errors. 
That's the page I was using to test, and I'm pretty sure that error happens all the time in prod [14:40:41] apparently it comes from this https://gerrit.wikimedia.org/g/mediawiki/core/+/f1f6f7cfe6494fb05b8f626b829897b84c0217d8/includes/parser/ParserCache.php#460 [14:40:53] !log dcausse@deploy2002 Started deploy [airflow-dags/search@8c96899]: search: fix glent, import_cirrus_indexes and transfer_to_es [14:41:00] !log otto@deploy2002 Finished deploy [analytics/refinery@f3945ee] (hadoop-test): gobblin eventlogging_legacy - use EventStreamConfig to pull topics (duration: 01m 37s) [14:41:04] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2206 - jelto@cumin1002" [14:41:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2206 - jelto@cumin1002" [14:41:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:41:08] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2206.codfw.wmnet 167.32.192.10.in-addr.arpa 7.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:41:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2206.codfw.wmnet 167.32.192.10.in-addr.arpa 7.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:41:11] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2206 [14:41:12] you’re right, it’s the #5 entry on mediawiki-warnings [14:41:14] probably okay to ignore then [14:41:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2206 [14:41:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan 
(exit_code=0) for host wikikube-worker2206 [14:41:32] do you want to wait for the others or is it okay to deploy? [14:41:56] !log disabling puppet on all hosts running kubelet - T383413 [14:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:59] T383413: Remove the kubelet readOnlyPort - https://phabricator.wikimedia.org/T383413 [14:42:01] Yeah it's OK but I want to make sure that there's a task for it, because 8k errors in 15 minutes is definitely spam [14:42:04] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2207 [14:42:09] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:42:37] !log dcausse@deploy2002 Finished deploy [airflow-dags/search@8c96899]: search: fix glent, import_cirrus_indexes and transfer_to_es (duration: 01m 44s) [14:42:45] T358708 apparently? [14:42:45] T358708: Inconsistent revision ID - https://phabricator.wikimedia.org/T358708 [14:43:01] (03CR) 10JMeybohm: [C:03+2] kubelet: Disable the readOnlyPort [puppet] - 10https://gerrit.wikimedia.org/r/1109735 (https://phabricator.wikimedia.org/T383413) (owner: 10JMeybohm) [14:43:43] Yup just got there. 
I'm going to comment because from the task it isn't clear what the volume of these warnings is [14:43:51] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Continuing with sync [14:43:56] ok [14:44:33] !log otto@deploy2002 Started deploy [analytics/refinery@f3945ee]: gobblin eventlogging_legacy - use EventStreamConfig to pull topics [14:45:34] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2207 - jelto@cumin1002" [14:45:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2207 - jelto@cumin1002" [14:45:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:45:38] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2207.codfw.wmnet 166.32.192.10.in-addr.arpa 6.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:45:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2207.codfw.wmnet 166.32.192.10.in-addr.arpa 6.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:45:42] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2207 [14:46:01] !log otto@deploy2002 Finished deploy [analytics/refinery@f3945ee]: gobblin eventlogging_legacy - use EventStreamConfig to pull topics (duration: 01m 27s) [14:47:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2207 [14:47:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2207 [14:47:32] (03CR) 10Cathal Mooney: [C:03+1] "Looks good based on the description but I'll need to take your word this is the required 
fix :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1110778 (owner: 10Volans) [14:48:28] !log re-enabling puppet on all hosts running kubelet - T383413 [14:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:31] T383413: Remove the kubelet readOnlyPort - https://phabricator.wikimedia.org/T383413 [14:49:19] !log installing glibc bugfix updates for Bookworm [14:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:23] 06SRE, 10Ceph, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10453489 (10cmooney) @dcaro is there anything left to be done here? I see traffic profiled in the low and high classes across the cloud switc... [14:51:27] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109832|prod: Enable $wgCampaignEventsEnableEventWikis (T380078)]] (duration: 32m 04s) [14:51:30] T380078: Enable the event wikis feature in production - https://phabricator.wikimedia.org/T380078 [14:52:10] !log UTC afternoon backport+config window done [14:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:50] Noice, thank you! [14:53:25] Thank you! [14:53:44] (03CR) 10Volans: [C:03+1] "As I was not involved directly in Federico's onboarding, my +1 is purely on the key verification part :)" [puppet] - 10https://gerrit.wikimedia.org/r/1109716 (owner: 10Federico Ceratto) [14:53:47] :) wonderful as always Lucas_WMDE [14:54:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2240 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72011 and previous config saved to /var/cache/conftool/dbconfig/20250113-145432-root.json [14:55:18] (03CR) 10FNegri: "Removing my -1 after discussing with Joanna and the rest of the WMCS team. 
While we might need more permission levels for other people in " [puppet] - https://gerrit.wikimedia.org/r/1087919 (https://phabricator.wikimedia.org/T379159) (owner: FNegri)
[14:55:25] (PS2) Volans: enum: remove type hints [software/spicerack] - https://gerrit.wikimedia.org/r/1110778
[14:55:35] SRE, SRE-Access-Requests, cloud-services-team (FY2024/2025-Q1-Q2), Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10453508 (fnegri) Declined→Open Reopening after discussing with @joanna_borun and the rest of the WMCS team. Whi...
[14:55:45] SRE, SRE-Access-Requests, cloud-services-team (FY2024/2025-Q3-Q4), Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10453510 (fnegri)
[14:55:50] np :)
[14:57:38] SRE, DNS, MediaWiki-Platform-Team, Traffic, and 2 others: Set up auth.wikimedia.org - https://phabricator.wikimedia.org/T377187#10453514 (Tgr) a:Tgr
[14:58:51] !log btullis@deploy2002 Started deploy [airflow-dags/search@8c96899]: (no justification provided)
[14:59:05] (PS1) Ottomata: Remove some unused eventlogging references [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230)
[14:59:13] !log btullis@deploy2002 Finished deploy [airflow-dags/search@8c96899]: (no justification provided) (duration: 00m 24s)
[14:59:59] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2206.codfw.wmnet with reason: host reimage
[15:01:36] SRE, DNS, MediaWiki-Platform-Team, Traffic, and 2 others: Set up auth.wikimedia.org - https://phabricator.wikimedia.org/T377187#10453543 (Tgr) Notes from @elukey on IRC: > 17:12 < elukey> IIUC the config needs to run on the deployment servers via puppet run, so the correspondent yaml files for he...
[15:02:09] (PS1) Clément Goubert: mw-jobrunner: Log apache via rsyslog [deployment-charts] - https://gerrit.wikimedia.org/r/1110786 (https://phabricator.wikimedia.org/T293943)
[15:03:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2206.codfw.wmnet with reason: host reimage
[15:06:02] (CR) Btullis: "There are also references in site.pp and preseed.yaml as well as hieradata/role/common/kafka/jumbo/broker.yaml" [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[15:06:04] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2207.codfw.wmnet with reason: host reimage
[15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:37] (CR) Hnowlan: [C:+1] mw-jobrunner: Log apache via rsyslog [deployment-charts] - https://gerrit.wikimedia.org/r/1110786 (https://phabricator.wikimedia.org/T293943) (owner: Clément Goubert)
[15:09:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2207.codfw.wmnet with reason: host reimage
[15:09:58] (CR) Clément Goubert: [C:+2] mw-jobrunner: Log apache via rsyslog [deployment-charts] - https://gerrit.wikimedia.org/r/1110786 (https://phabricator.wikimedia.org/T293943) (owner: Clément Goubert)
[15:11:04] (Merged) jenkins-bot: mw-jobrunner: Log apache via rsyslog [deployment-charts] - https://gerrit.wikimedia.org/r/1110786 (https://phabricator.wikimedia.org/T293943) (owner: Clément Goubert)
[15:11:56] (CR) Ladsgroup: "A lot of those parts have been done and the rest will be done in pair sessions.
Given the seniority, I feel it's okay to go directly to th" [puppet] - https://gerrit.wikimedia.org/r/1109716 (owner: Federico Ceratto)
[15:12:43] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
[15:14:03] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
[15:15:15] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
[15:16:19] (CR) JHathaway: [C:+1] ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724 (https://phabricator.wikimedia.org/T309724) (owner: Muehlenhoff)
[15:16:29] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
[15:21:25] ops-codfw, SRE, DC-Ops: Degraded RAID on ml-serve2001 - https://phabricator.wikimedia.org/T383242#10453639 (Jhancock.wm) Open→Declined side effect of T383225
[15:21:54] (CR) Herron: [C:+1] "Thanks! Couple of minor comments inline" [alerts] - https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756) (owner: Tiziano Fogli)
[15:21:54] ops-codfw, SRE, DC-Ops: Degraded RAID on ml-serve2001 - https://phabricator.wikimedia.org/T383307#10453646 (Jhancock.wm) Open→Declined side effect of T383225
[15:22:37] (PS4) Tiziano Fogli: thanos-rule: search for gaps in thanos-rule recording rules [alerts] - https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756)
[15:23:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2206.codfw.wmnet with OS bookworm
[15:23:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db2234.codfw.wmnet with reason: maintenance
[15:23:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2234.codfw.wmnet with reason: maintenance
[15:28:08] (PS1) Filippo Giunchedi: thanos-query: write active queries to file [puppet] -
https://gerrit.wikimedia.org/r/1110798 (https://phabricator.wikimedia.org/T383570)
[15:28:12] sudo dbctl instance db2128 depool
[15:28:12] sudo dbctl config commit -m "Depool db2128 T383572"
[15:28:13] T383572: decommission db2128.codfw.wmnet - https://phabricator.wikimedia.org/T383572
[15:28:18] Great :)
[15:28:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2128 T383572', diff saved to https://phabricator.wikimedia.org/P72012 and previous config saved to /var/cache/conftool/dbconfig/20250113-152828-marostegui.json
[15:28:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2207.codfw.wmnet with OS bookworm
[15:29:05] (PS1) Marostegui: instances.yaml: Remove db2128 [puppet] - https://gerrit.wikimedia.org/r/1110799 (https://phabricator.wikimedia.org/T383572)
[15:29:42] (CR) Marostegui: [C:+2] instances.yaml: Remove db2128 [puppet] - https://gerrit.wikimedia.org/r/1110799 (https://phabricator.wikimedia.org/T383572) (owner: Marostegui)
[15:30:17] !log homer 'lsw1-c5-codfw*' commit 'T377877'
[15:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:21] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877
[15:30:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2128 from dbctl T383572', diff saved to https://phabricator.wikimedia.org/P72013 and previous config saved to /var/cache/conftool/dbconfig/20250113-153046-marostegui.json
[15:31:29] (PS1) Marostegui: db2128: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1110800 (https://phabricator.wikimedia.org/T383572)
[15:31:34] !log homer 'cr*codfw*' commit 'T377877'
[15:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:55] (CR) Marostegui: [C:+2] db2128: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1110800 (https://phabricator.wikimedia.org/T383572) (owner: Marostegui)
[15:32:12] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 136, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:32:32] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on dbprov2004.codfw.wmnet with reason: reboot
[15:32:46] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbprov2004.codfw.wmnet with reason: reboot
[15:32:59] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2206-2207].codfw.wmnet
[15:33:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2206-2207].codfw.wmnet
[15:33:46] ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383341#10453759 (Jelto)
[15:35:09] (CR) Herron: [C:+1] thanos-query: write active queries to file [puppet] - https://gerrit.wikimedia.org/r/1110798 (https://phabricator.wikimedia.org/T383570) (owner: Filippo Giunchedi)
[15:37:45] (CR) Herron: [C:+1] prometheus: k8s instances migration to prometheus::instances [puppet] - https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: Filippo Giunchedi)
[15:39:50] SRE-tools, Data-Persistence-Automations, DBA, Infrastructure-Foundations, and 2 others: spicerack mysql_legacy: support fetch metrics for instance - https://phabricator.wikimedia.org/T376596#10453800 (ABran-WMF) a:ABran-WMF→None
[15:40:48] (PS1) Jelto: Rename mw241[6-9] to wikikube-worker22[08-11] [puppet] - https://gerrit.wikimedia.org/r/1110802 (https://phabricator.wikimedia.org/T377877)
[15:41:10] PROBLEM - SSH on bast4005 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:41:24] SRE, Observability-Metrics, Goal, Patch-Needs-Improvement: Fully migrate producers off statsd -
https://phabricator.wikimedia.org/T205870#10453812 (fgiunchedi)
[15:41:54] (PS1) Marostegui: db2132: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1110803 (https://phabricator.wikimedia.org/T374623)
[15:42:02] (PS3) Federico Ceratto: Add myself (fceratto) to ops [puppet] - https://gerrit.wikimedia.org/r/1109716
[15:42:04] (CR) Ladsgroup: [C:+2] Add myself (fceratto) to ops [puppet] - https://gerrit.wikimedia.org/r/1109716 (owner: Federico Ceratto)
[15:42:05] (CR) Ladsgroup: [V:+2 C:+2] Add myself (fceratto) to ops [puppet] - https://gerrit.wikimedia.org/r/1109716 (owner: Federico Ceratto)
[15:42:10] RECOVERY - SSH on bast4005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:42:21] (CR) Marostegui: [C:+2] db2132: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1110803 (https://phabricator.wikimedia.org/T374623) (owner: Marostegui)
[15:42:22] (PS5) Tiziano Fogli: thanos-rule: search for gaps in thanos-rule recording rules [alerts] - https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756)
[15:43:19] SRE, Data-Platform-SRE, Observability-Metrics, superset.wikimedia.org: statsd and gunicorn metrics for superset - https://phabricator.wikimedia.org/T293761#10453823 (fgiunchedi) Open→Invalid superset has moved to k8s in the meantime, this task doesn't apply anymore
[15:43:52] (PS6) Tiziano Fogli: thanos-rule: search for gaps in thanos-rule recording rules [alerts] - https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756)
[15:44:15] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on dbprov2005.codfw.wmnet with reason: os upgrade
[15:44:32] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbprov2005.codfw.wmnet with reason: os upgrade
[15:44:59] (CR) JMeybohm: [C:+1] Rename
mw241[6-9] to wikikube-worker22[08-11] [puppet] - https://gerrit.wikimedia.org/r/1110802 (https://phabricator.wikimedia.org/T377877) (owner: Jelto)
[15:46:43] (CR) Herron: [C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1109680 (https://phabricator.wikimedia.org/T371087) (owner: Filippo Giunchedi)
[15:47:06] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2416-2419].codfw.wmnet
[15:47:14] (PS7) Tiziano Fogli: thanos-rule: search for gaps in thanos-rule recording rules [alerts] - https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756)
[15:48:34] (PS2) Dzahn: Revert "Add uz.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/1109778 (https://phabricator.wikimedia.org/T382730)
[15:48:40] (PS3) Kamila Součková: kubernetes: reclaim eqiad videoscaler hosts [puppet] - https://gerrit.wikimedia.org/r/1109469 (https://phabricator.wikimedia.org/T354791)
[15:51:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es1041 as es4 eqiad master dbmaint T382569', diff saved to https://phabricator.wikimedia.org/P72014 and previous config saved to /var/cache/conftool/dbconfig/20250113-155135-marostegui.json
[15:51:39] T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569
[15:51:42] (CR) Dzahn: [C:+2] Revert "Add uz.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/1109778 (https://phabricator.wikimedia.org/T382730) (owner: Dzahn)
[15:51:46] (PS2) Ottomata: Remove some unused eventlogging references [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230)
[15:51:50] (CR) Ottomata: "oo nice catch."
[puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[15:51:51] !log dzahn@dns1006 START - running authdns-update
[15:51:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1020 T382569', diff saved to https://phabricator.wikimedia.org/P72015 and previous config saved to /var/cache/conftool/dbconfig/20250113-155153-marostegui.json
[15:52:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2416-2419].codfw.wmnet
[15:52:26] !log DNS - removing uz.wikimedia.org - wiki was never created (T382730)
[15:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:29] T382730: Remove leftover DNS from declined chapter wikis causing language Wikipedia to resolve incorrectly on a *.wikimedia.org - https://phabricator.wikimedia.org/T382730
[15:52:46] (CR) Jelto: [C:+2] Rename mw241[6-9] to wikikube-worker22[08-11] [puppet] - https://gerrit.wikimedia.org/r/1110802 (https://phabricator.wikimedia.org/T377877) (owner: Jelto)
[15:52:48] (CR) Ottomata: "I wasn't sure if I should remove the ones in e.g.
modules/profile/files/sre/bullseye.yaml" [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[15:53:08] (PS1) Marostegui: es1020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1110805 (https://phabricator.wikimedia.org/T383199)
[15:53:36] !log dzahn@dns1006 END - running authdns-update
[15:53:49] !log DNS - removing uz.wikimedia.org - wiki was never created (T270987)
[15:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:54] T270987: Create a wiki for Wikimedians of the Uzbek language User Group - https://phabricator.wikimedia.org/T270987
[15:54:11] (CR) Marostegui: [C:+2] es1020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1110805 (https://phabricator.wikimedia.org/T383199) (owner: Marostegui)
[15:55:05] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2416 to wikikube-worker2208
[15:55:26] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[15:55:28] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin
[15:55:28] status
[15:57:21] SRE, Ceph, Infrastructure-Foundations, netops, Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10453986 (dcaro) >>! In T371501#10453489, @cmooney wrote: > @dcaro is there anything left to be done here? I see traffic profiled in the lo...
[15:57:53] sre-alert-triage, Infrastructure-Foundations: Alert in need of triage: BGP status (instance cr2-eqord) - https://phabricator.wikimedia.org/T383302#10453997 (cmooney) p:Triage→Low
[15:57:55] (PS1) Ottomata: logstash - remove legacy eventlogging related input and filters [puppet] - https://gerrit.wikimedia.org/r/1110807 (https://phabricator.wikimedia.org/T238230)
[15:57:55] (CR) Tiziano Fogli: "I also moved the new alert to a new file that is globally deployed (i.e., on Thanos)." [alerts] - https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756) (owner: Tiziano Fogli)
[15:58:12] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin
[15:58:12] status
[15:58:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1076-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:58:53] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2416 to wikikube-worker2208 - jelto@cumin1002"
[15:59:03] (CR) Filippo Giunchedi: [C:+1] "LGTM!"
[alerts] - https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756) (owner: Tiziano Fogli)
[15:59:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2416 to wikikube-worker2208 - jelto@cumin1002"
[15:59:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:59:08] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2208
[15:59:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2208
[16:00:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2416 to wikikube-worker2208
[16:01:03] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2417 to wikikube-worker2209
[16:01:24] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[16:02:46] (PS3) Ottomata: Remove some unused eventlogging references [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230)
[16:04:51] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2417 to wikikube-worker2209 - jelto@cumin1002"
[16:05:15] (CR) Hnowlan: [C:+1] kubernetes: reclaim eqiad videoscaler hosts [puppet] - https://gerrit.wikimedia.org/r/1109469 (https://phabricator.wikimedia.org/T354791) (owner: Kamila Součková)
[16:05:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2417 to wikikube-worker2209 - jelto@cumin1002"
[16:05:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:05:18] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2209
[16:05:25] !log
jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2209
[16:05:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on mw2418:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:06:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2417 to wikikube-worker2209
[16:06:06] sre-alert-triage, Infrastructure-Foundations: Alert in need of triage: Netbox report network (instance netbox1003) - https://phabricator.wikimedia.org/T383303#10454034 (cmooney) p:Triage→Medium a:cmooney Thanks for the task. It's firing because the fasw switch interfaces are enabled but not...
[16:08:10] (CR) Muehlenhoff: "These should stay" [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[16:08:58] sre-alert-triage, Infrastructure-Foundations: Alert in need of triage: Netbox report network (instance netbox1003) - https://phabricator.wikimedia.org/T383303#10454052 (cmooney)
[16:08:59] SRE, fundraising-tech-ops, Infrastructure-Foundations, netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802#10454053 (cmooney)
[16:09:03] (CR) Muehlenhoff: "Looks good, only thing missing is the record in manifests/site.pp" [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[16:09:05] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2418 to wikikube-worker2210
[16:09:26] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[16:09:42] (CR) Muehlenhoff: [C:+2] ganeti::known_hosts: Sort hosts [puppet] - https://gerrit.wikimedia.org/r/1110724 (https://phabricator.wikimedia.org/T309724) (owner: Muehlenhoff)
[16:09:45] PROBLEM - Host restbase2037 is DOWN: PING CRITICAL - Packet loss = 100%
[16:12:16] FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:12:54] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2418 to wikikube-worker2210 - jelto@cumin1002"
[16:13:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2418 to wikikube-worker2210 - jelto@cumin1002"
[16:13:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:13:14] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2210
[16:13:19] RECOVERY - Host restbase2037 is UP: PING OK - Packet loss = 0%, RTA = 30.23 ms
[16:13:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2210
[16:13:38] (CR) Tiziano Fogli: [C:+2] thanos-rule: search for gaps in thanos-rule recording rules [alerts] - https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756) (owner: Tiziano Fogli)
[16:14:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2418 to wikikube-worker2210
[16:14:25] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2419 to wikikube-worker2211
[16:14:47] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[16:14:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:14:52] (Merged)
jenkins-bot: thanos-rule: search for gaps in thanos-rule recording rules [alerts] - https://gerrit.wikimedia.org/r/1110747 (https://phabricator.wikimedia.org/T352756) (owner: Tiziano Fogli)
[16:18:44] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2419 to wikikube-worker2211 - jelto@cumin1002"
[16:19:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2419 to wikikube-worker2211 - jelto@cumin1002"
[16:19:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:19:02] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2211
[16:19:21] RESOLVED: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:19:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2211
[16:19:38] SRE, Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10454097 (MoritzMuehlenhoff)
[16:19:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:20:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2419 to wikikube-worker2211
[16:20:22] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2208.codfw.wmnet wikikube-worker2209.codfw.wmnet
wikikube-worker2210.codfw.wmnet wikikube-worker2211.codfw.wmnet on all recursors
[16:20:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2208.codfw.wmnet wikikube-worker2209.codfw.wmnet wikikube-worker2210.codfw.wmnet wikikube-worker2211.codfw.wmnet on all recursors
[16:23:34] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2208.codfw.wmnet with OS bookworm
[16:23:45] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2208
[16:23:55] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[16:27:15] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2208 - jelto@cumin1002"
[16:27:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2208 - jelto@cumin1002"
[16:27:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:27:19] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2208.codfw.wmnet 63.32.192.10.in-addr.arpa 3.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:27:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2208.codfw.wmnet 63.32.192.10.in-addr.arpa 3.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:27:23] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2208
[16:27:29] (CR) Majavah: [V:+1 C:+2] hieradata: Add cloud-private v6 supernets [puppet] - https://gerrit.wikimedia.org/r/1109983 (https://phabricator.wikimedia.org/T379283) (owner: Majavah)
[16:27:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0)
for host wikikube-worker2208
[16:27:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2208
[16:28:30] (PS4) Ottomata: Remove some unused eventlogging references [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230)
[16:28:35] (CR) Ottomata: "Oh! VM is removed. Done." [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[16:29:12] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[16:30:05] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1630).
[16:30:05] (CR) Ottomata: [C:+2] Remove some unused eventlogging references [puppet] - https://gerrit.wikimedia.org/r/1110787 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[16:30:31] (PS1) Marostegui: db2232: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1110811
[16:30:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10454201 (phaultfinder)
[16:30:55] (CR) Marostegui: [C:+2] db2232: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1110811 (owner: Marostegui)
[16:32:01] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2209.codfw.wmnet with OS bookworm
[16:32:12] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2209
[16:32:27] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[16:35:48] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2209 - jelto@cumin1002"
[16:35:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2209 - jelto@cumin1002"
[16:35:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:35:52] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2209.codfw.wmnet 64.32.192.10.in-addr.arpa 4.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:35:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2209.codfw.wmnet 64.32.192.10.in-addr.arpa 4.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:35:55] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2209
[16:35:57] (PS3) JMeybohm: k8s::package: Install version specific kubernetes-client package [puppet] - https://gerrit.wikimedia.org/r/1109704 (https://phabricator.wikimedia.org/T341984)
[16:35:58] (PS1) JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984)
[16:36:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2209
[16:36:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2209
[16:38:00] (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[16:38:35] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2210.codfw.wmnet with OS bookworm
[16:38:46] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2210
[16:38:57] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[16:42:44] ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes -
https://phabricator.wikimedia.org/T383341#10454264 (10Jhancock.wm) 05Open→03Resolved [16:43:11] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2210 - jelto@cumin1002" [16:43:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2210 - jelto@cumin1002" [16:43:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:43:16] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2210.codfw.wmnet 65.32.192.10.in-addr.arpa 5.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:43:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2210.codfw.wmnet 65.32.192.10.in-addr.arpa 5.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:43:19] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2210 [16:44:00] (03PS1) 10DLynch: Set Flow to read-only on phase 2a wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110814 (https://phabricator.wikimedia.org/T378834) [16:44:04] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2208.codfw.wmnet with reason: host reimage [16:44:46] (03CR) 10DLynch: "This *doesn't* include cawiki and mediawikiwiki because they need further processing." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110814 (https://phabricator.wikimedia.org/T378834) (owner: 10DLynch) [16:45:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110814 (https://phabricator.wikimedia.org/T378834) (owner: 10DLynch) [16:45:44] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10454289 (10elukey) Tried to copy the storcli64 binary to ms and presto nodes, these are the results: ` elukey@ms... [16:46:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2210 [16:46:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2210 [16:46:05] 06SRE, 10SRE-Access-Requests: Requesting shell access to analytics-privatedata for Katherine Graessle - https://phabricator.wikimedia.org/T383241#10454291 (10Kgraessle) @Dzahn I had a typo in my ~/.ssh/config, please disregard my last comment. This is working and I am able to connect successfully. We can c... [16:48:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2208.codfw.wmnet with reason: host reimage [16:48:11] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10454316 (10elukey) [16:49:11] 06SRE, 10SRE-Access-Requests: Requesting shell access to analytics-privatedata for Katherine Graessle - https://phabricator.wikimedia.org/T383241#10454321 (10Dzahn) 05Open→03Resolved a:03Dzahn @Kgraessle Perfect! Great to hear it works and thanks for the update. 
[16:51:00] PROBLEM - Host dbprov2005 is DOWN: PING CRITICAL - Packet loss = 100% [16:51:28] that's me, expired downtime [16:51:31] ignore [16:51:46] should be up soon [16:51:48] RECOVERY - Host dbprov2005 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [16:52:47] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2209.codfw.wmnet with reason: host reimage [16:53:55] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2211.codfw.wmnet with OS bookworm [16:53:58] jouncebot: nowandnext [16:53:58] For the next 0 hour(s) and 6 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1630) [16:53:58] In 1 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1800) [16:53:58] In 1 hour(s) and 6 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1800) [16:54:06] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2211 [16:54:42] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [16:56:10] !log root@cumin1002 START - Cookbook sre.puppet.renew-cert for dbprov2005.codfw.wmnet: Renew puppet certificate - root@cumin1002 [16:57:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2209.codfw.wmnet with reason: host reimage [16:58:27] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2211 - jelto@cumin1002" [16:58:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2211 - jelto@cumin1002" [16:58:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [16:58:32] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2211.codfw.wmnet 66.32.192.10.in-addr.arpa 6.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:58:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2211.codfw.wmnet 66.32.192.10.in-addr.arpa 6.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:58:36] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2211 [16:58:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1076-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:58:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2211 [16:58:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2211 [16:59:06] !log root@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for dbprov2005.codfw.wmnet: Renew puppet certificate - root@cumin1002 [16:59:32] (03PS2) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) [17:00:19] (03CR) 10CDanis: OpenTelemetry tracing to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109754 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [17:02:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104732 
(https://phabricator.wikimedia.org/T381379) (owner: 10Pppery) [17:03:07] (03PS3) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) [17:03:35] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2210.codfw.wmnet with reason: host reimage [17:06:17] (03PS1) 10Marostegui: orchestrator.conf.json.erb: Update whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1110819 [17:06:39] (03CR) 10Clément Goubert: [C:03+1] OpenTelemetry tracing to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109754 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [17:06:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2210.codfw.wmnet with reason: host reimage [17:07:50] (03PS1) 10Scott French: mediawiki: enable mesh telemetry in mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110818 [17:08:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2208.codfw.wmnet with OS bookworm [17:09:49] jouncebot: nowandnext [17:09:49] No deployments scheduled for the next 0 hour(s) and 50 minute(s) [17:09:50] In 0 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1800) [17:09:50] In 0 hour(s) and 50 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1800) [17:09:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109754 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [17:10:11] let's get tracing [17:10:44] (03Merged) 10jenkins-bot: OpenTelemetry tracing to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109754 
(https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [17:11:03] !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1109754|OpenTelemetry tracing to all wikis (T340552)]] [17:11:07] T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552 [17:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:13:04] (03PS3) 10Dzahn: Revert "add za.wikimedia.org and za.m.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1109777 (https://phabricator.wikimedia.org/T382730) [17:14:12] (03CR) 10Dzahn: [C:03+1] "manual rebase" [dns] - 10https://gerrit.wikimedia.org/r/1109777 (https://phabricator.wikimedia.org/T382730) (owner: 10Dzahn) [17:14:17] (03CR) 10Dzahn: [C:03+1] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1109777 (https://phabricator.wikimedia.org/T382730) (owner: 10Dzahn) [17:15:24] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2211.codfw.wmnet with reason: host reimage [17:15:38] (03PS4) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) [17:15:51] !log cdanis@deploy2002 cdanis: Backport for [[gerrit:1109754|OpenTelemetry tracing to all wikis (T340552)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:15:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2209.codfw.wmnet with OS bookworm [17:16:58] !log cdanis@deploy2002 cdanis: Continuing with sync [17:16:59] (03CR) 10Dzahn: [C:03+2] Revert "add za.wikimedia.org and za.m.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/1109777 
(https://phabricator.wikimedia.org/T382730) (owner: 10Dzahn) [17:17:16] !log dzahn@dns1006 START - running authdns-update [17:18:18] !log DNS - removing za.wikimedia.org and za.m.wikimedia.org - wiki was not created (T382730, T195926) [17:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:24] T382730: Remove leftover DNS from declined chapter wikis causing language Wikipedia to resolve incorrectly on a *.wikimedia.org - https://phabricator.wikimedia.org/T382730 [17:18:24] T195926: Create wiki for Wikimedia South Africa - https://phabricator.wikimedia.org/T195926 [17:18:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2211.codfw.wmnet with reason: host reimage [17:19:03] !log dzahn@dns1006 END - running authdns-update [17:19:10] (03PS5) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) [17:21:55] 06SRE, 10DNS, 06Traffic, 13Patch-For-Review: Remove leftover DNS from declined chapter wikis causing language Wikipedia to resolve incorrectly on a *.wikimedia.org - https://phabricator.wikimedia.org/T382730#10454523 (10Dzahn) 05In progress→03Resolved @Dylsss Thanks for reporting this! The 2 DNS r... 
[17:24:01] (03PS6) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) [17:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10454539 (10phaultfinder) [17:25:04] !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109754|OpenTelemetry tracing to all wikis (T340552)]] (duration: 14m 00s) [17:25:08] T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552 [17:25:29] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:25:48] 06SRE, 10DNS, 06Traffic, 13Patch-For-Review: Remove leftover DNS from declined chapter wikis causing language Wikipedia to resolve incorrectly on a *.wikimedia.org - https://phabricator.wikimedia.org/T382730#10454546 (10Dylsss) Thanks for actioning! [17:26:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2210.codfw.wmnet with OS bookworm [17:28:34] (03CR) 10Dzahn: [C:03+2] certificates: add wiki[m|p]edia.ro to ncredir Letsencrypt cert 7 [puppet] - 10https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) (owner: 10Dzahn) [17:28:53] 10ops-codfw, 06SRE, 06DC-Ops, 07Kubernetes: hw troubleshooting: Comm Error: backplane 0 for wikikube-worker2192.codfw.wmnet - https://phabricator.wikimedia.org/T383339#10454562 (10Jhancock.wm) @Jelto reseated all the cables on the backplane. give it another go and let me know if it needs another look. 
[17:29:14] 06SRE, 10Beta-Cluster-Infrastructure, 06MediaWiki-Platform-Team, 10MediaWiki-User-login-and-signup: Cannot log in or perform any actions on Beta Cluster wikis - https://phabricator.wikimedia.org/T383513#10454563 (10matmarex) [17:30:02] (03PS7) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) [17:31:41] 06SRE, 10Beta-Cluster-Infrastructure, 06MediaWiki-Platform-Team, 10MediaWiki-User-login-and-signup: Cannot log in or perform any actions on Beta Cluster wikis - https://phabricator.wikimedia.org/T383513#10454584 (10matmarex) Log for an example request: https://beta-logs.wmcloud.org/goto/a404432dceca139889a... [17:32:56] (03CR) 10Dzahn: [C:03+2] "ran puppet on acmechief2002 and it looked fine. it added to /etc/acme-chief/config.yaml and refreshed acme-chief service" [puppet] - 10https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) (owner: 10Dzahn) [17:33:19] mutante: acme-chief runs on 1002 :) [17:33:34] I triggered a puppet run there, and the certificate has been issued [17:33:50] vgutierrez: ack, I picked a random one from output of "cumin acme*", just wanted to see one puppet run work after merge [17:34:00] vgutierrez: thanks, ok:) [17:34:49] PROBLEM - HTTPS non-canonical-redirect-5 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [17:34:55] uh?
[17:35:00] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [17:35:06] PROBLEM - HTTPS non-canonical-redirect-6 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [17:35:10] PROBLEM - HTTPS non-canonical-redirect-3 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [17:35:12] PROBLEM - HTTPS non-canonical-redirect-4 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [17:35:12] PROBLEM - HTTPS non-canonical-redirect-2 on ncredir1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir [17:35:13] hmm [17:35:22] awwwr [17:35:30] nginx crashed on ncredir1001 [17:35:34] Jan 13 17:32:23 ncredir1001 nginx[2048770]: 2025/01/13 17:32:23 [warn] 2048770#2048770: could not build optimal map_hash, you should increase either map_hash_max_size: 2048 or map_hash_bucket_size: 64; ignoring map_hash_bucket_size [17:35:55] Jan 13 17:32:23 ncredir1001 nginx[2048770]: 2025/01/13 17:32:23 [emerg] 2048770#2048770: BIO_new_file("/etc/acmecerts/non-canonical-redirect-7/live/ec-prime256v1.ocsp") failed (SSL: error:80000002:system library::No such file or directory:calling fopen(/etc/acmecerts/non-canonical-redirect-7/live/ec-prime256v1.ocsp, rb) [17:35:57] soo. a new section was added [17:36:02] section 7 [17:36:04] bad timing? [17:36:27] let me trigger a puppet run on ncredir1001 [17:36:40] it can't handle more than 8 certs?
[17:36:45] 0 to 7 or something [17:37:48] RECOVERY - HTTPS non-canonical-redirect-5 on ncredir1001 is OK: SSL OK - OCSP staple validity for wikimedia.is has 289931 seconds left:Certificate wikimedia.is valid until 2025-04-06 06:57:02 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/Ncredir [17:37:54] also looking at puppet, but ncredir1002 [17:37:55] vgutierrez: ^ puppet run? [17:37:58] yes sukhe [17:38:00] looks good [17:38:02] RECOVERY - HTTPS non-canonical-redirect-1 on ncredir1001 is OK: SSL OK - OCSP staple validity for wikipedia.com has 237178 seconds left:Certificate wikipedia.com valid until 2025-03-30 22:53:54 +0000 (expires in 76 days) https://wikitech.wikimedia.org/wiki/Ncredir [17:38:06] RECOVERY - HTTPS non-canonical-redirect-6 on ncredir1001 is OK: SSL OK - OCSP staple validity for wikipedia.fi has 554813 seconds left:Certificate wikipedia.fi valid until 2025-02-27 05:38:56 +0000 (expires in 44 days) https://wikitech.wikimedia.org/wiki/Ncredir [17:38:10] RECOVERY - HTTPS non-canonical-redirect-3 on ncredir1001 is OK: SSL OK - OCSP staple validity for www.wikipedia.bg has 560869 seconds left:Certificate *.wikipedia.bg valid until 2025-02-07 03:20:41 +0000 (expires in 24 days) https://wikitech.wikimedia.org/wiki/Ncredir [17:38:12] RECOVERY - HTTPS non-canonical-redirect-4 on ncredir1001 is OK: SSL OK - OCSP staple validity for www.wikispecies.net has 227807 seconds left:Certificate *.wikispecies.net valid until 2025-03-21 05:49:19 +0000 (expires in 66 days) https://wikitech.wikimedia.org/wiki/Ncredir [17:38:12] RECOVERY - HTTPS non-canonical-redirect-2 on ncredir1001 is OK: SSL OK - OCSP staple validity for www.wikimania.com has 456107 seconds left:Certificate *.wikimania.com valid until 2025-03-21 07:49:44 +0000 (expires in 66 days) https://wikitech.wikimedia.org/wiki/Ncredir [17:38:17] phew:) [17:38:19] vgutierrez: nice thanks [17:38:22] so nginx tried to configure non-canonical-redirect-7 before acme-chief deployed it there [17:38:24] 
:] [17:38:30] aaa [17:38:38] aah [17:38:48] https://www.irccloud.com/pastebin/gr2fWV0O/ [17:38:53] pretty bad race condition [17:38:56] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2123.codfw.wmnet - https://phabricator.wikimedia.org/T383388#10454634 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:39:15] it looks like puppet ran on ncredir1001 during the non-canonical-redirect-7 issuance process [17:39:23] so it got the snakeoil cert rather than the good one [17:39:25] You know, on Friday afternoon I looked at this and was like "yea, no, don't merge on Fridays" [17:39:30] glad you were here too [17:39:49] but puppet run fixing it is cool of course [17:40:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2211.codfw.wmnet with OS bookworm [17:40:49] does it have certs for wikipedia.ro and wikimedia.ro now? [17:41:30] on ncredir1002 it did not crash on puppet run, *nod* [17:42:49] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2126.codfw.wmnet - https://phabricator.wikimedia.org/T383395#10454666 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:44:16] !log homer 'lsw1-c3-codfw*' commit 'T377877' [17:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:26] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [17:45:13] !log sudo homer 'cr*codfw*' commit 'T377877' [17:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:27] vgutierrez: I checked with openssl too and I see the .ro names on the "live" file.
all good :) ttyl [17:46:24] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 128, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:46:48] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2208-2211].codfw.wmnet [17:46:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2208-2211].codfw.wmnet [17:47:37] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383595 (10Jelto) 03NEW [17:48:15] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2135.codfw.wmnet - https://phabricator.wikimedia.org/T383426#10454740 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:51:42] RECOVERY - Disk space on analytics1075 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1075&var-datasource=eqiad+prometheus/ops [17:52:44] (03PS1) 10Jelto: Rename mw241[2-5] to wikikube-worker22[12-15] [puppet] - 10https://gerrit.wikimedia.org/r/1110822 (https://phabricator.wikimedia.org/T377877) [17:52:52] (03PS8) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) [17:53:02] RECOVERY - Hadoop NodeManager on analytics1075 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:53:56] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4788/co" [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [18:00:05] 
Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1800) [18:00:05] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T1800). [18:04:40] (03CR) 10Btullis: [C:03+1] logstash - remove legacy eventlogging related input and filters [puppet] - 10https://gerrit.wikimedia.org/r/1110807 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [18:05:38] (03PS1) 10Majavah: hieradata: Update striker-toolsbeta to 2025-01-13-165415-production [puppet] - 10https://gerrit.wikimedia.org/r/1110823 [18:06:18] (03CR) 10Kamila Součková: [C:03+1] Rename mw241[2-5] to wikikube-worker22[12-15] [puppet] - 10https://gerrit.wikimedia.org/r/1110822 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [18:06:51] (03CR) 10Majavah: [C:03+2] hieradata: Update striker-toolsbeta to 2025-01-13-165415-production [puppet] - 10https://gerrit.wikimedia.org/r/1110823 (owner: 10Majavah) [18:07:21] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:07:57] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:08:55] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53369 bytes in 7.722 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:09:11] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:09:40] (03CR) 10Andrea Denisse: "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1109680 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [18:09:53] (03CR) 10Andrea Denisse: [C:03+1] prometheus: add initial lv size to prometheus::instances [puppet] - 10https://gerrit.wikimedia.org/r/1109680 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [18:10:04] FIRING: [2x] ErrorBudgetBurn: wdqs - wdqs-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:10:22] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10454957 (10Jhancock.wm) i updated the ticket with that info. it might be related. still working with Dell. [18:10:23] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10454956 (10kamila) [18:13:33] (03CR) 10Kamila Součková: [C:03+2] kubernetes: reclaim eqiad videoscaler hosts [puppet] - 10https://gerrit.wikimedia.org/r/1109469 (https://phabricator.wikimedia.org/T354791) (owner: 10Kamila Součková) [18:14:43] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1445 to wikikube-worker1096 [18:14:49] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [18:15:43] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [18:15:45] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1110798 (https://phabricator.wikimedia.org/T383570) (owner: 10Filippo Giunchedi) [18:15:59] (03PS1) 10CDanis: haproxy: bwlim-by-path: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) [18:16:01] (03CR) 10Scott French: "Thanks in advance for the review, Hugh!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1110818 (owner: 10Scott French) [18:16:26] (03PS1) 10Dzahn: planet: remove smash.ro from Romanian feeds [puppet] - 10https://gerrit.wikimedia.org/r/1110827 (https://phabricator.wikimedia.org/T383580) [18:16:30] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [18:17:03] 06SRE, 06serviceops, 10WMF-General-or-Unknown: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078#10455000 (10andrea.denisse) [18:17:14] 06SRE, 06serviceops, 10WMF-General-or-Unknown: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078#10455002 (10andrea.denisse) [18:17:51] (03PS2) 10CDanis: haproxy: bwlim-by-path: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) [18:17:55] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [18:17:56] (03CR) 10Dzahn: [C:03+2] "domain is clearly for sale - and update service crashed trying to parse the feed" [puppet] - 10https://gerrit.wikimedia.org/r/1110827 (https://phabricator.wikimedia.org/T383580) (owner: 10Dzahn) [18:19:01] kamila_: we have a merge conflict. my side is harmless. yours might be more tricky, host renames. I leave it to you when to merge both at once. 
[18:19:36] (03PS3) 10CDanis: haproxy: bwlim-by-path: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) [18:19:43] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [18:19:45] mutante: doing it right now [18:19:51] kamila_: ack, thanks:) [18:21:04] done [18:21:18] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1446 to wikikube-worker1097 [18:21:27] 06SRE, 06serviceops, 10WMF-General-or-Unknown: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078#10455034 (10andrea.denisse) a:03Clement_Goubert Thanks Claime, I'm removing the o11y tag and assigning this to you as you currently have... [18:21:38] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [18:21:56] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1445 to wikikube-worker1096 - kamila@cumin1002" [18:22:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1445 to wikikube-worker1096 - kamila@cumin1002" [18:22:16] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:22:16] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1096 [18:23:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1096 [18:23:34] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1445 to wikikube-worker1096 [18:23:54] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455039 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.rename started by kamila@cumin1002 from mw1445 to wikikube-worker1096 completed: - mw1445 (**PASS**) - ✔️ Downt... [18:25:18] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1446 to wikikube-worker1097 - kamila@cumin1002" [18:25:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1446 to wikikube-worker1097 - kamila@cumin1002" [18:25:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:25:23] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1097 [18:26:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1097 [18:26:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1446 to wikikube-worker1097 [18:26:55] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1096.eqiad.wmnet wikikube-worker1097.eqiad.wmnet on all recursors [18:26:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1096.eqiad.wmnet wikikube-worker1097.eqiad.wmnet on all recursors [18:26:59] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455061 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from mw1446 to wikikube-worker1097 completed: - mw1446 (**PASS**) - ✔️ Downt... 
[18:27:48] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1096.eqiad.wmnet with OS bookworm [18:27:52] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1096 [18:27:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1096 [18:27:59] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1097.eqiad.wmnet with OS bookworm [18:28:02] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1097 [18:28:02] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1097 [18:28:03] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455070 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-worker1096.eqiad.wmnet with OS bookworm [18:28:04] (03PS4) 10CDanis: haproxy: bwlim-by-path: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) [18:28:14] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455071 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-worker1097.eqiad.wmnet with OS bookworm [18:28:24] (03PS5) 10CDanis: haproxy: bwlim-by-path: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) [18:28:26] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [18:28:27] 06SRE, 10Domains, 06Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080#10455072 (10Dzahn) New acme-chief config has been deployed and ncredir* hosts now have a TLS cert for 
wikimedia.ro and wikipedia.ro. [18:29:28] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [18:30:52] (03PS6) 10CDanis: haproxy: bwlim-by-path: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) [18:30:56] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [18:33:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [18:41:16] (03PS1) 10DCausse: search: update WDQS update lag SLI/SLO queries [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1110833 [18:42:49] (03CR) 10DCausse: "Categories are now reporting their lag in prometheus and seems to leak into the series used by this SLI/SLO." 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1110833 (owner: 10DCausse) [18:43:13] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1096.eqiad.wmnet with reason: host reimage [18:43:20] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1097.eqiad.wmnet with reason: host reimage [18:43:21] (03PS3) 10Scott French: shellbox-syntaxhighlight: 1 eqiad replica on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087579 (https://phabricator.wikimedia.org/T377038) [18:43:22] (03PS3) 10Scott French: shellbox-syntaxhighlight: all eqiad replicas on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087580 (https://phabricator.wikimedia.org/T377038) [18:43:24] (03PS3) 10Scott French: shellbox-syntaxhighlight: 1 codfw replica on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087581 (https://phabricator.wikimedia.org/T377038) [18:43:30] (03PS3) 10Scott French: shellbox-syntaxhighlight: all replicas on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087582 (https://phabricator.wikimedia.org/T377038) [18:44:59] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:45:23] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:47:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1096.eqiad.wmnet with reason: host reimage [18:47:59] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53369 bytes in 5.907 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:48:14] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:49:05] 
(03CR) 10CDanis: "pcc lgtm: https://puppet-compiler.wmflabs.org/output/1110826/5239/" [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [18:50:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1097.eqiad.wmnet with reason: host reimage [18:52:35] (03CR) 10Dzahn: [C:03+2] codesearch: Remove obsolete apt pinning code for buster [puppet] - 10https://gerrit.wikimedia.org/r/1110767 (https://phabricator.wikimedia.org/T367479) (owner: 10Muehlenhoff) [18:53:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [19:01:38] (03CR) 10Dzahn: [C:03+2] "noop on codesearch9.codesearch" [puppet] - 10https://gerrit.wikimedia.org/r/1110767 (https://phabricator.wikimedia.org/T367479) (owner: 10Muehlenhoff) [19:04:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1096.eqiad.wmnet with OS bookworm [19:04:53] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455197 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-worker1096.eqiad.wmnet with OS bookworm completed: - wikiku... [19:08:33] (03CR) 10Ssingh: [C:03+1] "Thanks for rolling this out everywhere!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [19:08:55] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1097.eqiad.wmnet with OS bookworm [19:09:09] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455203 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-worker1097.eqiad.wmnet with OS bookworm completed: - wikiku... [19:13:24] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:27:05] (03CR) 10Ryan Kemper: [C:03+1] "Can deploy later today" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1110833 (owner: 10DCausse) [19:28:24] RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:44:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485#10455426 (10VRiley-WMF) @Marostegui I have replaced the cable, could you check this? [19:46:23] 06SRE, 10Beta-Cluster-Infrastructure, 06MediaWiki-Platform-Team, 10MediaWiki-User-login-and-signup: Cannot log in or perform any actions on Beta Cluster wikis - https://phabricator.wikimedia.org/T383513#10455430 (10Tgr) >>! In T383513#10453786, @matmarex wrote: > Beta cluster Logstash data says that object... 
[19:56:41] jouncebot: nowandnext [19:56:42] No deployments scheduled for the next 1 hour(s) and 3 minute(s) [19:56:42] In 1 hour(s) and 3 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T2100) [19:59:18] (03CR) 10CDanis: [C:03+2] haproxy: bwlim-by-path: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/1110826 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [20:00:23] 06SRE, 10Beta-Cluster-Infrastructure, 06MediaWiki-Platform-Team, 10MediaWiki-User-login-and-signup: Cannot log in or perform any actions on Beta Cluster wikis - https://phabricator.wikimedia.org/T383513#10455480 (10Tgr) systemctl says ` Jan 11 08:09:58 deployment-sessionstore06 systemd[1]: cassandra.servic... [20:02:48] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874#10455483 (10VRiley-WMF) 05In progress→03Resolved a:03VRiley-WMF I had to take the server down in order to replace the drive. I will move forward with closing the ticket. [20:02:49] !log homer cr*eqiad* commit T354791 [20:05:33] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1096-1097].eqiad.wmnet [20:05:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1096-1097].eqiad.wmnet [20:05:52] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455510 (10ops-monitoring-bot) pool host wikikube-worker[1096-1097].eqiad.wmnet by kamila@cumin1002 with reason: None [20:05:56] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10455511 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by kamila@cumin1002 pool for host wikikube-worker[1096-1097].eqiad.wmnet completed: - wiki... 
[20:07:33] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620 (10kamila) 03NEW [20:10:02] (03CR) 10Mforns: "I do not fully understand what each line does, but I get this is related to the addition of the file and filerevision tables that we talke" [puppet] - 10https://gerrit.wikimedia.org/r/1110046 (https://phabricator.wikimedia.org/T383491) (owner: 10Ladsgroup) [20:10:21] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T383076#10455533 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF Duplicate [20:13:13] 06SRE, 10Observability-Metrics, 10superset.wikimedia.org, 10Data-Platform-SRE (2025.01.11 - 2025.01.31): statsd and gunicorn metrics for superset - https://phabricator.wikimedia.org/T293761#10455539 (10Gehel) [20:17:08] 06SRE, 10Beta-Cluster-Infrastructure, 06MediaWiki-Platform-Team, 10MediaWiki-User-login-and-signup: Cannot log in or perform any actions on Beta Cluster wikis - https://phabricator.wikimedia.org/T383513#10455563 (10Tgr) 05Open→03Resolved a:03Tgr Optimistically closing, maybe Cassandra just needs... 
[20:18:54] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10455570 (10VRiley-WMF) [20:18:55] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T383475#10455572 (10VRiley-WMF) →14Duplicate dup:03T382984 [20:20:13] (03CR) 10Cwhite: [C:03+2] "No new traffic since Jan 7: https://grafana.wikimedia.org/goto/QUy5p-DHR?orgId=1" [puppet] - 10https://gerrit.wikimedia.org/r/1110807 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:24:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10455622 (10phaultfinder) [20:29:03] (03PS1) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:30:14] (03PS1) 10Ottomata: logstash - remove absented input [puppet] - 10https://gerrit.wikimedia.org/r/1110844 (https://phabricator.wikimedia.org/T238230) [20:30:14] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:31:41] (03PS1) 10Ottomata: admin - ensure unused eventlogging groups are absent [puppet] - 10https://gerrit.wikimedia.org/r/1110845 (https://phabricator.wikimedia.org/T238230) [20:32:18] (03CR) 10CI reject: [V:04-1] admin - ensure unused eventlogging groups are absent [puppet] - 10https://gerrit.wikimedia.org/r/1110845 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:32:29] (03CR) 10Ottomata: "There are a couple of places where e.g. eventlogging-admins is included (webperf)." [puppet] - 10https://gerrit.wikimedia.org/r/1110845 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:39:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10455755 (10phaultfinder) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? 
UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T2100). nyaa~ [21:00:05] kemayo and Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:09] here [21:00:14] Also here [21:05:15] cdanis: Anyone around to do the deployment? [21:07:53] Kemayo: technically not an SRE responsibility 😅 but I'll help out [21:08:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110814 (https://phabricator.wikimedia.org/T378834) (owner: 10DLynch) [21:08:54] I'll admit that I might tend to lump SRE and Releng into the same bucket in my head. >_> [21:09:16] (03Merged) 10jenkins-bot: Set Flow to read-only on phase 2a wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110814 (https://phabricator.wikimedia.org/T378834) (owner: 10DLynch) [21:09:36] !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1110814|Set Flow to read-only on phase 2a wikis (T378834)]] [21:09:40] T378834: [Config] Set Flow to read-only at all *Phase 2a* wikis - https://phabricator.wikimedia.org/T378834 [21:11:33] Kemayo: would having deploy access be useful to you, btw? [21:12:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:12:27] also Kemayo your patch is on k8s-mwdebug [21:12:41] cdanis: It might occasionally. I'm the most common person to be doing backports for Editing, but historically it's gone okay just fitting them into the existing windows. [21:13:02] cdanis: Looks good, go ahead and continue. 
[21:14:30] !log cdanis@deploy2002 kemayo, cdanis: Backport for [[gerrit:1110814|Set Flow to read-only on phase 2a wikis (T378834)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:14:33] !log cdanis@deploy2002 kemayo, cdanis: Continuing with sync [21:15:29] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4789/co" [puppet] - 10https://gerrit.wikimedia.org/r/1110857 (https://phabricator.wikimedia.org/T383599) (owner: 10BCornwall) [21:16:40] Feel free to skip "update the interwiki cache" when you get to my entries if you don't feel comfortable doing it - the process is a bit complicated and it will happen by itself in a week or two anyway - just figured I could save some trouble since I was attending a backport window anyway [21:17:17] Pppery: thanks, I'm short on time and pinch-hitting so I'll skip that :) [21:17:30] (03CR) 10CDanis: [C:03+2] Configure new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104732 (https://phabricator.wikimedia.org/T381379) (owner: 10Pppery) [21:17:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518#10455925 (10VRiley-WMF) 05Open→03In progress [21:17:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518#10455927 (10VRiley-WMF) Rebooting Now [21:18:49] (03Merged) 10jenkins-bot: Configure new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104732 (https://phabricator.wikimedia.org/T381379) (owner: 10Pppery) [21:21:33] (03CR) 10Ssingh: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1110857 (https://phabricator.wikimedia.org/T383599) (owner: 10BCornwall) [21:22:18] Kemayo: could these maintenance script errors have anything to do with your patch? 
https://logstash.wikimedia.org/goto/063ead7eb8f71773424aa37d54c6840f [21:23:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518#10455949 (10VRiley-WMF) This has been rebooted @cmooney would you be able to check this when you have a chance? [21:23:25] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:23:39] !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1110814|Set Flow to read-only on phase 2a wikis (T378834)]] (duration: 14m 02s) [21:23:42] T378834: [Config] Set Flow to read-only at all *Phase 2a* wikis - https://phabricator.wikimedia.org/T378834 [21:23:57] !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1104732|Configure new wikis (T381379 T381080 T378463)]] [21:24:06] T381379: Post-creation work for tigwiki - https://phabricator.wikimedia.org/T381379 [21:24:06] T381080: Post-creation work for idwikivoyage - https://phabricator.wikimedia.org/T381080 [21:24:06] T378463: Post-creation work for tcywiktionary - https://phabricator.wikimedia.org/T378463 [21:24:34] cdanis: I don't see how they possibly could. [21:24:43] cool [21:25:29] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:25:36] Pppery: is your patch one that it makes sense to check on the testservers? 
[21:25:40] yes [21:26:37] Pppery: okay, your patch should ~now be live on k8s-mwdebug [21:26:41] looking [21:28:52] Well I missed one of the settings I was supposed to change but the patch doesn't break anything so it's still safe to sync [21:28:57] !log cdanis@deploy2002 cdanis, pppery: Backport for [[gerrit:1104732|Configure new wikis (T381379 T381080 T378463)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:28:57] 👍 [21:29:04] !log cdanis@deploy2002 cdanis, pppery: Continuing with sync [21:30:29] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:31:39] (03PS1) 10Pppery: Add missing parsoid settings for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110860 (https://phabricator.wikimedia.org/T381379) [21:32:38] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2069-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:34:21] (03CR) 10CDanis: [C:03+2] Add missing parsoid settings for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110860 (https://phabricator.wikimedia.org/T381379) (owner: 10Pppery) [21:35:03] (03Merged) 10jenkins-bot: Add missing parsoid settings for new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110860 (https://phabricator.wikimedia.org/T381379) (owner: 10Pppery) [21:35:05] Thanks for deploying the follow-up too! 
[21:37:21] !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1104732|Configure new wikis (T381379 T381080 T378463)]] (duration: 13m 23s) [21:37:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110860 (https://phabricator.wikimedia.org/T381379) (owner: 10Pppery) [21:37:26] T381379: Post-creation work for tigwiki - https://phabricator.wikimedia.org/T381379 [21:37:26] T381080: Post-creation work for idwikivoyage - https://phabricator.wikimedia.org/T381080 [21:37:27] T378463: Post-creation work for tcywiktionary - https://phabricator.wikimedia.org/T378463 [21:37:40] !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1110860|Add missing parsoid settings for new wikis (T381379 T381080 T378463)]] [21:37:42] (03PS1) 10Bking: cloudelastic: remove cloudelastic100[56] from conftool, add 101[12] [puppet] - 10https://gerrit.wikimedia.org/r/1110862 (https://phabricator.wikimedia.org/T378368) [21:38:07] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110862 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [21:40:41] Pppery: k8s testservers ready :) [21:41:43] Looks good [21:42:46] !log cdanis@deploy2002 cdanis, pppery: Backport for [[gerrit:1110860|Add missing parsoid settings for new wikis (T381379 T381080 T378463)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:42:49] !log cdanis@deploy2002 cdanis, pppery: Continuing with sync [21:42:52] T381379: Post-creation work for tigwiki - https://phabricator.wikimedia.org/T381379 [21:42:52] T381080: Post-creation work for idwikivoyage - https://phabricator.wikimedia.org/T381080 [21:42:53] T378463: Post-creation work for tcywiktionary - https://phabricator.wikimedia.org/T378463 [21:49:43] (03CR) 10Cwhite: [C:03+2] logstash - remove absented input [puppet] - 10https://gerrit.wikimedia.org/r/1110844 
(https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [21:50:53] !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1110860|Add missing parsoid settings for new wikis (T381379 T381080 T378463)]] (duration: 13m 12s) [21:50:59] T381379: Post-creation work for tigwiki - https://phabricator.wikimedia.org/T381379 [21:51:00] T381080: Post-creation work for idwikivoyage - https://phabricator.wikimedia.org/T381080 [21:51:00] T378463: Post-creation work for tcywiktionary - https://phabricator.wikimedia.org/T378463 [21:51:06] Thanks, and sorry for causing you so much trouble. [21:53:00] !log bking@cumin2002 conftool action : set/pooled=no; selector: service=(cloudelastic-chi-ssl|cloudelastic-psi-ssl|cloudelastic-omega-ssl|cloudelastic-chi-ssl-public|cloudelastic-psi-ssl-public|cloudelastic-omega-ssl-public),name=cloudelastic1005.eqiad.wmnet [21:53:32] !log bking@cumin2002 conftool action : set/pooled=no; selector: service=(cloudelastic-chi-ssl|cloudelastic-psi-ssl|cloudelastic-omega-ssl|cloudelastic-chi-ssl-public|cloudelastic-psi-ssl-public|cloudelastic-omega-ssl-public),name=cloudelastic1006.eqiad.wmnet [21:59:10] 06SRE, 10observability, 10Observability-Logging, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q2): ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710#10456171 (10andrea.denisse) p:05Triage→03Medium [21:59:34] (03PS1) 10Pppery: Add simplewiki to mobile-anon-talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110866 (https://phabricator.wikimedia.org/T383161) [22:00:05] Reedy, sbassett, Maryum, and manfredi: That opportune time for a Weekly Security deployment window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250113T2200). [22:02:44] (03CR) 10Jdlrobson: [C:03+1] "Patch looks good and has followed site request process! Go ahead and deploy as needed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1110866 (https://phabricator.wikimedia.org/T383161) (owner: 10Pppery) [22:07:28] (03PS1) 10CDanis: add kemayo to deployers [puppet] - 10https://gerrit.wikimedia.org/r/1110867 [22:09:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10456188 (10phaultfinder) [22:10:04] FIRING: [2x] ErrorBudgetBurn: wdqs - wdqs-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [22:21:11] (03CR) 10BCornwall: [V:03+1 C:03+2] ncredir: Reload instead of restart [puppet] - 10https://gerrit.wikimedia.org/r/1110857 (https://phabricator.wikimedia.org/T383599) (owner: 10BCornwall) [22:22:58] (03PS2) 10Bking: cloudelastic: remove cloudelastic100[56] from conftool, add 101[12] [puppet] - 10https://gerrit.wikimedia.org/r/1110862 (https://phabricator.wikimedia.org/T380937) [22:23:25] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10456229 (10phaultfinder) [22:25:29] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:26:14] !log cwhite@deploy2002 Started deploy [statsv/statsv@42a4331]: T382729 [22:26:18] T382729: statsv: track metric types handled - https://phabricator.wikimedia.org/T382729 [22:26:23] !log cwhite@deploy2002 Finished deploy [statsv/statsv@42a4331]: T382729 (duration: 00m 08s) [22:26:42] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, 10Observability-Alerting: 
Alertmanager rule for network interface errors? - https://phabricator.wikimedia.org/T335350#10456238 (10andrea.denisse) Hi @cmooney, I noticed that patch 915489 has been merged. Do you know if there’s any remaining... [22:36:30] (03PS1) 10JHathaway: postfix: increase message size limit from 10MiB to 50MiB [puppet] - 10https://gerrit.wikimedia.org/r/1110873 (https://phabricator.wikimedia.org/T383271) [22:36:46] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110873 (https://phabricator.wikimedia.org/T383271) (owner: 10JHathaway) [22:37:50] (03PS2) 10Andrea Denisse: profile::mediawiki::common: Remove obsolete DSH group check [puppet] - 10https://gerrit.wikimedia.org/r/1110872 (https://phabricator.wikimedia.org/T370527) [22:50:37] 06SRE, 10Domains, 06Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080#10456325 (10BCornwall) 05In progress→03Resolved This is all done now. Thanks all! [22:51:05] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:51:55] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:55:59] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:27:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10456415 (10phaultfinder) [23:41:17] (03PS1) 10Scott French: P:conftool: allow the parsercache section flavor [puppet] - 10https://gerrit.wikimedia.org/r/1110880 (https://phabricator.wikimedia.org/T383324) [23:42:17] (03CR) 10Scott French: "Thanks for the review, Chris!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1110880 (https://phabricator.wikimedia.org/T383324) (owner: 10Scott French) [23:46:43] 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T383638 (10phaultfinder) 03NEW [23:46:59] (03PS1) 10Btullis: airflow: Allow specific task pods to access the kube-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1110883 (https://phabricator.wikimedia.org/T383430)