[00:05:54] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:06:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P72090 and previous config saved to /var/cache/conftool/dbconfig/20250116-000630-ladsgroup.json [00:13:27] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [00:21:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T370903)', diff saved to https://phabricator.wikimedia.org/P72091 and previous config saved to /var/cache/conftool/dbconfig/20250116-002137-ladsgroup.json [00:21:42] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [00:34:17] (03PS2) 10Scott French: Add variables for incremental enrollment in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080388 (https://phabricator.wikimedia.org/T377042) [00:34:17] (03CR) 10Scott French: "Thanks for the earlier review, Timo!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080388 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [00:35:54] (03CR) 10Scott French: "@effie@wikimedia.org FYI - This is patch 1 of 2 remaining for the cookie-based traffic routing work." 
[puppet] - 10https://gerrit.wikimedia.org/r/1082581 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [00:38:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1111725 [00:38:31] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1111725 (owner: 10TrainBranchBot) [00:50:46] (03PS3) 10Scott French: mw-(web|api-ext)-next: php.version to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100555 (https://phabricator.wikimedia.org/T377040) [00:50:46] (03CR) 10Scott French: "Thanks in advance for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100555 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [00:51:26] (03PS2) 10Scott French: hieradata: switch mw-(web|api-ext)-next to 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1100556 (https://phabricator.wikimedia.org/T377040) [00:51:27] (03CR) 10Scott French: "And this is the counterpart to Id485633611b9f6bb88c932ba23e5dbb71845b6f7. Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1100556 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [00:57:11] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1111725 (owner: 10TrainBranchBot) [01:08:14] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1111727 [01:08:15] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1111727 (owner: 10TrainBranchBot) [01:15:29] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10465156 (10phaultfinder) [01:27:24] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1111727 (owner: 10TrainBranchBot) [01:40:29] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:15:29] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets 
- https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:29] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:55:54] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:28:50] (03CR) 10Ottomata: [C:03+1] Web UI actions: Increase sampling rate to 100% for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111715 (https://phabricator.wikimedia.org/T382080) (owner: 10Jdlrobson) [03:32:05] (03PS1) 10Ottomata: InitialiseSettings-labs.php - remove unused rc0.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111733 (https://phabricator.wikimedia.org/T311129) [03:41:26] (03PS3) 10Jdlrobson: Beta: Update schemas in InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111715 (https://phabricator.wikimedia.org/T382080) [03:41:39] (03CR) 10Jdlrobson: [C:04-1] "Squashed into https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1111715 per request" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111733 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [03:44:59] (03Abandoned) 10Ottomata: InitialiseSettings-labs.php - remove unused rc0.mediawiki.page_change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111733 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [04:55:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for 
deployment in the [Thursday, January 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111661 (https://phabricator.wikimedia.org/T383785) (owner: 10Anzx) [05:12:33] mutante: Can anyone else look at, https://phabricator.wikimedia.org/T383750 - seems unbreak now for MinT/CX [06:40:29] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T0700) [07:00:05] marostegui, Amir1, and arnaudb: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T0700). [07:02:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:02:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[07:06:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:07:18] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:08:08] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:08:16] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:36:52] (03PS1) 10Bartosz Dziewoński: mediawiki.widgets: Fix aborting TitleWidget request breaking it permanently [core] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1111912 (https://phabricator.wikimedia.org/T383497) [07:36:58] (03PS1) 10Bartosz Dziewoński: mediawiki.widgets: Fix aborting TitleWidget request breaking it permanently [core] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111913 (https://phabricator.wikimedia.org/T383497) [07:37:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [core] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1111912 (https://phabricator.wikimedia.org/T383497) (owner: 10Bartosz Dziewoński) [07:37:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [core] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111913 (https://phabricator.wikimedia.org/T383497) (owner: 10Bartosz Dziewoński) [07:45:12] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2354-2357].codfw.wmnet [07:47:37] !log jelto@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2354-2357].codfw.wmnet [07:49:27] (03CR) 10Jelto: [C:03+2] Rename mw235[4-7] to wikikube-worker22[28-31] [puppet] - 10https://gerrit.wikimedia.org/r/1111646 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [07:53:44] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Idle - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Idle - kubernetes-codfw, AS64602/IPv4: Idle - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BG [07:53:59] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2354 to wikikube-worker2228 [07:54:20] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [07:55:54] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [07:55:54] status [07:57:58] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2354 to wikikube-worker2228 - jelto@cumin1002" [07:58:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2354 to wikikube-worker2228 - jelto@cumin1002" [07:58:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:58:18] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2228 
[07:58:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2228 [07:59:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2354 to wikikube-worker2228 [07:59:57] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2355 to wikikube-worker2229 [08:00:05] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T0800). [08:00:05] _joe_, anzx, and MatmaRex: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:14] o/ [08:00:18] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:00:19] hi [08:00:34] <_joe_> good morning [08:01:09] <_joe_> Given I'm first in line, I can deploy my patches, but do we have a backporter? 
[08:01:42] <_joe_> well I'll start [08:02:04] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 215, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:02:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987432 (https://phabricator.wikimedia.org/T352515) (owner: 10Giuseppe Lavagetto) [08:02:42] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:02:42] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:02:42] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:02:42] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:02:46] PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:03:03] <_joe_> Amir1 / urbanecm around by chance? 
[08:03:14] (03Merged) 10jenkins-bot: Explicitly disable all local imagescaling on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/987432 (https://phabricator.wikimedia.org/T352515) (owner: 10Giuseppe Lavagetto) [08:03:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2356:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2356 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:03:46] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2355 to wikikube-worker2229 - jelto@cumin1002" [08:04:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2355 to wikikube-worker2229 - jelto@cumin1002" [08:04:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:04:06] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2229 [08:04:09] !log oblivian@deploy2002 Started scap sync-world: Backport for [[gerrit:987432|Explicitly disable all local imagescaling on k8s (T352515)]] [08:04:16] T352515: RuntimeException: firejail is enabled, but cannot be found - https://phabricator.wikimedia.org/T352515 [08:04:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2229 [08:05:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2355 to wikikube-worker2229 [08:05:29] <_joe_> MatmaRex, anzx: I will deploy my patches but I don't have time to deploy all patches today tbh - I don't see any deployers available :( [08:06:03] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2356 to wikikube-worker2230 [08:06:07] o/ [08:06:08] RECOVERY - Router interfaces on cr1-eqiad is 
OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:06:09] good morning [08:06:24] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:06:55] (03CR) 10Hashar: [C:03+2] mediawiki.widgets: Fix aborting TitleWidget request breaking it permanently [core] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1111912 (https://phabricator.wikimedia.org/T383497) (owner: 10Bartosz Dziewoński) [08:06:56] (03CR) 10Hashar: [C:03+2] mediawiki.widgets: Fix aborting TitleWidget request breaking it permanently [core] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111913 (https://phabricator.wikimedia.org/T383497) (owner: 10Bartosz Dziewoński) [08:07:19] I have +2ed both MatmaRex patches ahead of time given they will take time to be merged [08:07:29] _joe_: do you deploy right now or do you want me to do it? [08:07:43] <_joe_> hashar: I have +2 [08:07:56] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 3/6 UP : OSPFv3: 3/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:08:00] <_joe_> and global root, so the answer to "can you" is always "yes" :) [08:08:07] thanks [08:08:21] <_joe_> and I am deploying the first patch [08:08:30] yeah that was my question :b [08:08:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2357:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2357 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:08:50] I will do both anzx patches after that [08:09:07] ok [08:10:00] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2356 to wikikube-worker2230 - jelto@cumin1002" [08:10:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2356 to wikikube-worker2230 - jelto@cumin1002" [08:10:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:10:16] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2230 [08:10:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2230 [08:10:39] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:987432|Explicitly disable all local imagescaling on k8s (T352515)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:10:42] T352515: RuntimeException: firejail is enabled, but cannot be found - https://phabricator.wikimedia.org/T352515 [08:11:06] wmgExtraLanguageNames : This is (temporarily) needed due to T264295 [08:11:07] T264295: Reinstate $wgExtraLanguageCodes in production - https://phabricator.wikimedia.org/T264295 [08:11:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2356 to wikikube-worker2230 [08:12:00] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2357 to wikikube-worker2231 [08:12:02] !log oblivian@deploy2002 oblivian: Continuing with sync [08:12:22] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:13:42] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:13:56] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:14:42] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:14:42] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:14:46] RECOVERY - BFD status on cr2-drmrs is OK: 
UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:15:50] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2357 to wikikube-worker2231 - jelto@cumin1002" [08:15:54] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:16:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2357 to wikikube-worker2231 - jelto@cumin1002" [08:16:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:16:25] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2231 [08:16:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2231 [08:16:44] !log oblivian@deploy2002 Finished scap sync-world: Backport for [[gerrit:987432|Explicitly disable all local imagescaling on k8s (T352515)]] (duration: 12m 35s) [08:16:49] T352515: RuntimeException: firejail is enabled, but cannot be found - https://phabricator.wikimedia.org/T352515 [08:17:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2357 to wikikube-worker2231 [08:17:36] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2228.codfw.wmnet wikikube-worker2229.codfw.wmnet wikikube-worker2230.codfw.wmnet wikikube-worker2231.codfw.wmnet on all recursors [08:17:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2228.codfw.wmnet wikikube-worker2229.codfw.wmnet wikikube-worker2230.codfw.wmnet wikikube-worker2231.codfw.wmnet on all recursors [08:18:43] <_joe_> hashar: can I proceed with my second backport? 
[08:18:53] sure [08:19:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109108 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [08:19:47] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2228.codfw.wmnet with OS bookworm [08:19:53] (03Merged) 10jenkins-bot: ClusterConfig: add support for dumps trait [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109108 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [08:19:57] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2228 [08:20:05] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:20:21] !log oblivian@deploy2002 Started scap sync-world: Backport for [[gerrit:1109108|ClusterConfig: add support for dumps trait (T382947)]] [08:20:24] T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]) - https://phabricator.wikimedia.org/T382947 [08:23:34] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2228 - jelto@cumin1002" [08:23:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2228 - jelto@cumin1002" [08:23:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:23:38] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2228.codfw.wmnet 204.32.192.10.in-addr.arpa 4.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:23:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2228.codfw.wmnet 204.32.192.10.in-addr.arpa 
4.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:23:41] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2228 [08:23:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2228 [08:23:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2228 [08:24:35] anzx: I will do matmarex patches next since they are about to be merged [08:25:01] 06SRE, 06Infrastructure-Foundations: Console domain and property access request - https://phabricator.wikimedia.org/T381904#10465484 (10Volans) @Scott_French I was about to have a look at what we have in the console but I see you already progressed on this. From I/F point of view it's totally fine to give acce... [08:26:50] (03Merged) 10jenkins-bot: mediawiki.widgets: Fix aborting TitleWidget request breaking it permanently [core] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1111912 (https://phabricator.wikimedia.org/T383497) (owner: 10Bartosz Dziewoński) [08:26:56] (03Merged) 10jenkins-bot: mediawiki.widgets: Fix aborting TitleWidget request breaking it permanently [core] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111913 (https://phabricator.wikimedia.org/T383497) (owner: 10Bartosz Dziewoński) [08:26:56] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:1109108|ClusterConfig: add support for dumps trait (T382947)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:27:00] T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]) - https://phabricator.wikimedia.org/T382947 [08:28:13] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2229.codfw.wmnet with OS bookworm [08:28:14] hashar: sure i will wait [08:28:19] !log oblivian@deploy2002 oblivian: Continuing with sync [08:28:23] !log jelto@cumin1002 
START - Cookbook sre.hosts.move-vlan for host wikikube-worker2229 [08:28:33] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:30:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72092 and previous config saved to /var/cache/conftool/dbconfig/20250116-083051-root.json [08:31:56] (03PS1) 10Marostegui: instances.yaml: Add es1045 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1111918 (https://phabricator.wikimedia.org/T382569) [08:32:08] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2229 - jelto@cumin1002" [08:32:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2229 - jelto@cumin1002" [08:32:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:32:13] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2229.codfw.wmnet 205.32.192.10.in-addr.arpa 5.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:32:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2229.codfw.wmnet 205.32.192.10.in-addr.arpa 5.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:32:17] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2229 [08:32:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2229 [08:32:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2229 [08:32:51] !log oblivian@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109108|ClusterConfig: add support 
for dumps trait (T382947)]] (duration: 12m 30s) [08:32:55] T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9]) - https://phabricator.wikimedia.org/T382947 [08:33:21] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add es1045 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1111918 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [08:35:08] <_joe_> hashar: I'm done [08:35:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:35:31] thanks [08:35:41] MatmaRex: I am pushing your change to both wmf branches [08:35:55] sure [08:36:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es1045 to dbctl depooled T382569', diff saved to https://phabricator.wikimedia.org/P72093 and previous config saved to /var/cache/conftool/dbconfig/20250116-083559-marostegui.json [08:36:01] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2230.codfw.wmnet with OS bookworm [08:36:04] T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569 [08:36:11] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2230 [08:36:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:36:17] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1111912|mediawiki.widgets: Fix aborting TitleWidget request breaking it permanently (T383497)]], [[gerrit:1111913|mediawiki.widgets: Fix aborting TitleWidget request breaking it permanently (T383497)]] [08:36:19] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:36:21] T383497: VisualEditor "Insert link widget" sometimes does not suggest pages - https://phabricator.wikimedia.org/T383497 [08:39:44] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2230 - jelto@cumin1002" [08:39:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2230 - jelto@cumin1002" [08:39:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:39:49] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2230.codfw.wmnet 206.32.192.10.in-addr.arpa 6.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:39:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2230.codfw.wmnet 206.32.192.10.in-addr.arpa 6.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:39:53] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2230 [08:40:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2230 [08:40:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2230 [08:40:53] !log hashar@deploy2002 matmarex, hashar: Backport for [[gerrit:1111912|mediawiki.widgets: Fix aborting TitleWidget request breaking it permanently (T383497)]], [[gerrit:1111913|mediawiki.widgets: Fix aborting TitleWidget request breaking it permanently (T383497)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:41:03] (03PS1) 10Marostegui: es1045: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1111919 [08:41:27] hashar: tested, looks good [08:41:39] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2228.codfw.wmnet with reason: host reimage [08:41:44] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker2231.codfw.wmnet with OS bookworm [08:41:46] excellent [08:41:49] !log hashar@deploy2002 matmarex, hashar: Continuing with sync [08:41:54] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2231 [08:42:06] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:45:36] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2231 - jelto@cumin1002" [08:45:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2231 - jelto@cumin1002" [08:45:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:45:41] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2231.codfw.wmnet 207.32.192.10.in-addr.arpa 7.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:45:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2228.codfw.wmnet with reason: host reimage [08:45:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2231.codfw.wmnet 207.32.192.10.in-addr.arpa 7.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:45:44] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2231 [08:45:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72094 and previous config saved to /var/cache/conftool/dbconfig/20250116-084557-root.json [08:45:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2231 [08:45:59] !log jelto@cumin1002 END (PASS) - Cookbook 
sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2231 [08:46:54] (03CR) 10Marostegui: [C:03+2] es1045: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1111919 (owner: 10Marostegui) [08:47:20] 08:42:45 Logstash checker Counted 11 error(s) in the last 20 seconds. The threshold is 10. [08:47:24] [8 hits] Uncaught MediaWiki\Config\ConfigException: Failed to load configuration from etcd: lost lock in /srv/mediawiki/php-1.44.0-wmf.12/includes/config/EtcdConfig.php:231 [08:47:35] * hashar retries [08:49:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P72095 and previous config saved to /var/cache/conftool/dbconfig/20250116-084946-root.json [08:50:18] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2229.codfw.wmnet with reason: host reimage [08:51:11] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111912|mediawiki.widgets: Fix aborting TitleWidget request breaking it permanently (T383497)]], [[gerrit:1111913|mediawiki.widgets: Fix aborting TitleWidget request breaking it permanently (T383497)]] (duration: 14m 53s) [08:51:14] T383497: VisualEditor "Insert link widget" sometimes does not suggest pages - https://phabricator.wikimedia.org/T383497 [08:53:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1023 to es5 primary T382569', diff saved to https://phabricator.wikimedia.org/P72096 and previous config saved to /var/cache/conftool/dbconfig/20250116-085305-root.json [08:53:09] T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569 [08:53:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2229.codfw.wmnet with reason: host reimage [08:53:30] MatmaRex: your change is live [08:53:36] anzx: I am doing your now :) [08:53:40] thanks hashar [08:53:49] ok [08:53:58] is there anything to be verified on the test 
servers? [08:54:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111661 (https://phabricator.wikimedia.org/T383785) (owner: 10Anzx) [08:54:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1025', diff saved to https://phabricator.wikimedia.org/P72097 and previous config saved to /var/cache/conftool/dbconfig/20250116-085439-marostegui.json [08:55:09] (03Merged) 10jenkins-bot: Add dso and thq to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111661 (https://phabricator.wikimedia.org/T383785) (owner: 10Anzx) [08:55:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es1025.eqiad.wmnet with reason: cloning [08:55:38] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1111661|Add dso and thq to wmgExtraLanguageNames (T383785)]] [08:55:41] T383785: Add dso and thq to language names - https://phabricator.wikimedia.org/T383785 [08:58:29] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2230.codfw.wmnet with reason: host reimage [08:59:15] (03PS1) 10Marostegui: mariadb: Productionize es1046 [puppet] - 10https://gerrit.wikimedia.org/r/1111921 (https://phabricator.wikimedia.org/T382569) [08:59:54] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es1046 [puppet] - 10https://gerrit.wikimedia.org/r/1111921 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [09:01:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72098 and previous config saved to /var/cache/conftool/dbconfig/20250116-090102-root.json [09:01:57] !log hashar@deploy2002 anzx, hashar: Backport for [[gerrit:1111661|Add dso and thq to wmgExtraLanguageNames (T383785)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:02:00] T383785: Add 
dso and thq to language names - https://phabricator.wikimedia.org/T383785 [09:02:08] :) [09:02:12] hashar: checking [09:03:06] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2231.codfw.wmnet with reason: host reimage [09:03:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2230.codfw.wmnet with reason: host reimage [09:03:47] hashar: looks good, new language codes show up in timedtext [09:03:52] awesome [09:03:55] !log hashar@deploy2002 anzx, hashar: Continuing with sync [09:04:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P72099 and previous config saved to /var/cache/conftool/dbconfig/20250116-090451-root.json [09:05:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2228.codfw.wmnet with OS bookworm [09:06:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2231.codfw.wmnet with reason: host reimage [09:08:24] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111661|Add dso and thq to wmgExtraLanguageNames (T383785)]] (duration: 12m 46s) [09:08:28] T383785: Add dso and thq to language names - https://phabricator.wikimedia.org/T383785 [09:08:30] (03CR) 10Brouberol: [C:03+1] mw-content-history-reconcile-enrich: Add HA storageDir and Ceph egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109448 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [09:08:46] (03CR) 10Brouberol: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1111180 (owner: 10Muehlenhoff) [09:09:02] hashar: i have changed my email in https://gerrit.wikimedia.org/r/c/integration/config/+/1111920 please review [09:09:16] thank you for deploying [09:09:17] yes I am on it :) [09:12:23] anzx: done! 
:) [09:12:51] !log UTC morning backport window completed. [09:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:38] hashar: thanks again [09:14:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2229.codfw.wmnet with OS bookworm [09:14:46] (03CR) 10Marostegui: [C:03+1] dbbackups: Migrate db2139 backup generation to db2239 [puppet] - 10https://gerrit.wikimedia.org/r/1111656 (https://phabricator.wikimedia.org/T373579) (owner: 10Jcrespo) [09:16:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72100 and previous config saved to /var/cache/conftool/dbconfig/20250116-091607-root.json [09:18:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.03s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:20:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 3%: Repooling', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250116-091956-root.json [09:22:18] (03CR) 10David Caro: "Related task https://phabricator.wikimedia.org/T383817" [puppet] - 10https://gerrit.wikimedia.org/r/1111691 (owner: 10Andrew Bogott) [09:23:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.03s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:23:33] !log 
jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2230.codfw.wmnet with OS bookworm [09:25:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2231.codfw.wmnet with OS bookworm [09:26:49] !log homer 'lsw1-c6-codfw*' commit 'T377877' [09:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:52] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [09:27:46] !log homer 'cr*codfw*' commit 'T377877' [09:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10465663 (10VRiley-WMF) [09:29:07] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 88, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:29:48] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2228-2231].codfw.wmnet [09:29:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2228-2231].codfw.wmnet [09:30:28] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383862 (10Jelto) 03NEW [09:31:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72102 and previous config saved to /var/cache/conftool/dbconfig/20250116-093113-root.json [09:32:13] (03PS1) 10Hashar: Merge tag 'v3.10.4' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1111925 (https://phabricator.wikimedia.org/T383597) [09:33:58] (03PS1) 10Jelto: Rename mw235[0-3] to wikikube-worker223[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/1111926 
(https://phabricator.wikimedia.org/T377877) [09:35:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P72103 and previous config saved to /var/cache/conftool/dbconfig/20250116-093505-root.json [09:42:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10465774 (10VRiley-WMF) [09:42:53] (03PS2) 10Hashar: Merge tag 'v3.10.4' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1111925 (https://phabricator.wikimedia.org/T383597) [09:44:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc5 eqiad and codfw dbmaint T383398', diff saved to https://phabricator.wikimedia.org/P72104 and previous config saved to /var/cache/conftool/dbconfig/20250116-094439-marostegui.json [09:44:44] T383398: Reorganize and clean existing pc1-pc5 sections - https://phabricator.wikimedia.org/T383398 [09:45:09] (03CR) 10JMeybohm: [C:03+1] Rename mw235[0-3] to wikikube-worker223[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/1111926 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [09:45:52] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc[2015,2017].codfw.wmnet,pc[1015,1017].eqiad.wmnet with reason: reorganizing pc5 [09:48:52] (03CR) 10JMeybohm: [C:03+1] sre.k8s.renumber-node: change default os to bookworm [cookbooks] - 10https://gerrit.wikimedia.org/r/1111588 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [09:49:54] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10465796 (10JMeybohm) >>! In T381788#10462502, @Jhancock.wm wrote: > that will work for us. Cool. I'll make sure the server is properly shut down by 15:30Z and we can sync here... 
[09:50:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P72106 and previous config saved to /var/cache/conftool/dbconfig/20250116-095011-root.json [09:52:54] (03PS1) 10Filippo Giunchedi: Revert "thanos-store: enable caching bucket" [puppet] - 10https://gerrit.wikimedia.org/r/1111930 (https://phabricator.wikimedia.org/T383570) [09:53:50] (03CR) 10Filippo Giunchedi: [C:03+2] Revert "thanos-store: enable caching bucket" [puppet] - 10https://gerrit.wikimedia.org/r/1111930 (https://phabricator.wikimedia.org/T383570) (owner: 10Filippo Giunchedi) [09:53:58] (03CR) 10CI reject: [V:04-1] Merge tag 'v3.10.4' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1111925 (https://phabricator.wikimedia.org/T383597) (owner: 10Hashar) [09:54:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote pc2015 to pc5 codfw master dbmaint and enable pc5 back in eqiad and codfw T383398', diff saved to https://phabricator.wikimedia.org/P72107 and previous config saved to /var/cache/conftool/dbconfig/20250116-095431-marostegui.json [09:54:35] T383398: Reorganize and clean existing pc1-pc5 sections - https://phabricator.wikimedia.org/T383398 [09:56:03] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2350-2353].codfw.wmnet [09:56:06] (03PS1) 10Marostegui: site.pp: Reorganize pc5 [puppet] - 10https://gerrit.wikimedia.org/r/1111931 (https://phabricator.wikimedia.org/T383398) [09:57:07] (03CR) 10Marostegui: [C:03+2] site.pp: Reorganize pc5 [puppet] - 10https://gerrit.wikimedia.org/r/1111931 (https://phabricator.wikimedia.org/T383398) (owner: 10Marostegui) [09:58:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2350-2353].codfw.wmnet [09:58:48] (03CR) 10Jelto: [C:03+2] Rename mw235[0-3] to wikikube-worker223[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/1111926 
(https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [10:00:43] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:00:58] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2350 to wikikube-worker2232 [10:01:18] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:01:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:02:15] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [10:02:15] status [10:04:07] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [10:04:07] status [10:04:53] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2350 to wikikube-worker2232 - jelto@cumin1002" [10:05:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2350 to wikikube-worker2232 - jelto@cumin1002" [10:05:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:05:08] !log jelto@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host wikikube-worker2232 [10:05:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72109 and previous config saved to /var/cache/conftool/dbconfig/20250116-100516-root.json [10:05:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2232 [10:05:33] (03PS1) 10Isabelle Hurbain-Palatin: Remove KartographerParsoidSupport flag from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111932 (https://phabricator.wikimedia.org/T340134) [10:06:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2350 to wikikube-worker2232 [10:06:50] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2351 to wikikube-worker2233 [10:07:11] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:09:08] (03PS3) 10Hashar: Merge tag 'v3.10.4' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1111925 (https://phabricator.wikimedia.org/T383597) [10:10:41] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2351 to wikikube-worker2233 - jelto@cumin1002" [10:11:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2351 to wikikube-worker2233 - jelto@cumin1002" [10:11:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:11:05] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2233 [10:11:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2352:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2352 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:11:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2233 [10:12:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2351 to wikikube-worker2233 [10:13:22] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2352 to wikikube-worker2234 [10:13:43] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:16:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2353:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2353 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:17:16] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2352 to wikikube-worker2234 - jelto@cumin1002" [10:17:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2352 to wikikube-worker2234 - jelto@cumin1002" [10:17:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:17:34] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2234 [10:17:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2234 [10:18:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2352 to wikikube-worker2234 [10:18:53] (03CR) 10Hashar: [C:03+2] Merge tag 'v3.10.4' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1111925 (https://phabricator.wikimedia.org/T383597) (owner: 10Hashar) [10:19:17] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2353 to wikikube-worker2235 [10:19:45] 
!log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from mw2353 to wikikube-worker2235 [10:20:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72110 and previous config saved to /var/cache/conftool/dbconfig/20250116-102021-root.json [10:20:56] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2353 to wikikube-worker2235 [10:21:06] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:21:54] (03PS1) 10JMeybohm: Pin calico version on all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111935 (https://phabricator.wikimedia.org/T341984) [10:22:08] (03CR) 10Isabelle Hurbain-Palatin: [C:04-1] "let's wait until this week's (wmf.12) train is fully rolled out before we merge this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111932 (https://phabricator.wikimedia.org/T340134) (owner: 10Isabelle Hurbain-Palatin) [10:24:52] (03PS1) 10Brouberol: airflow: move the serviceAccountName directly under pod.spec [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111938 (https://phabricator.wikimedia.org/T383430) [10:25:04] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2353 to wikikube-worker2235 - jelto@cumin1002" [10:25:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2353 to wikikube-worker2235 - jelto@cumin1002" [10:25:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:25:22] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2235 [10:25:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2235 [10:25:50] (03CR) 10Btullis: [C:03+1] "Nice, thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1111938 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [10:26:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2353 to wikikube-worker2235 [10:26:34] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2232.codfw.wmnet wikikube-worker2233.codfw.wmnet wikikube-worker2234.codfw.wmnet wikikube-worker2235.codfw.wmnet on all recursors [10:26:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2232.codfw.wmnet wikikube-worker2233.codfw.wmnet wikikube-worker2234.codfw.wmnet wikikube-worker2235.codfw.wmnet on all recursors [10:26:54] (03CR) 10Brouberol: [C:03+2] airflow: move the serviceAccountName directly under pod.spec [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111938 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [10:27:25] jouncebot: now and next [10:27:25] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [10:27:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:28:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:29:21] (03Merged) 10jenkins-bot: Merge tag 'v3.10.4' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1111925 (https://phabricator.wikimedia.org/T383597) (owner: 10Hashar) [10:30:20] !log jelto@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2232.codfw.wmnet [10:30:40] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2232.codfw.wmnet with OS bullseye [10:30:50] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2232 [10:31:03] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:31:03] arnaudb: FYI i'm testing a change for T383570 and I've silenced the 
corresponding page [10:31:05] T383570: thanos query/store OOM on titan hosts - https://phabricator.wikimedia.org/T383570 [10:31:31] ack thanks for the heads up godog [10:34:29] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2232 - jelto@cumin1002" [10:34:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2232 - jelto@cumin1002" [10:34:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:34:34] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2232.codfw.wmnet 200.32.192.10.in-addr.arpa 0.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:34:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2232.codfw.wmnet 200.32.192.10.in-addr.arpa 0.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:34:37] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2232 [10:34:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2232 [10:34:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2232 [10:35:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72111 and previous config saved to /var/cache/conftool/dbconfig/20250116-103527-root.json [10:37:52] !log jelto@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2233.codfw.wmnet [10:38:15] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2233.codfw.wmnet with OS bullseye 
[10:38:26] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2233 [10:38:30] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:40:29] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:41:10] (03CR) 10Clément Goubert: [C:03+1] sre.k8s.renumber-node: change default os to bookworm [cookbooks] - 10https://gerrit.wikimedia.org/r/1111588 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [10:41:56] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2233 - jelto@cumin1002" [10:42:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2233 - jelto@cumin1002" [10:42:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:42:01] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2233.codfw.wmnet 201.32.192.10.in-addr.arpa 1.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:42:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2233.codfw.wmnet 201.32.192.10.in-addr.arpa 1.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:42:04] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2233 [10:42:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2233 [10:42:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for 
host wikikube-worker2233 [10:45:44] !log jelto@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2234.codfw.wmnet [10:46:07] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2234.codfw.wmnet with OS bullseye [10:46:18] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2234 [10:46:26] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:46:49] arnaudb: test finished [10:46:54] ack [10:49:53] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2234 - jelto@cumin1002" [10:49:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2234 - jelto@cumin1002" [10:49:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:49:58] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2234.codfw.wmnet 202.32.192.10.in-addr.arpa 2.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:50:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2234.codfw.wmnet 202.32.192.10.in-addr.arpa 2.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:50:01] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2234 [10:50:31] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2232.codfw.wmnet with reason: host reimage [10:50:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72112 and previous config saved to /var/cache/conftool/dbconfig/20250116-105032-root.json [10:50:49] !log jelto@cumin1002 END (PASS) - 
Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2234 [10:50:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2234 [10:52:45] (03PS1) 10Marostegui: mariadb: Decommission db2132 [puppet] - 10https://gerrit.wikimedia.org/r/1111941 (https://phabricator.wikimedia.org/T383697) [10:53:09] !log jelto@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2235.codfw.wmnet [10:53:21] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2235.codfw.wmnet with OS bullseye [10:53:32] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2235 [10:53:48] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:53:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2132.codfw.wmnet [10:54:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2232.codfw.wmnet with reason: host reimage [10:54:13] (03CR) 10Marostegui: [C:03+2] mariadb: Decommission db2132 [puppet] - 10https://gerrit.wikimedia.org/r/1111941 (https://phabricator.wikimedia.org/T383697) (owner: 10Marostegui) [10:57:20] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2235 - jelto@cumin1002" [10:57:46] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2233.codfw.wmnet with reason: host reimage [10:57:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2235 - jelto@cumin1002" [10:57:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:57:53] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2235.codfw.wmnet 
203.32.192.10.in-addr.arpa 3.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:57:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2235.codfw.wmnet 203.32.192.10.in-addr.arpa 3.0.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [10:57:57] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2235 [10:58:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2235 [10:58:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2235 [10:58:45] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [10:59:11] (03PS1) 10JMeybohm: Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T1100) [11:00:09] (03CR) 10CI reject: [V:04-1] Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [11:01:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:01:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2132.codfw.wmnet [11:01:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2233.codfw.wmnet with reason: host reimage [11:01:42] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2132.codfw.wmnet - https://phabricator.wikimedia.org/T383697#10465964 (10Marostegui) a:05Marostegui→03None [11:01:54] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2132.codfw.wmnet - 
https://phabricator.wikimedia.org/T383697#10465970 (10Marostegui) This is ready for #dc-ops [11:05:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72113 and previous config saved to /var/cache/conftool/dbconfig/20250116-110538-root.json [11:06:30] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2234.codfw.wmnet with reason: host reimage [11:10:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2234.codfw.wmnet with reason: host reimage [11:13:40] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2235.codfw.wmnet with reason: host reimage [11:15:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2232.codfw.wmnet with OS bullseye [11:16:10] (03CR) 10Filippo Giunchedi: "LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/1111599 (https://phabricator.wikimedia.org/T352756) (owner: 10Tiziano Fogli) [11:17:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2235.codfw.wmnet with reason: host reimage [11:17:22] !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2232.codfw.wmnet [11:19:34] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2232.codfw.wmnet with OS bookworm [11:19:34] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host wikikube-worker2232.codfw.wmnet with OS bookworm [11:19:46] (03PS1) 10Hashar: Gerrit 3.10.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1111944 (https://phabricator.wikimedia.org/T383597) [11:20:03] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - 
https://phabricator.wikimedia.org/T378825#10466024 (10cmooney) >>! In T378825#10445805, @Jhancock.wm wrote: > it is cabled up and connected to port 43 on the cloud switch Ok thanks! Yeah I see it... [11:20:22] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2232.codfw.wmnet with OS bookworm [11:20:42] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2232 [11:20:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2232 [11:20:53] (03CR) 10Jelto: [C:03+2] sre.k8s.renumber-node: change default os to bookworm [cookbooks] - 10https://gerrit.wikimedia.org/r/1111588 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [11:21:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2233.codfw.wmnet with OS bullseye [11:21:45] (03CR) 10Tiziano Fogli: [C:03+1] site: add prometheus200[78] [puppet] - 10https://gerrit.wikimedia.org/r/1111256 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [11:21:50] !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2233.codfw.wmnet [11:22:06] !log root@cumin1002 START - Cookbook sre.mysql.upgrade for db2149.codfw.wmnet [11:22:26] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2233.codfw.wmnet with OS bookworm [11:22:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2149', diff saved to https://phabricator.wikimedia.org/P72114 and previous config saved to /var/cache/conftool/dbconfig/20250116-112235-marostegui.json [11:22:46] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2233 [11:22:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2233 [11:23:56] (03PS1) 10Giuseppe Lavagetto: aptrepo: allow importing conftool from apt-staging [puppet] - 
10https://gerrit.wikimedia.org/r/1111945 [11:24:39] (03CR) 10Jelto: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111935 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [11:26:46] (03Merged) 10jenkins-bot: sre.k8s.renumber-node: change default os to bookworm [cookbooks] - 10https://gerrit.wikimedia.org/r/1111588 (https://phabricator.wikimedia.org/T341984) (owner: 10Jelto) [11:26:52] (03PS3) 10Jcrespo: dbbackups: Migrate db2139 backup generation to db2239 [puppet] - 10https://gerrit.wikimedia.org/r/1111656 (https://phabricator.wikimedia.org/T373579) [11:26:52] (03PS1) 10Jcrespo: dbbackups: Remove dbprov1001,1002,2001,2002 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/1111946 (https://phabricator.wikimedia.org/T362509) [11:26:52] (03CR) 10Kamila Součková: "Do you also need to remove the old names from preseed.yaml or are they gone already?" [puppet] - 10https://gerrit.wikimedia.org/r/1111670 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [11:27:51] (03CR) 10Jcrespo: [C:03+2] "There is a chance still we may have to revert this, if there are issues, but this can be merged now for testing." [puppet] - 10https://gerrit.wikimedia.org/r/1111656 (https://phabricator.wikimedia.org/T373579) (owner: 10Jcrespo) [11:28:21] (03PS2) 10Jcrespo: dbbackups: Remove dbprov1001,1002,2001,2002 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/1111946 (https://phabricator.wikimedia.org/T362509) [11:29:07] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2149.codfw.wmnet [11:29:18] PROBLEM - MariaDB Replica Lag: s3 #page on db2149 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 403.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:29:30] (03CR) 10Jcrespo: [C:04-1] "Actually, we cannot merge it yet due to notifications being disabled at the same time." 
[puppet] - 10https://gerrit.wikimedia.org/r/1111656 (https://phabricator.wikimedia.org/T373579) (owner: 10Jcrespo) [11:29:49] ^ marostegui expected? [11:30:05] jynus: Not really, it should've been downtimed by the cookbook [11:30:08] Anyway, it is not in production [11:30:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2234.codfw.wmnet with OS bullseye [11:30:29] !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2234.codfw.wmnet [11:30:33] it is recovering [11:30:39] ack, thx [11:30:42] !incidents [11:30:43] 5600 (UNACKED) db2149 (paged)/MariaDB Replica Lag: s3 (paged) [11:30:43] 5596 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [11:30:46] !ack 5600 [11:30:47] 5600 (ACKED) db2149 (paged)/MariaDB Replica Lag: s3 (paged) [11:31:08] did it crash? I got disconnected from the host [11:31:19] going to lunch and I'm otherwise around [11:31:20] RECOVERY - MariaDB Replica Lag: s3 #page on db2149 is OK: OK slave_sql_lag Replication lag: 0.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:31:21] no, it was a upgrade [11:31:26] The cookbook should have downtimed it [11:31:27] weird [11:31:29] I see [11:31:52] anyway, not a worry [11:32:05] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2234.codfw.wmnet with OS bookworm [11:32:11] (03PS1) 10KartikMistry: Update cxserver to 2025-01-16-103443-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111948 (https://phabricator.wikimedia.org/T383854) [11:32:24] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2234 [11:32:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2234 [11:32:31] (03PS1) 10Lucas Werkmeister (WMDE): Check known-good regex patterns directly 
[extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111949 (https://phabricator.wikimedia.org/T380751) [11:32:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111949 (https://phabricator.wikimedia.org/T380751) (owner: 10Lucas Werkmeister (WMDE)) [11:32:55] So the downtime didn't work, but it was issued: Created silence ID ff390569-05f7-4c58-b3da-cd382809500a [11:32:59] Meh [11:34:13] (03CR) 10Lucas Werkmeister (WMDE): "Optional backport – can also wait for next week’s train, but if we have enough time today, I wouldn’t mind deploying it." [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111949 (https://phabricator.wikimedia.org/T380751) (owner: 10Lucas Werkmeister (WMDE)) [11:34:33] now that godog is on call, maybe a good time to flag that the downtime workflow for alertmanager has some issues/bugs/lacks of functionality/rough edges(?), Manuel [11:34:45] OK to deploy cxserver? [11:34:48] kart_: yes [11:35:12] Thanks!
[11:35:19] that's a personal observation, and maybe obs is aware, so just mentioning it in case they are not [11:37:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2235.codfw.wmnet with OS bullseye [11:37:25] !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2235.codfw.wmnet [11:38:04] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2235.codfw.wmnet with OS bookworm [11:38:24] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2235 [11:38:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2235 [11:38:27] (03PS1) 10Kamila Součková: kubernetes: rename mw14[39-42] -> wikikube-worker11[07-10] [puppet] - 10https://gerrit.wikimedia.org/r/1111951 (https://phabricator.wikimedia.org/T365571) [11:38:57] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2232.codfw.wmnet with reason: host reimage [11:39:31] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2025-01-16-103443-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111948 (https://phabricator.wikimedia.org/T383854) (owner: 10KartikMistry) [11:39:42] (03CR) 10Effie Mouzeli: [C:03+1] Add variables for incremental enrollment in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080388 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [11:40:22] (03PS11) 10Clément Goubert: mediawiki: Add mwcron feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) [11:40:23] (03CR) 10Effie Mouzeli: [C:03+1] trafficserver: add mw-php-migration to mapping_rules [puppet] - 10https://gerrit.wikimedia.org/r/1082581 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [11:40:36] (03Merged) 10jenkins-bot: Update cxserver to 2025-01-16-103443-production 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1111948 (https://phabricator.wikimedia.org/T383854) (owner: 10KartikMistry) [11:40:40] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2233.codfw.wmnet with reason: host reimage [11:42:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2232.codfw.wmnet with reason: host reimage [11:42:30] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:42:52] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:45:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2233.codfw.wmnet with reason: host reimage [11:46:00] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:46:31] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:46:49] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:47:23] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [11:49:28] !log Updated cxserver to 2025-01-16-103443-production (T383854, T377966) [11:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:33] T383854: cxserver: SectionMapping DB timeout - https://phabricator.wikimedia.org/T383854 [11:49:34] T377966: Make cxserver Logstash logs readable and reliable - https://phabricator.wikimedia.org/T377966 [11:50:07] (03CR) 10Hnowlan: "These are just covered by the general appserver `mw[1-2]*` glob so don't need to be removed" [puppet] - 10https://gerrit.wikimedia.org/r/1111670 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [11:50:14] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2234.codfw.wmnet with reason: host reimage [11:51:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 
- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:53:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2234.codfw.wmnet with reason: host reimage [11:54:35] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru and A:cp [11:56:21] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2235.codfw.wmnet with reason: host reimage [11:57:07] (03PS1) 10Marostegui: Revert "production-m5.sql.erb: Add new grants to ipoid_rw" [puppet] - 10https://gerrit.wikimedia.org/r/1111953 [11:57:52] (03CR) 10Marostegui: "Merging this is a NOOP as the grants need to be revoked from the live DBs" [puppet] - 10https://gerrit.wikimedia.org/r/1111953 (owner: 10Marostegui) [12:01:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2232.codfw.wmnet with OS bookworm [12:02:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2235.codfw.wmnet with reason: host reimage [12:02:53] (03PS1) 10Marostegui: mariadb: Set RBR to all sanitarium masters [puppet] - 10https://gerrit.wikimedia.org/r/1111955 (https://phabricator.wikimedia.org/T383795) [12:03:13] (03CR) 10Marostegui: [C:03+2] Revert "production-m5.sql.erb: Add new grants to ipoid_rw" [puppet] - 10https://gerrit.wikimedia.org/r/1111953 (owner: 10Marostegui) [12:03:21] 10SRE-swift-storage, 10Thumbor: Image issue on ओम राऊत MrWp - https://phabricator.wikimedia.org/T383859#10466131 (10Aklapper) 05Stalled→03Open Please see my previous comment. 
[12:03:23] (03PS12) 10Clément Goubert: mediawiki: Add mwcron feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) [12:05:20] (03PS3) 10Jcrespo: dbbackups: Remove dbprov1001,1002,2001,2002 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/1111946 (https://phabricator.wikimedia.org/T362509) [12:06:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2233.codfw.wmnet with OS bookworm [12:06:22] (03CR) 10Marostegui: "This is a NOOP. The default is already RBR but if we eventually migrate everything to SBR they should still be on RBR so make it explicit." [puppet] - 10https://gerrit.wikimedia.org/r/1111955 (https://phabricator.wikimedia.org/T383795) (owner: 10Marostegui) [12:06:27] (03CR) 10Marostegui: [C:03+2] mariadb: Set RBR to all sanitarium masters [puppet] - 10https://gerrit.wikimedia.org/r/1111955 (https://phabricator.wikimedia.org/T383795) (owner: 10Marostegui) [12:06:58] (03CR) 10Jcrespo: [C:03+2] dbbackups: Remove dbprov1001,1002,2001,2002 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/1111946 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [12:09:51] !log jynus@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbprov1001.eqiad.wmnet [12:14:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2234.codfw.wmnet with OS bookworm [12:17:51] !log jynus@cumin1002 START - Cookbook sre.dns.netbox [12:18:44] (03CR) 10Vgutierrez: [C:04-1] "As mentioned on the README.md on this repo you could run the tests locally using:" [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [12:20:00] (03CR) 10Vgutierrez: [C:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [12:20:18] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy 
(exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru and A:cp [12:22:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2235.codfw.wmnet with OS bookworm [12:22:40] !log jynus@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbprov1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002" [12:23:12] !log jynus@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbprov1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002" [12:23:12] !log jynus@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:23:13] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbprov1001.eqiad.wmnet [12:23:38] !log homer 'lsw1-c6-codfw*' commit 'T377877' [12:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:42] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [12:23:43] !log jynus@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbprov1002.eqiad.wmnet [12:24:32] !log homer 'cr*codfw*' commit 'T377877' [12:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10466218 (10phaultfinder) [12:25:30] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 80, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:27:09] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2232-2235].codfw.wmnet [12:27:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2232-2235].codfw.wmnet [12:27:32] 
10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383862#10466230 (10Jelto) [12:29:32] !log jynus@cumin1002 START - Cookbook sre.dns.netbox [12:31:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc4 eqiad and codfw dbmaint T383398', diff saved to https://phabricator.wikimedia.org/P72117 and previous config saved to /var/cache/conftool/dbconfig/20250116-123129-marostegui.json [12:31:33] T383398: Reorganize and clean existing pc1-pc5 sections - https://phabricator.wikimedia.org/T383398 [12:32:06] (03PS1) 10Jelto: Rename mw233[5-8] to wikikube-worker223[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/1111965 (https://phabricator.wikimedia.org/T377877) [12:32:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc2014.codfw.wmnet,pc[1014,1016].eqiad.wmnet with reason: reorganizing pc4 [12:33:18] !log jynus@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbprov1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002" [12:36:19] (03CR) 10JMeybohm: [C:03+1] Rename mw233[5-8] to wikikube-worker223[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/1111965 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [12:36:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote pc1014 to pc4 eqiad master dbmaint and enable pc4 back in eqiad and codfw T383398', diff saved to https://phabricator.wikimedia.org/P72118 and previous config saved to /var/cache/conftool/dbconfig/20250116-123656-marostegui.json [12:37:01] T383398: Reorganize and clean existing pc1-pc5 sections - https://phabricator.wikimedia.org/T383398 [12:38:43] !log jynus@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbprov1002.eqiad.wmnet decommissioned, removing 
all IPs except the asset tag one - jynus@cumin1002" [12:38:43] !log jynus@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:38:44] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbprov1002.eqiad.wmnet [12:39:25] !log jynus@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbprov2001.codfw.wmnet [12:39:26] (03PS1) 10Marostegui: mariadb: Move pc1014 and pc2014 to pc6 [puppet] - 10https://gerrit.wikimedia.org/r/1111966 (https://phabricator.wikimedia.org/T383234) [12:39:52] (03CR) 10Jelto: [C:03+2] Rename mw233[5-8] to wikikube-worker223[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/1111965 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [12:40:05] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2335-2338].codfw.wmnet [12:41:04] (03PS1) 10Marostegui: wmnet: Update pc5-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1111967 (https://phabricator.wikimedia.org/T383398) [12:42:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2335-2338].codfw.wmnet [12:42:38] (03CR) 10Marostegui: [C:03+2] wmnet: Update pc5-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1111967 (https://phabricator.wikimedia.org/T383398) (owner: 10Marostegui) [12:42:42] !log marostegui@dns1006 START - running authdns-update [12:44:27] !log marostegui@dns1006 END - running authdns-update [12:45:34] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2335 to wikikube-worker2236 [12:45:56] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [12:45:58] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - 
kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [12:45:58] status [12:48:34] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [12:48:34] status [12:49:06] (03PS2) 10Marostegui: mariadb: Move pc1016 and pc2014 to pc6 [puppet] - 10https://gerrit.wikimedia.org/r/1111966 (https://phabricator.wikimedia.org/T383234) [12:49:29] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2335 to wikikube-worker2236 - jelto@cumin1002" [12:49:29] (03PS3) 10Marostegui: mariadb: Move pc1016 and pc2016 to pc6 [puppet] - 10https://gerrit.wikimedia.org/r/1111966 (https://phabricator.wikimedia.org/T383234) [12:50:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2335 to wikikube-worker2236 - jelto@cumin1002" [12:50:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:50:15] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2236 [12:50:30] !log jynus@cumin1002 START - Cookbook sre.dns.netbox [12:50:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2236 [12:51:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2335 to wikikube-worker2236 [12:51:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:52:50] !log jynus@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:52:51] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbprov2001.codfw.wmnet [12:53:18] (03CR) 10Ladsgroup: [C:03+1] mariadb: Move pc1016 and pc2016 to pc6 [puppet] - 10https://gerrit.wikimedia.org/r/1111966 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [12:53:26] (03CR) 10Marostegui: [C:03+2] mariadb: Move pc1016 and pc2016 to pc6 [puppet] - 10https://gerrit.wikimedia.org/r/1111966 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [12:53:41] (03CR) 10Kamila Součková: [C:03+1] wikikube: reimage 5 former jobrunner/videoscaler hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/1111670 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [12:54:07] jynus: missed your message earlier, the page came from icinga and not alertmanager FWIW [12:54:10] (03CR) 10Kamila Součková: [C:03+1] "OK, sorry!" [puppet] - 10https://gerrit.wikimedia.org/r/1111670 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [12:55:40] FIRING: KubernetesRsyslogDown: rsyslog on mw2336:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2336 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:55:54] godog: I see [12:56:14] !log jynus@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbprov2002.codfw.wmnet [12:57:39] godog: Still the downtime was sent but not processed? [12:59:22] marostegui: by icinga? 
that I'm not sure about [12:59:43] godog: It was sent by the cookbook from what I can see [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T1300) [13:00:23] ah, possible yeah the cookbook sent the downtime to icinga and it wasn't processed in time or at all [13:00:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on mw2336:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:00:47] (03CR) 10Phuedx: [C:04-1] "This enables fetching experiment configs from MPIC for all logged-in users. One moment…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111261 (https://phabricator.wikimedia.org/T381964) (owner: 10Phuedx) [13:01:10] godog: Is that something we should check? [13:02:53] (03CR) 10Hashar: "recheck" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1111944 (https://phabricator.wikimedia.org/T383597) (owner: 10Hashar) [13:03:08] marostegui: if it becomes a regular occurrence then yes definitely [13:03:16] godog: ok! thanks [13:04:05] !log jynus@cumin1002 START - Cookbook sre.dns.netbox [13:04:12] sure np! 
[13:05:48] (03CR) 10Hashar: [C:03+2] Gerrit 3.10.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1111944 (https://phabricator.wikimedia.org/T383597) (owner: 10Hashar) [13:06:25] (03Abandoned) 10Alexandros Kosiaris: Add various .wikimedia.org domains to $wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074365 (https://phabricator.wikimedia.org/T374997) (owner: 10Alexandros Kosiaris) [13:06:29] (03Merged) 10jenkins-bot: Gerrit 3.10.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1111944 (https://phabricator.wikimedia.org/T383597) (owner: 10Hashar) [13:08:43] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2336 to wikikube-worker2237 [13:08:47] !log hashar@deploy2002 Started deploy [gerrit/gerrit@5c2347d]: Gerrit to 3.10.4 - T383597 [13:08:55] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@5c2347d]: Gerrit to 3.10.4 - T383597 (duration: 00m 08s) [13:09:09] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:09:12] (03CR) 10Filippo Giunchedi: wmcs: Migrate network saturation alerts to the alerts.git repository (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [13:09:49] (03CR) 10Filippo Giunchedi: wmcs: Migrate iowait stalling alerts to the alerts.git repository (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [13:10:07] jouncebot: now and next [13:10:07] For the next 0 hour(s) and 49 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T1300) [13:10:27] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] site: add prometheus200[78] [puppet] - 10https://gerrit.wikimedia.org/r/1111256 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [13:11:35] I am going to upgrade Gerrit 
from 3.10.2 to 3.10.4 [13:12:27] !log hashar@deploy2002 Started deploy [gerrit/gerrit@5c2347d]: Gerrit to 3.10.4 - T383597 [13:12:38] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@5c2347d]: Gerrit to 3.10.4 - T383597 (duration: 00m 10s) [13:15:42] FIRING: [2x] ProbeDown: Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit1003:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:18:01] !log Upgraded Gerrit to 3.10.4 [13:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:29] FIRING: [6x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:42] RESOLVED: [4x] ProbeDown: Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:22] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2336 to wikikube-worker2237 - jelto@cumin1002" [13:23:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2336 to wikikube-worker2237 - jelto@cumin1002" [13:23:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:23:41] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2237 [13:23:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host 
wikikube-worker2237 [13:24:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2336 to wikikube-worker2237 [13:25:46] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2337 to wikikube-worker2238 [13:26:07] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:29:32] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2337 to wikikube-worker2238 - jelto@cumin1002" [13:29:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2337 to wikikube-worker2238 - jelto@cumin1002" [13:29:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:29:50] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2238 [13:30:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2238 [13:30:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2337 to wikikube-worker2238 [13:32:29] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2338 to wikikube-worker2239 [13:32:50] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:34:12] (03CR) 10Jelto: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1111951 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [13:35:49] !log jynus@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:35:49] !log jynus@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dbprov2002.codfw.wmnet [13:36:29] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2338 to wikikube-worker2239 - jelto@cumin1002" [13:36:39] !log jynus@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbprov2002.codfw.wmnet [13:36:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2338 to wikikube-worker2239 - jelto@cumin1002" [13:36:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:36:47] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2239 [13:37:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2239 [13:37:17] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:37:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2338 to wikikube-worker2239 [13:38:02] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2236.codfw.wmnet wikikube-worker2237.codfw.wmnet wikikube-worker2238.codfw.wmnet wikikube-worker2239.codfw.wmnet on all recursors [13:38:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
wikikube-worker2236.codfw.wmnet wikikube-worker2237.codfw.wmnet wikikube-worker2238.codfw.wmnet wikikube-worker2239.codfw.wmnet on all recursors [13:39:22] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:56] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1439-1442].eqiad.wmnet [13:40:01] (03CR) 10Kamila Součková: [C:03+2] kubernetes: rename mw14[39-42] -> wikikube-worker11[07-10] [puppet] - 10https://gerrit.wikimedia.org/r/1111951 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [13:40:01] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2236.codfw.wmnet with OS bookworm [13:40:12] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2236 [13:40:34] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:42:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1439-1442].eqiad.wmnet [13:42:52] (03PS1) 10Filippo Giunchedi: blackbox: require package blackbox to assemble config [puppet] - 10https://gerrit.wikimedia.org/r/1112004 [13:43:13] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1439 to wikikube-worker1107 [13:43:57] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2236 - jelto@cumin1002" [13:44:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2236 - jelto@cumin1002" [13:44:13] !log jelto@cumin1002 END 
(PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:44:13] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2236.codfw.wmnet 112.32.192.10.in-addr.arpa 2.1.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:44:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2236.codfw.wmnet 112.32.192.10.in-addr.arpa 2.1.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:44:17] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2236 [13:44:23] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [13:44:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2236 [13:44:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2236 [13:44:57] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2237.codfw.wmnet with OS bookworm [13:45:07] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2237 [13:45:13] (03PS1) 10Marostegui: site.pp: Reorganize pc6 [puppet] - 10https://gerrit.wikimedia.org/r/1112005 (https://phabricator.wikimedia.org/T383234) [13:45:53] I think we have a bit of a congestion with netbox locks [13:46:06] (03CR) 10Marostegui: [C:03+2] site.pp: Reorganize pc6 [puppet] - 10https://gerrit.wikimedia.org/r/1112005 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [13:46:59] kamila_: is yours running or is it waiting for confirmation? 
[13:47:19] jynus: running [13:47:37] me and j.elto are doing batch-reimages [13:47:43] but I can let you go first :D [13:47:58] nope, no issue, it will wait [13:48:12] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1439 to wikikube-worker1107 - kamila@cumin1002" [13:48:16] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1439 to wikikube-worker1107 - kamila@cumin1002" [13:48:16] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:48:17] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1107 [13:48:58] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:49:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1107 [13:49:36] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1440 to wikikube-worker1108 [13:49:37] (03PS1) 10Marostegui: mariadb: Add pc6 [puppet] - 10https://gerrit.wikimedia.org/r/1112007 (https://phabricator.wikimedia.org/T383234) [13:49:50] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp [13:49:54] cookbooks don't implement a fair queueing system 😔 [13:50:04] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1439 to wikikube-worker1107 [13:50:22] (03CR) 10Marostegui: "Amir, would modules/profile/manifests/mediawiki/maintenance/parsercachepurging.pp be fine to commit even if it is not fully in production " [puppet] - 10https://gerrit.wikimedia.org/r/1112007 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [13:51:16] (03CR) 10Ladsgroup: [C:03+1] mariadb: Add pc6 [puppet] - 10https://gerrit.wikimedia.org/r/1112007 
(https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [13:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:51:46] (03CR) 10Marostegui: [C:03+2] mariadb: Add pc6 [puppet] - 10https://gerrit.wikimedia.org/r/1112007 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [13:51:50] (03CR) 10Ladsgroup: [C:03+1] "As long as it's reachable from mw, it should be fine." [puppet] - 10https://gerrit.wikimedia.org/r/1112007 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [13:52:01] jynus: nope '^^ I will hold off on starting the rest of mine until you're done [13:52:27] I went from waiting for you to waiting for jelto [13:52:29] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2237 - jelto@cumin1002" [13:52:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2237 - jelto@cumin1002" [13:52:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:52:49] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2237.codfw.wmnet 113.32.192.10.in-addr.arpa 3.1.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:52:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2237.codfw.wmnet 113.32.192.10.in-addr.arpa 3.1.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:52:52] !log jelto@cumin1002 
START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2237 [13:52:54] !log jynus@cumin1002 START - Cookbook sre.dns.netbox [13:53:05] wheee finally :D [13:53:08] it went through now [13:53:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2237 [13:53:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2237 [13:53:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:53:47] (03PS1) 10Hashar: gerrit: give it more time to terminate [puppet] - 10https://gerrit.wikimedia.org/r/1112011 (https://phabricator.wikimedia.org/T323754) [13:54:04] (03PS1) 10Aklapper: Phabricator data for WMF QLS: Add CBogen as recipient [puppet] - 10https://gerrit.wikimedia.org/r/1112013 (https://phabricator.wikimedia.org/T383884) [13:55:17] (03PS2) 10Filippo Giunchedi: blackbox: require package blackbox to assemble config [puppet] - 10https://gerrit.wikimedia.org/r/1112004 [13:55:18] (03PS1) 10Filippo Giunchedi: prometheus: reverse proxy for instances belonging to the host' site too [puppet] - 10https://gerrit.wikimedia.org/r/1112014 (https://phabricator.wikimedia.org/T371087) [13:55:25] (03PS1) 10Abijeet Patro: Add z-index to `.tux-more-notices` [extensions/Translate] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1112015 (https://phabricator.wikimedia.org/T383669) [13:55:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Translate] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1112015 (https://phabricator.wikimedia.org/T383669) (owner: 10Abijeet Patro) [13:55:49] (03CR) 10CI reject: [V:04-1] 
prometheus: reverse proxy for instances belonging to the host' site too [puppet] - 10https://gerrit.wikimedia.org/r/1112014 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:56:19] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2238.codfw.wmnet with OS bookworm [13:56:29] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2238 [13:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:56:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.codfw.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:57:30] (03PS1) 10Marostegui: conftool: Add pc6 and pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1112017 (https://phabricator.wikimedia.org/T383234) [13:57:44] !log jynus@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbprov2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002" [13:57:49] !log jynus@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbprov2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002" [13:57:49] !log jynus@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:57:50] !log jynus@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts 
dbprov2002.codfw.wmnet [13:57:58] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:58:09] (03CR) 10Ladsgroup: [C:03+1] conftool: Add pc6 and pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1112017 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [13:58:13] (03PS1) 10Lucas Werkmeister (WMDE): Disable distinct-values constraint checks on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112019 (https://phabricator.wikimedia.org/T369079) [13:58:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:58:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112019 (https://phabricator.wikimedia.org/T369079) (owner: 10Lucas Werkmeister (WMDE)) [13:58:34] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4811/co" [puppet] - 10https://gerrit.wikimedia.org/r/1112014 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [13:58:44] jelto: kamila_ I am done. Please double-check that your changes go through normally [13:58:57] not touching netbox or dns or network anymore [13:59:10] (03CR) 10Elukey: [C:03+1] sre.network.peering: use CookbookInitSuccess [cookbooks] - 10https://gerrit.wikimedia.org/r/1111677 (owner: 10Volans) [13:59:17] ack, thanks jynus! [13:59:57] (03CR) 10Marostegui: [C:03+2] conftool: Add pc6 and pc7 [puppet] - 10https://gerrit.wikimedia.org/r/1112017 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! 
UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T1400) [14:00:04] subbu, abijeet, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] o/ [14:00:18] I can probably deploy in a few minutes [14:00:24] :D [14:00:37] * TheresNoTime cannae at the moment anyway ^^ [14:00:45] hello [14:00:58] (it’s the same procedure every thursday – there’s a meeting that theoretically takes place but most of the time nobody else shows up for, so I’m just waiting to see if it happens today or not) [14:01:04] hi, for information, I have upgraded Gerrit some minutes ago [14:01:11] yay [14:01:24] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2238 - jelto@cumin1002" [14:01:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2238 - jelto@cumin1002" [14:01:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:01:28] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2238.codfw.wmnet 115.32.192.10.in-addr.arpa 5.1.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:01:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2238.codfw.wmnet 115.32.192.10.in-addr.arpa 5.1.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:01:32] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2238 [14:01:44] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4812/co" [puppet] - 10https://gerrit.wikimedia.org/r/1112004 (owner: 10Filippo Giunchedi) [14:01:50] o/ [14:01:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.codfw.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:01:58] alright, I can deploy [14:02:10] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] blackbox: require package blackbox to assemble config [puppet] - 10https://gerrit.wikimedia.org/r/1112004 (owner: 10Filippo Giunchedi) [14:02:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2238 [14:02:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2238 [14:02:27] subbu: do you want to self-service your config change? [14:02:28] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2236.codfw.wmnet with reason: host reimage [14:02:43] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:02:55] PROBLEM - SSH on bast3007 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:03:55] RECOVERY - SSH on bast3007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:04:16] Lucas_WMDE, i haven't done any in a real long time .. and would have to retrain ... if you are able to, that would help. 
[14:04:16] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2239.codfw.wmnet with OS bookworm [14:04:26] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2239 [14:04:50] if you still have SSH access, it should be as simple as `scap backport 1111325` [14:04:55] but I can also do it if you prefer [14:05:02] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:05:03] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1108 [14:05:14] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:05:15] Ya, can you? :) [14:05:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool pc6 eqiad and codfw dbmaint T383234', diff saved to https://phabricator.wikimedia.org/P72119 and previous config saved to /var/cache/conftool/dbconfig/20250116-140523-marostegui.json [14:05:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111325 (https://phabricator.wikimedia.org/T378645) (owner: 10Subramanya Sastry) [14:05:28] T383234: Introduce pc6 and move one spare per dc to it - https://phabricator.wikimedia.org/T383234 [14:05:29] ok sure :) [14:05:33] ty [14:05:39] (03PS2) 10Filippo Giunchedi: prometheus: reverse proxy for instances belonging to the host' site too [puppet] - 10https://gerrit.wikimedia.org/r/1112014 (https://phabricator.wikimedia.org/T371087) [14:05:56] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/Translate] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1112015 (https://phabricator.wikimedia.org/T383669) (owner: 10Abijeet Patro) [14:06:00] (03CR) 10CI reject: [V:04-1] prometheus: reverse proxy for instances belonging to the host' site too [puppet] - 10https://gerrit.wikimedia.org/r/1112014 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) 
[14:06:15] (03Merged) 10jenkins-bot: Turn on Parsoid Read Views on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111325 (https://phabricator.wikimedia.org/T378645) (owner: 10Subramanya Sastry) [14:06:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1108 [14:06:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2236.codfw.wmnet with reason: host reimage [14:06:45] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1111325|Turn on Parsoid Read Views on test2wiki (T378645)]] [14:06:48] T378645: Roll out Parsoid readviews on test2wiki - https://phabricator.wikimedia.org/T378645 [14:07:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1440 to wikikube-worker1108 [14:07:01] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1441 to wikikube-worker1109 [14:07:37] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1442 to wikikube-worker1110 [14:08:20] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp [14:09:01] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp [14:10:04] (03PS1) 10Marostegui: wmnet: Add pc6-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1112026 (https://phabricator.wikimedia.org/T383234) [14:10:18] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:10:58] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2237.codfw.wmnet with reason: host reimage [14:11:13] (03CR) 10Ladsgroup: [C:03+1] wmnet: Add pc6-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1112026 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [14:11:33] (03CR) 10Marostegui: [C:03+2] wmnet: Add pc6-master 
CNAME [dns] - 10https://gerrit.wikimedia.org/r/1112026 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [14:11:37] !log marostegui@dns1006 START - running authdns-update [14:12:20] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1112014 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [14:13:22] !log marostegui@dns1006 END - running authdns-update [14:13:34] could someone give a +1 to https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1112019 btw? [14:13:51] the general idea got acked in https://phabricator.wikimedia.org/T369079#10463979 but I only uploaded the config change a few minutes before the window started [14:14:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2237.codfw.wmnet with reason: host reimage [14:15:17] (03PS3) 10Filippo Giunchedi: prometheus: reverse proxy for instances belonging to the host' site too [puppet] - 10https://gerrit.wikimedia.org/r/1112014 (https://phabricator.wikimedia.org/T371087) [14:15:19] (03PS57) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [14:15:27] !log jelto@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:15:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72120 and previous config saved to /var/cache/conftool/dbconfig/20250116-141530-root.json [14:15:37] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1442 to wikikube-worker1110 - kamila@cumin1002" [14:15:44] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [14:16:05] by the way… is it just me or has scap backport gotten pretty slow of late? 
[14:16:12] (I’m aware this is an infuriatingly vague description ^^) [14:16:31] (03CR) 10DCausse: [C:03+1] Disable distinct-values constraint checks on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112019 (https://phabricator.wikimedia.org/T369079) (owner: 10Lucas Werkmeister (WMDE)) [14:16:40] but e.g. sync-testservers-k8s just took just under eight minutes… for *12* servers [14:16:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1442 to wikikube-worker1110 - kamila@cumin1002" [14:16:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:16:48] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1110 [14:17:14] !log lucaswerkmeister-wmde@deploy2002 ssastry, lucaswerkmeister-wmde: Backport for [[gerrit:1111325|Turn on Parsoid Read Views on test2wiki (T378645)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:17:18] T378645: Roll out Parsoid readviews on test2wiki - https://phabricator.wikimedia.org/T378645 [14:17:25] subbu: please test on WikimediaDebug :) [14:17:30] will do [14:17:52] (03PS1) 10Marostegui: pc2016: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1112031 (https://phabricator.wikimedia.org/T383234) [14:17:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1110 [14:17:56] Special:Random shows me the “experimental feature” notice \o/ [14:18:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:18:09] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2239.codfw.wmnet 116.32.192.10.in-addr.arpa 6.1.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:18:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
wikikube-worker2239.codfw.wmnet 116.32.192.10.in-addr.arpa 6.1.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:18:12] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2239 [14:18:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2239 [14:18:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2239 [14:18:32] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1442 to wikikube-worker1110 [14:18:33] (03CR) 10Marostegui: [C:03+2] pc2016: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1112031 (https://phabricator.wikimedia.org/T383234) (owner: 10Marostegui) [14:18:42] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:20:23] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1112014 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [14:20:40] Lucas_WMDE, lgtm. [14:20:42] !log lucaswerkmeister-wmde@deploy2002 ssastry, lucaswerkmeister-wmde: Continuing with sync [14:20:51] great, thanks! 
[14:21:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:21:03] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1109 [14:22:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1109 [14:22:53] (03PS58) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [14:22:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1441 to wikikube-worker1109 [14:23:31] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1107.eqiad.wmnet wikikube-worker1108.eqiad.wmnet wikikube-worker1109.eqiad.wmnet wikikube-worker1110.eqiad.wmnet on all recursors [14:23:33] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874#10466701 (10elukey) To keep archives happy - I followed up with Valerie and these high density servers sometimes need to slide forward to allow the more internal row of hot swap... 
[14:23:34] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1107.eqiad.wmnet wikikube-worker1108.eqiad.wmnet wikikube-worker1109.eqiad.wmnet wikikube-worker1110.eqiad.wmnet on all recursors [14:25:26] (03PS59) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) [14:26:15] (03CR) 10CDobbins: "Done" [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [14:26:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2236.codfw.wmnet with OS bookworm [14:26:51] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1108.eqiad.wmnet with OS bookworm [14:26:54] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1109.eqiad.wmnet with OS bookworm [14:26:54] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1108 [14:26:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1108 [14:26:57] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1110.eqiad.wmnet with OS bookworm [14:26:57] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1109 [14:26:58] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1109 [14:27:00] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1110 [14:27:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1110 [14:27:03] 10ops-codfw, 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission dbprov1001, dbprov1002, dbprov2001, dbprov2002 - https://phabricator.wikimedia.org/T383871#10466717 (10jcrespo) a:05jcrespo→03None This is done from my side. 
Let me know @Papaul if you prefer one task per dc. [14:27:07] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111325|Turn on Parsoid Read Views on test2wiki (T378645)]] (duration: 20m 22s) [14:27:15] T378645: Roll out Parsoid readviews on test2wiki - https://phabricator.wikimedia.org/T378645 [14:27:19] looks like it is done. [14:27:28] 10ops-codfw, 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission dbprov1001, dbprov1002, dbprov2001, dbprov2002 - https://phabricator.wikimedia.org/T383871#10466725 (10jcrespo) [14:27:48] (03Merged) 10jenkins-bot: Add z-index to `.tux-more-notices` [extensions/Translate] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1112015 (https://phabricator.wikimedia.org/T383669) (owner: 10Abijeet Patro) [14:28:05] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:28:23] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:28:25] Lucas_WMDE, thanks. 
[14:28:26] subbu: yup, should be done :) [14:28:27] np :) [14:28:38] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp [14:29:08] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1112015|Add z-index to `.tux-more-notices` (T383669)]] [14:29:11] T383669: On Special:Translate, the "more" button in messages that have multiple issues doesn't work - https://phabricator.wikimedia.org/T383669 [14:29:58] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1107.eqiad.wmnet on all recursors [14:30:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1107.eqiad.wmnet on all recursors [14:30:11] (03PS1) 10Marostegui: mariadb: Declare RBR in sanitariums. [puppet] - 10https://gerrit.wikimedia.org/r/1112040 (https://phabricator.wikimedia.org/T383795) [14:30:44] (03CR) 10Marostegui: "As it is now, it is a NOOP, as by default all these hosts already run RBR" [puppet] - 10https://gerrit.wikimedia.org/r/1112040 (https://phabricator.wikimedia.org/T383795) (owner: 10Marostegui) [14:30:47] (03CR) 10Marostegui: [C:03+2] mariadb: Declare RBR in sanitariums. 
[puppet] - 10https://gerrit.wikimedia.org/r/1112040 (https://phabricator.wikimedia.org/T383795) (owner: 10Marostegui) [14:33:01] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:33:33] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:33:52] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1108.eqiad.wmnet with OS bookworm [14:33:59] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1109.eqiad.wmnet with OS bookworm [14:34:02] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1110.eqiad.wmnet with OS bookworm [14:34:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72121 and previous config saved to /var/cache/conftool/dbconfig/20250116-143435-root.json [14:34:36] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1108.eqiad.wmnet with OS bookworm [14:34:39] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1108 [14:34:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1108 [14:35:07] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:35:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2237.codfw.wmnet 
with OS bookworm [14:35:21] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:35:59] !log lucaswerkmeister-wmde@deploy2002 abi, lucaswerkmeister-wmde: Backport for [[gerrit:1112015|Add z-index to `.tux-more-notices` (T383669)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:36:03] T383669: On Special:Translate, the "more" button in messages that have multiple issues doesn't work - https://phabricator.wikimedia.org/T383669 [14:36:07] abijeet: can you test the change on WikimediaDebug? [14:36:07] (03CR) 10FNegri: wmcs: Migrate iowait stalling alerts to the alerts.git repository (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [14:36:15] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2239.codfw.wmnet with reason: host reimage [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:53] (03CR) 10Herron: [C:03+1] prometheus: reverse proxy for instances belonging to the host' site too [puppet] - 10https://gerrit.wikimedia.org/r/1112014 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [14:38:01] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:38:33] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:38:49] (03CR) 10Volans: [C:03+2] 
sre.network.peering: use CookbookInitSuccess [cookbooks] - 10https://gerrit.wikimedia.org/r/1111677 (owner: 10Volans) [14:39:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2239.codfw.wmnet with reason: host reimage [14:40:42] (03CR) 10Vgutierrez: [C:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [14:41:01] (03CR) 10FNegri: wmcs: Migrate network saturation alerts to the alerts.git repository (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [14:41:09] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1109.eqiad.wmnet with OS bookworm [14:41:12] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1109 [14:41:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1109 [14:41:26] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1110.eqiad.wmnet with OS bookworm [14:41:30] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1110 [14:41:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1110 [14:41:46] unfortunately I doubt I can test it myself, as Tacsipacsi didn’t mention which messages are affected :/ [14:42:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:42:26] (03PS2) 10Phuedx: Enable MetricsPlatform extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111261 (https://phabricator.wikimedia.org/T381964) [14:42:35] (03CR) 10Phuedx: 
Enable MetricsPlatform extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111261 (https://phabricator.wikimedia.org/T381964) (owner: 10Phuedx) [14:43:50] abijeet: ping? [14:45:10] !log jelto@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2238.codfw.wmnet with OS bookworm [14:45:36] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2238.codfw.wmnet with OS bookworm [14:45:39] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2238 [14:45:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2238 [14:45:59] (03Merged) 10jenkins-bot: sre.network.peering: use CookbookInitSuccess [cookbooks] - 10https://gerrit.wikimedia.org/r/1111677 (owner: 10Volans) [14:46:13] I guess I’ll just roll it out now… [14:46:16] !log lucaswerkmeister-wmde@deploy2002 abi, lucaswerkmeister-wmde: Continuing with sync [14:46:27] it’s a CSS change, it can’t break the wikis too badly [14:47:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [14:48:57] FIRING: [6x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:26] last famous words? :) [14:49:32] mmhh checking [14:49:41] what's up (or down?) 
[14:49:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72122 and previous config saved to /var/cache/conftool/dbconfig/20250116-144940-root.json [14:49:52] !incidents [14:49:53] 5601 (ACKED) [6x] ProbeDown sre (probes/service) [14:49:53] 5600 (RESOLVED) db2149 (paged)/MariaDB Replica Lag: s3 (paged) [14:49:53] 5596 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [14:49:55] could it be expected due to renames or is it a real issue? [14:50:06] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1108.eqiad.wmnet with reason: host reimage [14:50:31] jynus: I would be surprised [14:50:33] I'm reimaging some wikikube nodes in codfw, they should be all depooled. I can check [14:50:42] the last 200 reimages did not cause such issues [14:50:58] I have no idea what to do about alerts like this [14:51:05] I’m letting the scap continue for now, let me know if I should Ctrl+C [14:51:06] thanks, just trying to dismiss options [14:51:08] Lucas_WMDE: I think this is a cachebust [14:51:12] (it’s done with the k8s part now) [14:51:20] you can continue [14:51:23] 14:33 <+icinga-wm> PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff [14:51:25] akc [14:51:26] ^ related? [14:51:27] * ack [14:51:34] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112015|Add z-index to `.tux-more-notices` (T383669)]] (duration: 22m 26s) [14:51:38] T383669: On Special:Translate, the "more" button in messages that have multiple issues doesn't work - https://phabricator.wikimedia.org/T383669 [14:51:49] should I continue with more deploys or wait a bit? 
[14:51:50] hmmm those resolved themselves though [14:51:57] * Lucas_WMDE waits for now [14:52:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web (k8s) 2.615s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:53:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1108.eqiad.wmnet with reason: host reimage [14:53:57] RESOLVED: [6x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:54:37] Lucas_WMDE, sorry, back now [14:54:40] Can I check [14:54:48] abijeet: it’s deployed everywhere now [14:54:57] so if you want you can test it without WikimediaDebug ^^ [14:55:39] I'm sorry, my dog was making a ruckus about someone at the door. 
[14:56:07] ah, fun :/ [14:56:07] (03CR) 10CDanis: [C:03+1] aptrepo: allow importing conftool from apt-staging [puppet] - 10https://gerrit.wikimedia.org/r/1111945 (owner: 10Giuseppe Lavagetto) [14:56:11] I hope they calmed down now [14:57:01] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1109.eqiad.wmnet with reason: host reimage [14:57:10] Yup yup, I'm testing the patch [14:57:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web (k8s) 2.615s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:57:20] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1110.eqiad.wmnet with reason: host reimage [14:58:00] (03PS1) 10Ladsgroup: dbconfig: Order json output entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112047 [14:59:02] jouncebot: next [14:59:02] In 1 hour(s) and 0 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T1600) [14:59:20] (03CR) 10Tacsipacsi: "Q999999999 may at one point start to exist (we’ve reached this magnitude, with Q100000001, over four years ago, so it’s not even unimagina" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112019 (https://phabricator.wikimedia.org/T369079) (owner: 10Lucas Werkmeister (WMDE)) [14:59:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2239.codfw.wmnet with OS bookworm [14:59:54] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission dbprov1001, dbprov1002, dbprov2001, dbprov2002 - https://phabricator.wikimedia.org/T383871#10466871 (10Papaul) @jcrespo thank you yes please one task per dc. 
[14:59:54] !incidents [14:59:55] 5601 (RESOLVED) [6x] ProbeDown sre (probes/service) [14:59:55] 5600 (RESOLVED) db2149 (paged)/MariaDB Replica Lag: s3 (paged) [14:59:55] 5596 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [15:00:47] Lucas_WMDE, looks OK. Thanks! [15:00:56] ok, thanks for checking! [15:02:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1109.eqiad.wmnet with reason: host reimage [15:03:05] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2238.codfw.wmnet with reason: host reimage [15:04:38] I’ll deploy my other two changes (separately), shout if I should interrupt [15:04:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72123 and previous config saved to /var/cache/conftool/dbconfig/20250116-150446-root.json [15:05:11] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission dbprov2001, dbprov2002 - https://phabricator.wikimedia.org/T383894 (10jcrespo) 03NEW [15:05:46] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1110.eqiad.wmnet with reason: host reimage [15:05:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission dbprov1001, dbprov1002 - https://phabricator.wikimedia.org/T383871#10466900 (10jcrespo) [15:06:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission dbprov1001, dbprov1002 - https://phabricator.wikimedia.org/T383871#10466904 (10jcrespo) >>! In T383871#10466868, @Papaul wrote: > @jcrespo thank you yes please one task per dc. Done. Split on this and T383894. [15:06:19] (03CR) 10Lucas Werkmeister (WMDE): "I would rather not put an ID here that’s not a valid `ItemId`. 
We can add a few more digits to the number later if we want (though `Int32E" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112019 (https://phabricator.wikimedia.org/T369079) (owner: 10Lucas Werkmeister (WMDE)) [15:06:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112019 (https://phabricator.wikimedia.org/T369079) (owner: 10Lucas Werkmeister (WMDE)) [15:06:33] (03CR) 10Ladsgroup: "I can't test this in mwdebug (the request never gets routed it seems) but locally works fine." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112047 (owner: 10Ladsgroup) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:02] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111949 (https://phabricator.wikimedia.org/T380751) (owner: 10Lucas Werkmeister (WMDE)) [15:07:16] Lucas_WMDE: Hi, let me know when you're done. Thanks! [15:07:23] (03Merged) 10jenkins-bot: Disable distinct-values constraint checks on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112019 (https://phabricator.wikimedia.org/T369079) (owner: 10Lucas Werkmeister (WMDE)) [15:07:32] Amir1: how long do you need? 
I could also take a break between the two changes [15:07:50] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1112019|Disable distinct-values constraint checks on Commons (T369079)]] [15:07:54] T369079: Update `UniqueValueChecker` to query a list of endpoints - https://phabricator.wikimedia.org/T369079 [15:08:46] it's a mw config, so it should be fast [15:09:02] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission dbprov2001, dbprov2002 - https://phabricator.wikimedia.org/T383894#10466918 (10jcrespo) [15:09:29] ok [15:09:29] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10466920 (10Jelto) [15:09:36] let’s see if my WBQC backport merges in time or not ^^ [15:09:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2238.codfw.wmnet with reason: host reimage [15:09:55] oh, it already has a failing build -.- [15:10:44] (03CR) 10CI reject: [V:04-1] Check known-good regex patterns directly [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111949 (https://phabricator.wikimedia.org/T380751) (owner: 10Lucas Werkmeister (WMDE)) [15:10:58] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "try again, clone from phabricator broke for no apparent reason" [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111949 (https://phabricator.wikimedia.org/T380751) (owner: 10Lucas Werkmeister (WMDE)) [15:13:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1108.eqiad.wmnet with OS bookworm [15:13:33] I doubt there’s anything I can test for my config change so I’ll just sync it directly [15:13:36] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1112019|Disable distinct-values constraint 
checks on Commons (T369079)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:13:39] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [15:13:41] T369079: Update `UniqueValueChecker` to query a list of endpoints - https://phabricator.wikimedia.org/T369079 [15:14:27] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1107.eqiad.wmnet with OS bookworm [15:14:31] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1107 [15:14:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1107 [15:18:42] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112019|Disable distinct-values constraint checks on Commons (T369079)]] (duration: 10m 51s) [15:18:46] T369079: Update `UniqueValueChecker` to query a list of endpoints - https://phabricator.wikimedia.org/T369079 [15:19:05] Amir1: apparently my backport needs 4 more minutes so feel free to go ahead now [15:19:16] nah, I'm good [15:19:18] don't worry [15:19:23] ok ^^ [15:19:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72124 and previous config saved to /var/cache/conftool/dbconfig/20250116-151950-root.json [15:20:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111949 (https://phabricator.wikimedia.org/T380751) (owner: 10Lucas Werkmeister (WMDE)) [15:20:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1109.eqiad.wmnet with OS bookworm [15:21:37] (03PS11) 10Tiziano Fogli: thanos-rule: manage retention setting [puppet] - 10https://gerrit.wikimedia.org/r/1111599 (https://phabricator.wikimedia.org/T352756) [15:21:53] PROBLEM - 
PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1103.eqiad.wmnet, wikikube-worker1050.eqiad.wmnet, wikikube-worker1274.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1320.eqiad.wmnet, wikikube-worker1252.eqiad.wm [15:21:53] ikube-worker1315.eqiad.wmnet, parse1009.eqiad.wmnet, wikikube-worker1247.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, mw1488.eqiad.wmnet, parse1010.eqiad.wmnet, wikikube-worker1287.eqiad.wmnet, parse1005.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, wikikube-worker1015.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube-worker1059.eqiad.wmnet, wikikube-worker1278.eqiad. [15:21:53] ikikube-worker1106.eqiad.wmnet, wikikube-worker1299.eqiad.wmnet, wikikube-worker1257.eqiad.wmnet, mw1466.eqiad.wmnet, mw1483.eqiad.wmnet, mw1486.eqiad.wmnet, wikikube-worker1062.eqiad.w https://wikitech.wikimedia.org/wiki/PyBal [15:22:03] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, parse1013.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1101.eqiad.wmnet, wikikube-worker1050.eqiad.wmnet, parse [15:22:03] ad.wmnet, mw1470.eqiad.wmnet, mw1484.eqiad.wmnet, wikikube-worker1016.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, mw1488.eqiad.wmnet, wikikube-worker1313.eqiad.wmnet, parse1010.eqiad.wmnet, wikikube-worker1056.eqiad.wmnet, parse1005.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1278.eqiad.wmnet, wikikube-worker1106.eqiad.wmnet, mw1465.eqiad.wmnet, 
wikikube-worker1261.eqiad.wmnet, mw1466.eq [15:22:03] t, wikikube-worker1098.eqiad.wmnet, mw1469.eqiad.wmnet, wikikube-worker1102.eqiad.wmnet, mw1486.eqiad.wmnet, wikikube-worker1309.eqiad.wmnet, wikikube-worker1062.eqiad.wmnet, wikikube-w https://wikitech.wikimedia.org/wiki/PyBal [15:24:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1110.eqiad.wmnet with OS bookworm [15:27:13] PROBLEM - Check if active EventStreams endpoint is delivering messages. on alert1002 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [15:27:18] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111599 (https://phabricator.wikimedia.org/T352756) (owner: 10Tiziano Fogli) [15:29:12] (03Merged) 10jenkins-bot: Check known-good regex patterns directly [extensions/WikibaseQualityConstraints] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111949 (https://phabricator.wikimedia.org/T380751) (owner: 10Lucas Werkmeister (WMDE)) [15:29:24] 06SRE: Add x-analytics nocookie=1 and x-tls-sess to webrequest-sampled-live stream - https://phabricator.wikimedia.org/T383900 (10fgiunchedi) 03NEW [15:29:43] (03PS1) 10Brouberol: airflow: deploy hive config under both hadoop and spark config dirs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112052 [15:29:43] (03PS1) 10Brouberol: airflow: refactor/DRY the volume/volumeMounts accross containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) [15:29:45] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1111949|Check known-good regex patterns directly (T380751)]] [15:29:49] T380751: [SW] Update format constraint regex checks to stop errors from shellbox-constraints in the logs - 
https://phabricator.wikimedia.org/T380751 [15:29:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2238.codfw.wmnet with OS bookworm [15:30:05] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1107.eqiad.wmnet with reason: host reimage [15:30:38] (03PS1) 10Lucas Werkmeister (WMDE): Increase nonexistent item ID for Commons constraint checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112054 [15:30:45] (03PS2) 10Brouberol: airflow: refactor/DRY the volume/volumeMounts accross containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) [15:31:07] (03CR) 10Lucas Werkmeister (WMDE): "⇒ I2df4b711d5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112019 (https://phabricator.wikimedia.org/T369079) (owner: 10Lucas Werkmeister (WMDE)) [15:31:13] !log homer 'lsw1-c3-codfw*' commit 'T377877' [15:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:17] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [15:32:03] (03CR) 10Tiziano Fogli: [C:03+2] "Thanks for the review. I'm merging it now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1111599 (https://phabricator.wikimedia.org/T352756) (owner: 10Tiziano Fogli) [15:32:07] !log homer 'cr*codfw*' commit 'T377877' [15:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:11] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 72, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:33:43] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2236-2239].codfw.wmnet [15:33:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2236-2239].codfw.wmnet [15:33:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1107.eqiad.wmnet with reason: host reimage [15:34:12] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383862#10467129 (10Jelto) [15:36:11] tappof: \o/ [15:36:21] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1111949|Check known-good regex patterns directly (T380751)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:36:25] T380751: [SW] Update format constraint regex checks to stop errors from shellbox-constraints in the logs - https://phabricator.wikimedia.org/T380751 [15:36:36] I’ll quickly test that format constraints are still working at all [15:37:07] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [15:37:11] yup, looks good [15:38:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventstreams.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=eventstreams.discovery.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:39:08] !incidents [15:39:09] 5602 (UNACKED) ATSBackendErrorsHigh cache_text sre (eventstreams.discovery.wmnet eqiad) [15:39:09] 5601 (RESOLVED) [6x] ProbeDown sre (probes/service) [15:39:09] 5600 (RESOLVED) db2149 (paged)/MariaDB Replica Lag: s3 (paged) [15:39:09] 5596 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [15:39:12] !ack 5602 [15:39:12] 5602 (ACKED) ATSBackendErrorsHigh cache_text sre (eventstreams.discovery.wmnet eqiad) [15:39:24] oncall's spicy today [15:39:39] (03PS1) 10Jelto: Rename the remaining mw nodes to wikikube-worker224[0-2] 🥳 [puppet] - 10https://gerrit.wikimedia.org/r/1112055 (https://phabricator.wikimedia.org/T377877) [15:41:38] I'm looking for anything obviously wrong with eventstreams btw, no smoking gun yet [15:42:17] (03PS3) 10Brouberol: airflow: refactor/DRY the volume/volumeMounts accross containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) [15:42:39] Lucas_WMDE: any chance eventstreams is impacted by the latest syncs? 
[15:42:46] I wouldn’t think so [15:43:02] ok thank you [15:43:24] (03PS1) 10Arnaudb: peopleweb: request timeout to allow downloading larger files [puppet] - 10https://gerrit.wikimedia.org/r/1112056 (https://phabricator.wikimedia.org/T383750) [15:43:24] (03CR) 10Arnaudb: "there is other timeouts to bump if this one doesn't fix the situation" [puppet] - 10https://gerrit.wikimedia.org/r/1112056 (https://phabricator.wikimedia.org/T383750) (owner: 10Arnaudb) [15:44:18] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111949|Check known-good regex patterns directly (T380751)]] (duration: 14m 32s) [15:44:21] T380751: [SW] Update format constraint regex checks to stop errors from shellbox-constraints in the logs - https://phabricator.wikimedia.org/T380751 [15:44:29] Amir1: I’m done (but godog is looking into something) [15:44:48] thank you [15:44:58] to the deployment and beyond [15:45:03] (03CR) 10Dzahn: [C:03+2] Phabricator data for WMF QLS: Add CBogen as recipient [puppet] - 10https://gerrit.wikimedia.org/r/1112013 (https://phabricator.wikimedia.org/T383884) (owner: 10Aklapper) [15:45:08] (03CR) 10Ladsgroup: [C:03+2] dbconfig: Order json output entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112047 (owner: 10Ladsgroup) [15:45:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112047 (owner: 10Ladsgroup) [15:46:10] (03Merged) 10jenkins-bot: dbconfig: Order json output entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112047 (owner: 10Ladsgroup) [15:46:40] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1112047|dbconfig: Order json output entries]] [15:47:00] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:48:02] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 
is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:48:08] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:48:34] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:48:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventstreams.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=eventstreams.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:48:53] (03PS1) 10Brouberol: airflow: define a K8sPodOperator pod template for pods needing access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) [15:51:14] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: sync [15:51:38] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: sync [15:52:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1107.eqiad.wmnet with OS bookworm [15:53:02] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1112047|dbconfig: Order json output entries]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:53:39] (03PS2) 10Brouberol: airflow: deploy hive config under both hadoop and spark config dirs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112052 (https://phabricator.wikimedia.org/T383430) [15:53:40] (03PS4) 10Brouberol: airflow: refactor/DRY the 
volume/volumeMounts accross containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) [15:53:40] (03PS2) 10Brouberol: airflow: define a K8sPodOperator pod template for pods needing access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) [15:54:21] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1107-1110].eqiad.wmnet [15:54:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1107-1110].eqiad.wmnet [15:54:43] (03PS2) 10JMeybohm: Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) [15:54:44] (03PS1) 10JMeybohm: Update calico to v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) [15:55:06] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10467190 (10kamila) [15:55:40] (03CR) 10CI reject: [V:04-1] Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:55:43] (03CR) 10CI reject: [V:04-1] Update calico to v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [15:57:16] RECOVERY - Check if active EventStreams endpoint is delivering messages. on alert1002 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [15:58:41] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [15:58:48] (03CR) 10JMeybohm: [C:03+1] "Whohoo! 
🌷" [puppet] - 10https://gerrit.wikimedia.org/r/1112055 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [15:59:23] (03PS1) 10JMeybohm: Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) [15:59:36] 10ops-codfw, 10ops-eqiad, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903 (10elukey) 03NEW [16:00:04] brennen and dduvall: May I have your attention please! Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T1600) [16:00:22] (03CR) 10CI reject: [V:04-1] Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [16:04:38] (03CR) 10Jelto: [C:03+1] "As an intermediate fix we could try that (although I'm not sure if the timeout comes from varnish or envoy without debugging it myself). 
L" [puppet] - 10https://gerrit.wikimedia.org/r/1112056 (https://phabricator.wikimedia.org/T383750) (owner: 10Arnaudb) [16:05:32] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112047|dbconfig: Order json output entries]] (duration: 18m 52s) [16:07:08] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10467256 (10MatthewVernon) [16:07:30] (03PS1) 10Kamila Součková: wikikube: rename mw14[48-49,60-63] -> wikikube-worker101[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/1112060 (https://phabricator.wikimedia.org/T365571) [16:08:55] (03CR) 10Arnaudb: [C:03+2] peopleweb: request timeout to allow downloading larger files [puppet] - 10https://gerrit.wikimedia.org/r/1112056 (https://phabricator.wikimedia.org/T383750) (owner: 10Arnaudb) [16:13:18] (03PS1) 10Hnowlan: mobileapps: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112061 [16:14:27] (03CR) 10Btullis: [C:03+1] airflow: deploy hive config under both hadoop and spark config dirs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112052 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [16:14:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1273:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1273 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:14:44] (03CR) 10Clément Goubert: [C:03+1] mobileapps: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112061 (owner: 10Hnowlan) [16:15:16] (03CR) 10Hnowlan: [C:03+2] mobileapps: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112061 (owner: 10Hnowlan) [16:16:33] (03Merged) 10jenkins-bot: mobileapps: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112061 
(owner: 10Hnowlan) [16:17:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111630 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [16:19:01] 10ops-codfw, 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10467296 (10MatthewVernon) FWIW, from my perspective is that we do need to be able to hot-swap these drives; if that turns out to mean we need... [16:19:13] (03CR) 10JMeybohm: [C:03+1] wikikube: rename mw14[48-49,60-63] -> wikikube-worker101[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/1112060 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [16:19:22] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [16:19:48] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [16:21:10] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [16:21:39] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [16:21:47] (03PS3) 10JMeybohm: Update calico-crds to calico v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111943 (https://phabricator.wikimedia.org/T341984) [16:21:47] (03PS2) 10JMeybohm: Update calico to v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) [16:21:47] (03PS2) 10JMeybohm: Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) [16:22:00] (03CR) 10Kamila Součková: [C:03+2] wikikube: rename mw14[48-49,60-63] -> wikikube-worker101[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/1112060 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila 
Součková) [16:22:13] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1448-1449,1460-1463].eqiad.wmnet [16:22:48] (03CR) 10CI reject: [V:04-1] Update calico to v3.29.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112058 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [16:22:53] (03CR) 10CI reject: [V:04-1] Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [16:24:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1273:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1273 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:24:54] (03PS1) 10Hnowlan: eventstreams: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112062 [16:25:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1448-1449,1460-1463].eqiad.wmnet [16:25:42] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:26:49] (03CR) 10Clément Goubert: [C:03+1] eventstreams: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112062 (owner: 10Hnowlan) [16:26:54] (03CR) 10Scott French: [C:03+1] eventstreams: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112062 (owner: 10Hnowlan) [16:27:23] (03CR) 10Hnowlan: [C:03+2] eventstreams: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112062 (owner: 10Hnowlan) [16:28:30] (03Merged) 10jenkins-bot: eventstreams: bump memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112062 (owner: 10Hnowlan) [16:28:54] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1448 to 
wikikube-worker1111 [16:29:13] (03CR) 10Hashar: [C:03+1] "I have verified on the deployment server that the IP addresses are present:" [puppet] - 10https://gerrit.wikimedia.org/r/1111163 (https://phabricator.wikimedia.org/T303828) (owner: 10Hashar) [16:29:14] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:30:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:33:08] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1449 to wikikube-worker1112 [16:33:27] (03PS1) 10Volans: Resolve dependency issues related to Sphinx [software/cumin] - 10https://gerrit.wikimedia.org/r/1112065 [16:33:27] (03PS1) 10Volans: Add support for Python 3.13 [software/cumin] - 10https://gerrit.wikimedia.org/r/1112066 [16:34:05] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1448 to wikikube-worker1111 - kamila@cumin1002" [16:34:18] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:34:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1448 to wikikube-worker1111 - kamila@cumin1002" [16:34:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:34:26] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1111 [16:35:30] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [16:35:38] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1111 [16:35:54] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [16:36:10] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: 
apply [16:36:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1448 to wikikube-worker1111 [16:36:27] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1460 to wikikube-worker1113 [16:37:06] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [16:37:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [16:37:47] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1449 to wikikube-worker1112 - kamila@cumin1002" [16:37:52] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [16:37:52] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [16:38:09] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.netbox.update-extras (exit_code=1) rolling restart_daemons on A:netbox [16:38:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1449 to wikikube-worker1112 - kamila@cumin1002" [16:38:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:38:11] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1112 [16:38:32] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1461 to wikikube-worker1114 [16:38:34] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:39:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1112 [16:39:53] !log manually restarting netbox service on netbox1003 to update interface validator [16:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:58] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename 
(exit_code=0) from mw1449 to wikikube-worker1112 [16:42:46] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1460 to wikikube-worker1113 - kamila@cumin1002" [16:43:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1460 to wikikube-worker1113 - kamila@cumin1002" [16:43:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:43:13] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1113 [16:43:23] (03CR) 10Elukey: [C:03+1] Resolve dependency issues related to Sphinx [software/cumin] - 10https://gerrit.wikimedia.org/r/1112065 (owner: 10Volans) [16:43:29] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:43:43] (03CR) 10Elukey: [C:03+1] Add support for Python 3.13 [software/cumin] - 10https://gerrit.wikimedia.org/r/1112066 (owner: 10Volans) [16:43:59] (03PS1) 10Urbanecm: [beta] CommunityConfiguration: Release on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112068 (https://phabricator.wikimedia.org/T383911) [16:44:03] jouncebot: nowandnext [16:44:03] For the next 0 hour(s) and 15 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T1600) [16:44:03] In 0 hour(s) and 15 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T1700) [16:44:13] (03CR) 10Urbanecm: [C:03+2] "beta only, no-op for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112068 (https://phabricator.wikimedia.org/T383911) (owner: 10Urbanecm) [16:44:20] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1113 [16:44:57] (03Merged) 10jenkins-bot: [beta] CommunityConfiguration: Release on all wikis 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112068 (https://phabricator.wikimedia.org/T383911) (owner: 10Urbanecm) [16:44:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1460 to wikikube-worker1113 [16:45:21] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1462 to wikikube-worker1115 [16:47:02] 06SRE, 06Traffic: Define an event stream and schema for analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10467480 (10Ottomata) [16:47:18] 06SRE, 06Traffic: Define an event stream and schema for haproxy_requestctl analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10467481 (10Ottomata) [16:47:20] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1461 to wikikube-worker1114 - kamila@cumin1002" [16:47:34] (03CR) 10Brouberol: "Reminder to merge https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1029 right after having deployed this " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [16:47:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1461 to wikikube-worker1114 - kamila@cumin1002" [16:47:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:47:41] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1114 [16:47:52] (03PS7) 10Cathal Mooney: Fr-tech provision script to assign IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) [16:48:03] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:48:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host wikikube-worker1114 [16:49:14] (03PS8) 10Cathal Mooney: Fr-tech provision script to assign IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) [16:49:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1461 to wikikube-worker1114 [16:49:36] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1463 to wikikube-worker1116 [16:51:50] (03CR) 10BCornwall: [V:03+1] "As we don't yet own the domains, ncmonitor would want to delete these entries the next time it runs. Let's get our hands on the domains fi" [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [16:52:15] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1462 to wikikube-worker1115 - kamila@cumin1002" [16:52:21] 06SRE, 06Traffic: Define an event stream and schema for haproxy_requestctl analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10467505 (10Ottomata) Hi! It looks like [[ https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-secondary/-/commit/a4cc9ecad3d018487e7c215c605346b335... 
[16:52:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1462 to wikikube-worker1115 - kamila@cumin1002" [16:52:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:52:45] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1115 [16:52:56] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:53:57] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1115 [16:54:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1462 to wikikube-worker1115 [16:55:39] (03CR) 10Volans: [C:03+2] Resolve dependency issues related to Sphinx [software/cumin] - 10https://gerrit.wikimedia.org/r/1112065 (owner: 10Volans) [16:55:44] (03CR) 10Volans: [C:03+2] Add support for Python 3.13 [software/cumin] - 10https://gerrit.wikimedia.org/r/1112066 (owner: 10Volans) [16:56:37] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1463 to wikikube-worker1116 - kamila@cumin1002" [16:56:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1463 to wikikube-worker1116 - kamila@cumin1002" [16:56:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:56:42] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1116 [16:58:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1116 [16:59:32] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1463 to wikikube-worker1116 [16:59:43] 06SRE, 06Traffic: Refine 
add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914 (10Ottomata) 03NEW [16:59:46] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1111.eqiad.wmnet wikikube-worker1112.eqiad.wmnet wikikube-worker1113.eqiad.wmnet wikikube-worker1114.eqiad.wmnet wikikube-worker1115.eqiad.wmnet wikikube-worker1116.eqiad.wmnet on all recursors [16:59:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1111.eqiad.wmnet wikikube-worker1112.eqiad.wmnet wikikube-worker1113.eqiad.wmnet wikikube-worker1114.eqiad.wmnet wikikube-worker1115.eqiad.wmnet wikikube-worker1116.eqiad.wmnet on all recursors [16:59:56] 06SRE, 06Traffic: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10467543 (10Ottomata) p:05Triage→03High [17:00:05] jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T1700). [17:00:05] zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:26] o/ [17:00:42] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:00:56] o/ [17:02:47] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1111.eqiad.wmnet with OS bookworm [17:02:51] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1111 [17:02:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1111 [17:02:52] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1112.eqiad.wmnet with OS bookworm [17:02:56] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1112 [17:02:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1112 [17:02:58] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1113.eqiad.wmnet with OS bookworm [17:03:02] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1113 [17:03:02] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1113 [17:03:07] zabe: is your change on the board, I don't see it for some reason [17:03:13] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1115.eqiad.wmnet with OS bookworm [17:03:17] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1115 [17:03:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1115 [17:03:18] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1116.eqiad.wmnet with OS bookworm [17:03:21] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1116 [17:03:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host 
wikikube-worker1116 [17:03:28] yes [17:03:30] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1108522 [17:03:48] here the gerrit link [17:04:03] thanks [17:04:32] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS [17:04:32] v6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:04:38] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS6 [17:04:38] 6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:08:19] zabe: patch looks fine, happy to merge it in, but I am not sure what steps need to be taken post merge [17:09:42] we need to run scap on k8s afterwards, I can do that [17:09:53] nod, great, merging... 
[17:10:09] (03CR) 10JHathaway: [C:03+2] Add Apache configuration for wikipedia-zh-arbcom.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1108522 (https://phabricator.wikimedia.org/T380119) (owner: 10Zabe) [17:10:58] 06SRE, 06Traffic: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10467589 (10Ottomata) [17:11:12] (03Merged) 10jenkins-bot: Resolve dependency issues related to Sphinx [software/cumin] - 10https://gerrit.wikimedia.org/r/1112065 (owner: 10Volans) [17:11:12] (03Merged) 10jenkins-bot: Add support for Python 3.13 [software/cumin] - 10https://gerrit.wikimedia.org/r/1112066 (owner: 10Volans) [17:11:31] (03CR) 10Alexandros Kosiaris: [C:03+1] "Not in love with the comment, but it's not yours and it should be fixed in both profiles in a different patch. Thanks for carrying it over" [puppet] - 10https://gerrit.wikimedia.org/r/1111681 (owner: 10CDanis) [17:11:32] 06SRE, 06Traffic: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10467592 (10Ottomata) [17:11:41] (03CR) 10Alexandros Kosiaris: [C:03+1] urldownloader: scrub outbound privacy-sensitive hdrs [puppet] - 10https://gerrit.wikimedia.org/r/1111265 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [17:12:31] (03CR) 10CDanis: [C:03+2] urldownloader: squid_exporter monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1111681 (owner: 10CDanis) [17:14:17] (03CR) 10Giuseppe Lavagetto: [C:03+1] conftool: stub out extension configuration [puppet] - 10https://gerrit.wikimedia.org/r/1111703 (owner: 10CDanis) [17:14:20] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1114.eqiad.wmnet on all recursors [17:14:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1114.eqiad.wmnet on all recursors [17:14:49] zabe: merged [17:14:59] (03CR) 10CDanis: [C:03+2] conftool: stub out extension 
configuration [puppet] - 10https://gerrit.wikimedia.org/r/1111703 (owner: 10CDanis) [17:16:04] (03PS1) 10Cathal Mooney: Run validator when creating switch-interface in provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1112069 (https://phabricator.wikimedia.org/T383915) [17:16:05] 06SRE, 06Infrastructure-Foundations, 10netops: Netbox: execute interface validator in provision script for switch interfaces - https://phabricator.wikimedia.org/T383915 (10cmooney) 03NEW p:05Triage→03Low [17:17:42] jhathaway: thanks, could you do a puppet run on deploy2002? [17:17:53] yup... [17:18:10] (03CR) 10CI reject: [V:04-1] Run validator when creating switch-interface in provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1112069 (https://phabricator.wikimedia.org/T383915) (owner: 10Cathal Mooney) [17:18:40] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1111.eqiad.wmnet with reason: host reimage [17:18:50] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1112.eqiad.wmnet with reason: host reimage [17:19:03] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1115.eqiad.wmnet with reason: host reimage [17:19:04] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1113.eqiad.wmnet with reason: host reimage [17:19:14] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1116.eqiad.wmnet with reason: host reimage [17:20:29] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:22:12] !log zabe@deploy2002 Started scap sync-world: T380119 [17:22:15] T380119: Create arbcom-zh wiki - 
https://phabricator.wikimedia.org/T380119 [17:22:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1111.eqiad.wmnet with reason: host reimage [17:22:58] zabe: done, also as an aside, what a long puppet run!!!! 😴 [17:23:07] !log zabe@deploy2002 sync-world aborted: T380119 (duration: 01m 16s) [17:23:30] !log zabe@deploy2002 Started scap sync-world: T380119 [17:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10467639 (10phaultfinder) [17:24:43] (03CR) 10Volans: Run validator when creating switch-interface in provision script (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1112069 (https://phabricator.wikimedia.org/T383915) (owner: 10Cathal Mooney) [17:25:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1115.eqiad.wmnet with reason: host reimage [17:28:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1116.eqiad.wmnet with reason: host reimage [17:31:46] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1112.eqiad.wmnet with reason: host reimage [17:34:17] !log zabe@deploy2002 Finished scap sync-world: T380119 (duration: 10m 59s) [17:34:20] * zabe done [17:34:21] T380119: Create arbcom-zh wiki - https://phabricator.wikimedia.org/T380119 [17:35:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1113.eqiad.wmnet with reason: host reimage [17:36:30] (03PS9) 10Cathal Mooney: Fr-tech provision script to assign IPs and switch ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) [17:37:39] (03PS2) 10Cathal Mooney: Run validator when creating switch-interface in provision script [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/1112069 (https://phabricator.wikimedia.org/T383915) [17:38:28] (03CR) 10Cathal Mooney: Run validator when creating switch-interface in provision script (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1112069 (https://phabricator.wikimedia.org/T383915) (owner: 10Cathal Mooney) [17:38:47] (03PS3) 10JMeybohm: Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) [17:38:48] (03PS1) 10JMeybohm: sq [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112071 [17:39:33] (03CR) 10CI reject: [V:04-1] Run validator when creating switch-interface in provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1112069 (https://phabricator.wikimedia.org/T383915) (owner: 10Cathal Mooney) [17:39:47] (03CR) 10CI reject: [V:04-1] Update staging-codfw to k8s 1.31, calico 3.29 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:39:54] (03CR) 10CI reject: [V:04-1] sq [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112071 (owner: 10JMeybohm) [17:40:45] (03PS3) 10Cathal Mooney: Run validator when creating switch-interface in provision script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1112069 (https://phabricator.wikimedia.org/T383915) [17:40:57] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1111.eqiad.wmnet with OS bookworm [17:43:57] (03PS1) 10Volans: CHANGELOG: add changelogs for release v5.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/1112072 [17:44:17] (03CR) 10CDanis: [C:03+2] urldownloader: scrub outbound privacy-sensitive hdrs [puppet] - 10https://gerrit.wikimedia.org/r/1111265 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [17:44:35] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:44:37] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:44:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1254:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1254 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:44:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1115.eqiad.wmnet with OS bookworm [17:44:45] PROBLEM - Router interfaces on cr1-magru is CRITICAL: CRITICAL: host 195.200.68.128, interfaces up: 47, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:45:47] (03CR) 10Volans: [C:03+1] "LGTM, make sure to test it that indeed does the validation :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1112069 (https://phabricator.wikimedia.org/T383915) (owner: 10Cathal Mooney) [17:47:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1116.eqiad.wmnet with OS bookworm [17:49:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:50:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1112.eqiad.wmnet with OS bookworm [17:54:08] (03CR) 10Btullis: airflow: refactor/DRY the volume/volumeMounts accross containers (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112053 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [17:54:49] !log kamila@cumin1002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host wikikube-worker1113.eqiad.wmnet with OS bookworm [17:56:32] (03CR) 10Dzahn: [C:03+1] "ah! yea, makes sense of course!" [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [17:57:54] (03CR) 10CDanis: urldownloader: scrub outbound privacy-sensitive hdrs [puppet] - 10https://gerrit.wikimedia.org/r/1111265 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [17:59:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1011 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:00:04] bd808: OwO what's this, a deployment window?? Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T1800). nyaa~ [18:00:05] swfrench-wmf: OwO what's this, a deployment window?? MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T1800). 
nyaa~ [18:01:24] o/ [18:01:40] still wrangling a bit of preparation, but should be ready to start in a bit [18:10:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1011 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:10:54] (03CR) 10Effie Mouzeli: [C:03+1] mw-(web|api-ext)-next: php.version to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100555 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:11:01] (03CR) 10Effie Mouzeli: [C:03+1] hieradata: switch mw-(web|api-ext)-next to 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1100556 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:11:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10467716 (10phaultfinder) [18:14:17] (03PS1) 10BryanDavis: dumps(web): add reason when rejecting port 80 traffic [puppet] - 10https://gerrit.wikimedia.org/r/1112073 [18:14:52] (03CR) 10Scott French: "Thanks for the reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1100556 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:14:58] (03CR) 10Scott French: [C:03+2] hieradata: switch mw-(web|api-ext)-next to 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1100556 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:15:20] (03PS1) 10Bernard Wang: Configure streams for web empty search AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 [18:15:59] (03CR) 10Scott French: [C:03+2] mw-(web|api-ext)-next: php.version to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100555 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:16:02] (03CR) 10CI reject: [V:04-1] Configure streams for web empty search AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (owner: 10Bernard Wang) [18:17:13] (03Merged) 10jenkins-bot: mw-(web|api-ext)-next: php.version to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100555 (https://phabricator.wikimedia.org/T377040) (owner: 10Scott French) [18:17:21] (03CR) 10BryanDavis: "Adding vgutierrez as reviewer because they authored Ib291c1c95cb1b5170882bdb6d2b9484b54ac28f2" [puppet] - 10https://gerrit.wikimedia.org/r/1112073 (owner: 10BryanDavis) [18:20:27] waiting for puppet agent on deploy2002, but should be ready to deploy once that's done [18:20:36] (03PS1) 10BryanDavis: developer-portal: Bump container to 2025-01-16-121924-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112075 (https://phabricator.wikimedia.org/T362286) [18:20:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:22:05] !log cmooney@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 138398 [18:22:33] !log cmooney@cumin1002 END (PASS) - Cookbook 
sre.network.peering (exit_code=0) with action 'configure' for AS: 138398 [18:22:40] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v5.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/1112072 (owner: 10Volans) [18:23:12] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2025-01-16-121924-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112075 (https://phabricator.wikimedia.org/T362286) (owner: 10BryanDavis) [18:24:10] !log swfrench@deploy2002 Started scap sync-world: Deployment to switch next release files to 8.1 - T377040 [18:24:14] T377040: Turn up PHP 8.1-flavored k8s deployments for all MediaWiki services - https://phabricator.wikimedia.org/T377040 [18:24:24] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2025-01-16-121924-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112075 (https://phabricator.wikimedia.org/T362286) (owner: 10BryanDavis) [18:25:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on wikikube-worker1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:28:00] !log swfrench@deploy2002 Finished scap sync-world: Deployment to switch next release files to 8.1 - T377040 (duration: 03m 50s) [18:33:29] all done on my end [18:35:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on wikikube-worker1011:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:35:50] (03CR) 10Xcollazo: [C:03+1] dumps(web): add reason when rejecting port 80 traffic [puppet] - 10https://gerrit.wikimedia.org/r/1112073 (owner: 10BryanDavis) [18:37:33] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v5.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/1112072 (owner: 10Volans) [18:37:35] !log bd808@deploy2002 helmfile [staging] 
START helmfile.d/services/developer-portal: apply [18:37:52] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:38:32] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:38:55] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:39:26] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:40:58] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:45:40] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1063:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:49:44] (03CR) 10Majavah: [C:03+2] dumps(web): add reason when rejecting port 80 traffic [puppet] - 10https://gerrit.wikimedia.org/r/1112073 (owner: 10BryanDavis) [18:50:30] (03CR) 10CDanis: [C:03+2] urldownloader: scrub outbound privacy-sensitive hdrs [puppet] - 10https://gerrit.wikimedia.org/r/1111265 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [18:51:02] taavi: if mine is together with yours lgtm :) [18:51:07] ah it wasn't, cool [18:51:24] you're late for my merge :-P [19:00:05] brennen and dduvall: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T1900). 
[19:04:05] (03PS2) 10Bernard Wang: Configure streams for web empty search AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (https://phabricator.wikimedia.org/T380926) [19:04:49] (03CR) 10CI reject: [V:04-1] Configure streams for web empty search AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (https://phabricator.wikimedia.org/T380926) (owner: 10Bernard Wang) [19:04:51] RECOVERY - Router interfaces on cr1-magru is OK: OK: host 195.200.68.128, interfaces up: 48, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:05:55] (03PS1) 10Volans: Upstream release v5.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/1112079 [19:06:02] o/ [19:06:36] !log 1.44.0-wmf.12 train (T382363): no current blockers and logs calm, rolling to all wikis. [19:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:40] T382363: 1.44.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T382363 [19:06:41] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:06:43] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:07:03] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112080 (https://phabricator.wikimedia.org/T382363) [19:07:04] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112080 (https://phabricator.wikimedia.org/T382363) (owner: 10TrainBranchBot) [19:07:47] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112080 (https://phabricator.wikimedia.org/T382363) (owner: 10TrainBranchBot) [19:19:36] !log brennen@deploy2002 rebuilt and synchronized wikiversions 
files: group2 to 1.44.0-wmf.12 refs T382363 [19:19:40] T382363: 1.44.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T382363 [19:24:33] FIRING: KubernetesCalicoDown: wikikube-worker1111.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1111.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:25:16] (03CR) 10Btullis: airflow: define a K8sPodOperator pod template for pods needing access to hadoop (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112057 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [19:29:33] FIRING: [3x] KubernetesCalicoDown: wikikube-worker1111.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:34:33] FIRING: [4x] KubernetesCalicoDown: wikikube-worker1111.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:38:11] (03PS12) 10FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) [19:38:37] (03CR) 10CI reject: [V:04-1] prometheus-node-kernel-panic: use prom labels [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [19:39:33] FIRING: [5x] KubernetesCalicoDown: wikikube-worker1111.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:41:10] (03PS7) 10FNegri: prometheus-node-kernel-panic: rename to "messages" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T382961) [19:41:36] 
(03CR) 10CI reject: [V:04-1] prometheus-node-kernel-panic: rename to "messages" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [19:42:14] (03PS3) 10Bernard Wang: Configure streams for web empty search AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (https://phabricator.wikimedia.org/T380926) [19:42:14] (03PS1) 10Bernard Wang: Enable web search AB test stream in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112082 [19:42:58] (03CR) 10CI reject: [V:04-1] Enable web search AB test stream in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112082 (owner: 10Bernard Wang) [19:44:40] (03PS13) 10FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) [19:44:40] (03PS8) 10FNegri: prometheus-node-kernel-panic: rename to "messages" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T382961) [19:48:48] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1114.eqiad.wmnet with OS bookworm [19:48:51] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1114 [19:48:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1114 [19:49:28] (03CR) 10FNegri: "I tested the latest patchset by running it in cloudgw1002, where there are quite a few kernel errors of different kinds." 
[puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [19:50:07] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:04:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10468140 (10phaultfinder) [20:04:49] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1114.eqiad.wmnet with reason: host reimage [20:08:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1114.eqiad.wmnet with reason: host reimage [20:15:16] (03CR) 10Kimberly Sarabia: "Having trouble testing this and seeing a difference after updating my LocalSettings. Would like additional validation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (https://phabricator.wikimedia.org/T380926) (owner: 10Bernard Wang) [20:15:57] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:27:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1114.eqiad.wmnet with OS bookworm [20:28:22] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1111-1116].eqiad.wmnet [20:28:22] !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker[1111-1116].eqiad.wmnet [20:29:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10468299 (10kamila) [20:29:33] FIRING: [6x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:30:23] (03CR) 10Volans: [C:03+2] Upstream release v5.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/1112079 (owner: 10Volans) [20:32:27] (03CR) 10Kimberly Sarabia: [C:03+1] Configure streams for web empty search AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (https://phabricator.wikimedia.org/T380926) (owner: 10Bernard Wang) [20:33:43] PROBLEM - rt.wikimedia.org requires authentication on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:34:01] PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:34:33] FIRING: [7x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:34:49] PROBLEM - SSH on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:34:55] RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Sat 01 Feb 2025 08:26:00 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:37:51] RECOVERY - SSH on moscovium is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:39:33] FIRING: [9x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:42:22] FIRING: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:42:33] RECOVERY - rt.wikimedia.org requires authentication on moscovium is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 536 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:44:50] (03PS1) 10Catrope: Update chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112093 [20:45:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10468397 (10phaultfinder) [20:46:35] (03CR) 10CDanis: [C:03+2] Update chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112093 (owner: 10Catrope) [20:47:23] RESOLVED: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:47:32] (03Merged) 10jenkins-bot: Upstream release v5.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/1112079 (owner: 10Volans) [20:47:37] (03Merged) 10jenkins-bot: Update chart-renderer service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112093 (owner: 
10Catrope) [20:49:33] FIRING: [10x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:50:49] !log catrope@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [20:51:34] !log catrope@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [20:54:33] FIRING: [11x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:57:29] !log catrope@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [20:58:03] !log catrope@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [20:58:40] !log catrope@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [20:59:07] !log catrope@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T2100). [21:00:05] JSherman, cjming, Jdlrobson, and aqu: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:45] I'm going to move my patches to the web team deployment window that follows this one since the window is pretty packed! [21:00:53] here; noting that we used to get a t-shirt for break + fix, not just a sticker :-( [21:01:07] o/ [21:01:10] i can deploy [21:01:29] thanks! Mine is config only [21:01:36] Jdlrobson: thanks for that - i can do those once i slog thru the queue [21:01:45] JSherman: do you want to self-deploy? 
[21:02:00] Sure thing [21:02:20] cool - ping me when you're done and i'll do the rest of the queue [21:02:35] ack [21:02:58] (03CR) 10Jdlrobson: Configure streams for web empty search AB test (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (https://phabricator.wikimedia.org/T380926) (owner: 10Bernard Wang) [21:03:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111350 (https://phabricator.wikimedia.org/T380846) (owner: 10Chlod Alejandro) [21:03:40] (03CR) 10Jdlrobson: Configure streams for web empty search AB test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (https://phabricator.wikimedia.org/T380926) (owner: 10Bernard Wang) [21:04:07] (03Merged) 10jenkins-bot: Increase Nuke max age to 90 days (attempt 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111350 (https://phabricator.wikimedia.org/T380846) (owner: 10Chlod Alejandro) [21:04:24] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1111350|Increase Nuke max age to 90 days (attempt 2) (T380846)]] [21:04:28] T380846: Update $wgNukeMaxAge to 90 days in Nuke - https://phabricator.wikimedia.org/T380846 [21:05:53] (03CR) 10Kimberly Sarabia: [C:03+1] "I've been able to confirm this locally! This should be ready to backport imo." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (https://phabricator.wikimedia.org/T380926) (owner: 10Bernard Wang) [21:08:46] !log jsn@deploy2002 jsn, chlod: Backport for [[gerrit:1111350|Increase Nuke max age to 90 days (attempt 2) (T380846)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:09:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10468499 (10phaultfinder) [21:09:44] !log jsn@deploy2002 jsn, chlod: Continuing with sync [21:17:36] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111350|Increase Nuke max age to 90 days (attempt 2) (T380846)]] (duration: 13m 12s) [21:17:40] T380846: Update $wgNukeMaxAge to 90 days in Nuke - https://phabricator.wikimedia.org/T380846 [21:17:46] cjming: all yours! [21:17:57] JSherman: thanks! [21:18:35] (03PS2) 10Phuedx: Beta Cluster: Update MetricsPlatform extension config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111260 (https://phabricator.wikimedia.org/T381964) [21:19:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111260 (https://phabricator.wikimedia.org/T381964) (owner: 10Phuedx) [21:20:29] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:20:31] (03Merged) 10jenkins-bot: Beta Cluster: Update MetricsPlatform extension config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111260 (https://phabricator.wikimedia.org/T381964) (owner: 10Phuedx) [21:20:46] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1111260|Beta Cluster: Update MetricsPlatform extension config (T381964)]] [21:20:50] T381964: MetricsPlatform: Enable in production - 
https://phabricator.wikimedia.org/T381964 [21:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10468530 (10phaultfinder) [21:26:47] !log cjming@deploy2002 cjming, phuedx: Backport for [[gerrit:1111260|Beta Cluster: Update MetricsPlatform extension config (T381964)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:26:50] !log cjming@deploy2002 cjming, phuedx: Continuing with sync [21:26:51] T381964: MetricsPlatform: Enable in production - https://phabricator.wikimedia.org/T381964 [21:28:50] (03PS3) 10Phuedx: Enable MetricsPlatform extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111261 (https://phabricator.wikimedia.org/T381964) [21:33:42] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111260|Beta Cluster: Update MetricsPlatform extension config (T381964)]] (duration: 12m 55s) [21:33:46] T381964: MetricsPlatform: Enable in production - https://phabricator.wikimedia.org/T381964 [21:34:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111261 (https://phabricator.wikimedia.org/T381964) (owner: 10Phuedx) [21:35:16] (03Merged) 10jenkins-bot: Enable MetricsPlatform extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111261 (https://phabricator.wikimedia.org/T381964) (owner: 10Phuedx) [21:35:32] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1111261|Enable MetricsPlatform extension everywhere (T381964)]] [21:36:28] Hi JSherman . Is the deploying window closed ? 
[21:36:58] aqu: i'm deploying -- doing my 3 patches and then will do yours [21:37:08] and then web team's patches [21:37:22] oh Thank you [21:40:07] !log cjming@deploy2002 cjming, phuedx: Backport for [[gerrit:1111261|Enable MetricsPlatform extension everywhere (T381964)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:40:11] T381964: MetricsPlatform: Enable in production - https://phabricator.wikimedia.org/T381964 [21:40:12] !log cjming@deploy2002 cjming, phuedx: Continuing with sync [21:40:56] (03PS2) 10Phuedx: testwiki: Enable MetricsPlatform stream config fetching and merging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111262 (https://phabricator.wikimedia.org/T381964) [21:47:16] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111261|Enable MetricsPlatform extension everywhere (T381964)]] (duration: 11m 43s) [21:47:20] T381964: MetricsPlatform: Enable in production - https://phabricator.wikimedia.org/T381964 [21:47:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111262 (https://phabricator.wikimedia.org/T381964) (owner: 10Phuedx) [21:47:46] !log cdanis@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [21:48:00] !log cdanis@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [21:48:16] (03Merged) 10jenkins-bot: testwiki: Enable MetricsPlatform stream config fetching and merging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111262 (https://phabricator.wikimedia.org/T381964) (owner: 10Phuedx) [21:48:33] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1111262|testwiki: Enable MetricsPlatform stream config fetching and merging (T381964)]] [21:49:10] !log cdanis@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [21:49:24] !log cdanis@deploy2002 helmfile [staging] DONE 
helmfile.d/services/chart-renderer: apply [21:49:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1111:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1111 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:54:29] !log cjming@deploy2002 phuedx, cjming: Backport for [[gerrit:1111262|testwiki: Enable MetricsPlatform stream config fetching and merging (T381964)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:54:33] T381964: MetricsPlatform: Enable in production - https://phabricator.wikimedia.org/T381964 [21:54:35] !log cjming@deploy2002 phuedx, cjming: Continuing with sync [21:54:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1111:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1111 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:57:01] (03PS2) 10Aqu: Analytics - Set parameters to Refine content history reconcile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111630 (https://phabricator.wikimedia.org/T369845) [21:58:26] aqu: if you're still around, i'll do your patch here next [21:58:55] I'm here [21:59:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10468611 (10phaultfinder) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250116T2200) [22:01:05] Hello! We will be using the window today I'll be back to deploy in a few minutes [22:01:24] !log uploaded cumin_5.0.0 to apt.wikimedia.org bullseye-wikimedia [22:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:34] hi toyofuku! i have one more patch to do - can i ping you when i'm done? 
[22:01:44] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111262|testwiki: Enable MetricsPlatform stream config fetching and merging (T381964)]] (duration: 13m 10s) [22:01:48] T381964: MetricsPlatform: Enable in production - https://phabricator.wikimedia.org/T381964 [22:01:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111630 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [22:02:43] (03Merged) 10jenkins-bot: Analytics - Set parameters to Refine content history reconcile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111630 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [22:02:51] (03PS1) 10Jdlrobson: Enable Vector 2022 and dark mode on Azerbaijani wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112101 (https://phabricator.wikimedia.org/T383942) [22:02:59] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1111630|Analytics - Set parameters to Refine content history reconcile (T369845)]] [22:03:03] T369845: [Refine Refactoring] Refine jobs should be scheduled by Airflow: deployment - https://phabricator.wikimedia.org/T369845 [22:04:33] cjming: no worries! lmk when you're all set [22:04:50] cool - thanks [22:07:11] aqu: ok to sync? 
[22:07:57] !log cjming@deploy2002 aqu, cjming: Backport for [[gerrit:1111630|Analytics - Set parameters to Refine content history reconcile (T369845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:08:17] I suppose [22:08:22] !log cjming@deploy2002 aqu, cjming: Continuing with sync [22:09:33] FIRING: [12x] KubernetesCalicoDown: mw1448.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:10:12] (03CR) 10Bernard Wang: Configure streams for web empty search AB test (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (https://phabricator.wikimedia.org/T380926) (owner: 10Bernard Wang) [22:11:19] (03PS4) 10Jdlrobson: Beta: Update schemas in InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111715 (https://phabricator.wikimedia.org/T382080) [22:11:26] (03CR) 10Jdlrobson: [C:04-1] "Squashed into https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1111715" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (https://phabricator.wikimedia.org/T380926) (owner: 10Bernard Wang) [22:13:12] I can see the deployed modifications here: https://meta.wikimedia.org/w/api.php?action=streamconfigs [22:13:12] TY ! [22:13:24] nice :) [22:15:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10468699 (10phaultfinder) [22:15:37] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111630|Analytics - Set parameters to Refine content history reconcile (T369845)]] (duration: 12m 37s) [22:15:39] toyofuku: all yours! [22:15:41] T369845: [Refine Refactoring] Refine jobs should be scheduled by Airflow: deployment - https://phabricator.wikimedia.org/T369845 [22:15:46] Thank you! [22:15:50] Jdlrobson: here? 
[22:15:50] yw :) [22:17:10] Well, you have a second bc the host key has changed and it's not letting me ssh 🙃 [22:17:15] RoanKattouw: yup [22:17:48] do you need me to deploy? [22:18:43] I think I can figure it out! If you're itching to deploy you're certainly welcome but I should be up and running in a few minutes [22:19:17] no worries - lmk if you need me to do them [22:19:25] all good - I'm in now [22:19:31] great! [22:19:49] deploying from `deploy2002` today how exciting [22:20:06] starting with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1111715 [22:20:56] 😀 [22:21:02] Jdlrobson: Change '1111715' has dependencies '[1112074]', which are not merged or scheduled for backport [22:21:08] Doesn't look like that's true though? [22:23:15] I see - originally it was a dependency but the two patches were merged into this one [22:23:44] looking [22:23:55] Not sure why it's registering a dependency though - there's no `Depends-on` in the commit message [22:24:07] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1111715 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1112074/3?usp=related-change should be merged together [22:24:11] Presumably something to do with the base branch? [22:24:19] They are chained because 1111715 cannot be tested without the other [22:24:30] you need to +2 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1112074/3?usp=related-change first [22:24:42] (03CR) 10Jdlrobson: [C:03+1] Configure streams for web empty search AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (https://phabricator.wikimedia.org/T380926) (owner: 10Bernard Wang) [22:24:47] Okay, so https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1112074 _is_ getting deployed? 
[22:25:18] Yep sorry I -1ed the wrong patch [22:25:37] I see [22:25:46] So, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1112074 first, then https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1111715? [22:25:52] (03CR) 10Jdlrobson: [C:04-1] "Squashed into https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1111715/4?usp=related-change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112082 (owner: 10Bernard Wang) [22:27:39] I skimmed it and it looks valid to me - proceeding with deploy [22:27:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (https://phabricator.wikimedia.org/T380926) (owner: 10Bernard Wang) [22:28:27] (03Merged) 10jenkins-bot: Configure streams for web empty search AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (https://phabricator.wikimedia.org/T380926) (owner: 10Bernard Wang) [22:28:43] !log toyofuku@deploy2002 Started scap sync-world: Backport for [[gerrit:1112074|Configure streams for web empty search AB test (T380926)]] [22:28:47] T380926: Configure new streams for the mobile search A/B test - https://phabricator.wikimedia.org/T380926 [22:30:32] This is what I'm listening to while deploying: https://open.spotify.com/album/5L3PAo50R75rOZLlEvokZZ?si=536HV3kAR9OHTg0J_yIUSw [22:30:45] I'm seeing him live in June! [22:32:59] ooh irccloud does spotify embeds [22:33:24] !log toyofuku@deploy2002 bwang, toyofuku: Backport for [[gerrit:1112074|Configure streams for web empty search AB test (T380926)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:33:35] Jdlrobson: testable? 
[22:33:45] PROBLEM - rt.wikimedia.org requires authentication on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:33:55] PROBLEM - SSH on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:33:59] (03CR) 10Jdlrobson: [C:03+1] Configure streams for web empty search AB test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112074 (https://phabricator.wikimedia.org/T380926) (owner: 10Bernard Wang) [22:34:07] PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:34:45] RECOVERY - rt.wikimedia.org requires authentication on moscovium is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 537 bytes in 9.680 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:34:45] RECOVERY - SSH on moscovium is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:34:57] RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Sat 01 Feb 2025 08:26:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:35:37] toyofuku: trying to work that out now.. :) [22:35:42] lol [22:36:05] presumably we could send events to the stream but not sure how easy it is to see them and the sample rate is zero so...... [22:36:13] I'm good to proceed and we can test with the beta one? [22:38:17] toyofuku: i think it's working [22:38:20] sick [22:38:23] I'll proceed [22:38:25] !log toyofuku@deploy2002 bwang, toyofuku: Continuing with sync [22:38:26] https://www.irccloud.com/pastebin/sDJlIRij/ [22:38:36] I called the above and it was accepted so that's good enough for me [22:38:42] perfect [22:39:06] but not sure if that means owt.. 
the code is not wired up anywhere yet but it's not going to break anything :) [22:39:42] I mean, try it with a stream that doesn't exist and see if you get a diff response? [22:39:51] In any case, we're deploying 🙃 [22:40:17] i see no network event [22:40:29] perfect [22:40:45] (03PS1) 10Zabe: Initial configurations for arbcom_zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112102 (https://phabricator.wikimedia.org/T380119) [22:41:27] (03CR) 10CI reject: [V:04-1] Initial configurations for arbcom_zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112102 (https://phabricator.wikimedia.org/T380119) (owner: 10Zabe) [22:44:41] And https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1111715 can just be +2ed since it only impacts beta cluster. [22:44:53] (we'll have to wait to test that one - I can test it later this afternoon) [22:45:23] Sadly I don't have +2 rights in mediawiki config so I think our only path is to deploy it? [22:45:31] !log toyofuku@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112074|Configure streams for web empty search AB test (T380926)]] (duration: 16m 47s) [22:45:35] T380926: Configure new streams for the mobile search A/B test - https://phabricator.wikimedia.org/T380926 [22:45:54] Jdlrobson: with that in mind, any preference for which one we do next? 
[22:47:46] Gonna do `1108135` while we wait since I'd like to finish as close to 3 as possible [22:48:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108135 (https://phabricator.wikimedia.org/T332728) (owner: 10Kimberly Sarabia) [22:48:27] PROBLEM - Docker registry HTTPS interface on registry1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [22:48:42] (03Merged) 10jenkins-bot: Remove `wgVectorStickyHeader` from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108135 (https://phabricator.wikimedia.org/T332728) (owner: 10Kimberly Sarabia) [22:49:00] !log toyofuku@deploy2002 Started scap sync-world: Backport for [[gerrit:1108135|Remove `wgVectorStickyHeader` from InitialiseSettings.php (T332728)]] [22:49:04] T332728: Remove sticky header configuration (VectorStickyHeader) - https://phabricator.wikimedia.org/T332728 [22:49:09] (03PS2) 10Zabe: Initial configurations for arbcom_zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112102 (https://phabricator.wikimedia.org/T380119) [22:49:17] RECOVERY - Docker registry HTTPS interface on registry1005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Docker [22:49:47] (03CR) 10CI reject: [V:04-1] Initial configurations for arbcom_zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112102 (https://phabricator.wikimedia.org/T380119) (owner: 10Zabe) [22:53:18] !log toyofuku@deploy2002 ksarabia, toyofuku: Backport for [[gerrit:1108135|Remove `wgVectorStickyHeader` from InitialiseSettings.php (T332728)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:53:23] !log toyofuku@deploy2002 ksarabia, toyofuku: Continuing with sync [22:53:34] proceeding as that one is removing old code [22:54:09] toyofuku: yep sounds good [22:54:16] no preference on 
order [22:54:28] I'll do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1107964 next and I can test that [22:54:42] (03PS3) 10Jdlrobson: Stop expanding sections by default on Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107964 (https://phabricator.wikimedia.org/T376446) [22:54:53] Then deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1111715 and that's it right? [22:58:33] wfm! [22:58:42] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1107964 is the only user facing one [22:59:06] and if all is good we can just get rid of the associated MobileFrontend code. [23:00:43] sounds good [23:00:55] This one's almost done [23:01:07] Gonna try to be speedy with these last two bc it's my mom's birthday and I need to call her lol [23:01:13] !log toyofuku@deploy2002 Finished scap sync-world: Backport for [[gerrit:1108135|Remove `wgVectorStickyHeader` from InitialiseSettings.php (T332728)]] (duration: 12m 13s) [23:01:14] :) [23:01:17] T332728: Remove sticky header configuration (VectorStickyHeader) - https://phabricator.wikimedia.org/T332728 [23:01:32] (03PS3) 10Zabe: Initial configurations for arbcom_zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112102 (https://phabricator.wikimedia.org/T380119) [23:01:36] Chop Suey! 
might be appropriate deploying music then haha [23:01:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107964 (https://phabricator.wikimedia.org/T376446) (owner: 10Jdlrobson) [23:02:12] (03CR) 10CI reject: [V:04-1] Initial configurations for arbcom_zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112102 (https://phabricator.wikimedia.org/T380119) (owner: 10Zabe) [23:02:23] (03Merged) 10jenkins-bot: Stop expanding sections by default on Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107964 (https://phabricator.wikimedia.org/T376446) (owner: 10Jdlrobson) [23:02:26] lolll [23:02:38] !log toyofuku@deploy2002 Started scap sync-world: Backport for [[gerrit:1107964|Stop expanding sections by default on Wiktionary (T376446)]] [23:02:42] T376446: Enable $wgMFCollapseSectionsByDefault on English Wiktionary - https://phabricator.wikimedia.org/T376446 [23:04:23] (03PS4) 10Zabe: Initial configurations for arbcom_zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112102 (https://phabricator.wikimedia.org/T380119) [23:05:38] toyofuku: fix confirmed [23:06:02] Which fix? We haven't even deployed to test servers yet haha [23:06:08] the section collapsing? [23:06:10] oh nvm they're almost done [23:06:10] seems to be working for me? [23:06:11] :O [23:06:17] and it was working before chop suey! completed [23:06:59] !log toyofuku@deploy2002 jdlrobson, toyofuku: Backport for [[gerrit:1107964|Stop expanding sections by default on Wiktionary (T376446)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:07:07] !log toyofuku@deploy2002 jdlrobson, toyofuku: Continuing with sync [23:07:13] confirmed it works also [23:07:25] maybe there [23:07:27] lol so how is it working?! 
[23:07:28] 's a lag [23:12:07] (03PS5) 10Zabe: Initial configurations for arbcom_zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112102 (https://phabricator.wikimedia.org/T380119) [23:14:02] So just https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1111715?usp=search to go now? [23:14:11] !log toyofuku@deploy2002 Finished scap sync-world: Backport for [[gerrit:1107964|Stop expanding sections by default on Wiktionary (T376446)]] (duration: 11m 32s) [23:14:15] T376446: Enable $wgMFCollapseSectionsByDefault on English Wiktionary - https://phabricator.wikimedia.org/T376446 [23:14:15] Yep! [23:14:24] starting now [23:14:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10468820 (10phaultfinder) [23:15:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111715 (https://phabricator.wikimedia.org/T382080) (owner: 10Jdlrobson) [23:16:44] (03Merged) 10jenkins-bot: Beta: Update schemas in InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111715 (https://phabricator.wikimedia.org/T382080) (owner: 10Jdlrobson) [23:17:00] !log toyofuku@deploy2002 Started scap sync-world: Backport for [[gerrit:1111715|Beta: Update schemas in InitialiseSettings-labs.php (T382080)]] [23:17:04] T382080: Search recommendation clicks should trigger events - https://phabricator.wikimedia.org/T382080 [23:21:21] !log toyofuku@deploy2002 toyofuku, jdlrobson: Backport for [[gerrit:1111715|Beta: Update schemas in InitialiseSettings-labs.php (T382080)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:21:31] !log toyofuku@deploy2002 toyofuku, jdlrobson: Continuing with sync [23:21:40] proceeding as it's beta only [23:21:43] thx toyofuku ! Call your mom! [23:21:56] Jdlrobson: ready for testing if you have anything to test [23:22:02] hahaha not quite yet but all good!! 
[23:27:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:28:23] !log toyofuku@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111715|Beta: Update schemas in InitialiseSettings-labs.php (T382080)]] (duration: 11m 23s) [23:28:29] T382080: Search recommendation clicks should trigger events - https://phabricator.wikimedia.org/T382080 [23:28:35] all done! thanks for playing everyone [23:28:37] stream feid! [23:28:58] <3 [23:29:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10468839 (10phaultfinder) [23:30:30] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:37:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1251:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1251 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:39:42] (03PS6) 10Zabe: Initial configurations for arbcom_zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112102 (https://phabricator.wikimedia.org/T380119) [23:39:46] (03PS1) 10Clare Ming: Enable the text experiment on testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112105 (https://phabricator.wikimedia.org/T373715) [23:43:15] jouncebot: nowandnext [23:43:15] No deployments scheduled for the next 7 hour(s) and 16 minute(s) [23:43:15] In 7 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250117T0700) 
[23:43:21] (03CR) 10Zabe: [C:03+2] Initial configurations for arbcom_zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112102 (https://phabricator.wikimedia.org/T380119) (owner: 10Zabe) [23:44:10] (03Merged) 10jenkins-bot: Initial configurations for arbcom_zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112102 (https://phabricator.wikimedia.org/T380119) (owner: 10Zabe) [23:45:47] !log zabe@deploy2002 Started scap sync-world: Creating arbcom_zhwiki (T380119) [23:45:51] T380119: Create arbcom-zh wiki - https://phabricator.wikimedia.org/T380119 [23:50:20] !log zabe@deploy2002 zabe: Creating arbcom_zhwiki (T380119) synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:50:43] (03PS1) 10Zabe: Update composer.lock [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112111 [23:51:18] !log zabe@deploy2002 zabe: Continuing with sync [23:52:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1251:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1251 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:58:30] !log zabe@deploy2002 Finished scap sync-world: Creating arbcom_zhwiki (T380119) (duration: 12m 43s) [23:58:34] T380119: Create arbcom-zh wiki - https://phabricator.wikimedia.org/T380119