[00:02:43] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1131861|maintenance: Add support for unlocking accounts in LockUser.php]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:12:25] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:28:54] (PS1) DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1131866 (https://phabricator.wikimedia.org/T381544)
[00:31:00] (CR) DDesouza: [C:+2] miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1131866 (https://phabricator.wikimedia.org/T381544) (owner: DDesouza)
[00:32:40] (Merged) jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1131866 (https://phabricator.wikimedia.org/T381544) (owner: DDesouza)
[00:35:19] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[00:35:34] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[00:35:35] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[00:35:55] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[00:35:56] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[00:36:13] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[00:37:01] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[00:38:58] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1131870
[00:38:58] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1131870 (owner: TrainBranchBot)
[00:44:48] SRE, WMF-General-or-Unknown, Performance Issue, Wikimedia-production-error: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" errors - https://phabricator.wikimedia.org/T389734#10685360 (Krinkle) >>! In T389734#10676730, @Krinkle wrote: > […] > Logstash shows the URLs...
[00:45:57] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[00:47:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2303.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:47:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2304.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:47:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2303.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:48:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2305.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:48:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2304.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:48:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2305.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:48:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2306.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:49:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2309.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:49:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2306.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:49:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2310.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:49:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2309.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:49:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2310.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:49:47] ops-eqiad, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T390254 (phaultfinder) NEW
[00:50:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2311.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:50:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2312.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:50:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2311.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:50:50] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1131870 (owner: TrainBranchBot)
[00:50:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2313.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:51:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2312.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:51:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2313.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:51:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2318.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:51:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2319.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:52:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2318.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:52:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2320.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:52:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2319.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:52:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2320.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:53:06] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131861|maintenance: Add support for unlocking accounts in LockUser.php]] (duration: 54m 51s)
[00:55:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2291.codfw.wmnet with OS bookworm
[00:55:12] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685374 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2291.codfw.wmnet with...
[00:55:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2303.codfw.wmnet with OS bookworm
[00:55:21] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685375 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2303.codfw.wmnet with...
[00:55:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2304.codfw.wmnet with OS bookworm
[00:55:34] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685376 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2304.codfw.wmnet with...
[00:57:01] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[01:06:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2291.codfw.wmnet with reason: host reimage
[01:06:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2303.codfw.wmnet with reason: host reimage
[01:07:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2304.codfw.wmnet with reason: host reimage
[01:09:00] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1131871
[01:09:00] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1131871 (owner: TrainBranchBot)
[01:10:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2291.codfw.wmnet with reason: host reimage
[01:13:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2303.codfw.wmnet with reason: host reimage
[01:17:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2304.codfw.wmnet with reason: host reimage
[01:25:23] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:25:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:25:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2291.codfw.wmnet with OS bookworm
[01:25:54] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685417 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2291.codfw.wmnet with OS...
[01:28:34] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:29:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:29:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2303.codfw.wmnet with OS bookworm
[01:29:17] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685418 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2303.codfw.wmnet with OS...
[01:32:04] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:33:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:33:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2304.codfw.wmnet with OS bookworm
[01:33:12] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685422 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2304.codfw.wmnet with OS...
[01:37:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2305.codfw.wmnet with OS bookworm
[01:38:05] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685423 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2305.codfw.wmnet with...
[01:38:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2306.codfw.wmnet with OS bookworm
[01:38:51] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685424 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2306.codfw.wmnet with...
[01:38:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2307.codfw.wmnet with OS bookworm
[01:39:05] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685425 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2307.codfw.wmnet with...
[01:49:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685426 (phaultfinder)
[01:49:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2305.codfw.wmnet with reason: host reimage
[01:50:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2306.codfw.wmnet with reason: host reimage
[01:50:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2307.codfw.wmnet with reason: host reimage
[01:53:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2305.codfw.wmnet with reason: host reimage
[01:56:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2307.codfw.wmnet with reason: host reimage
[02:00:01] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1131871 (owner: TrainBranchBot)
[02:00:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2306.codfw.wmnet with reason: host reimage
[02:09:14] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:10:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:10:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2305.codfw.wmnet with OS bookworm
[02:11:04] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685428 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2305.codfw.wmnet with OS...
[02:11:59] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:12:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:12:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2307.codfw.wmnet with OS bookworm
[02:12:32] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685429 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2307.codfw.wmnet with OS...
[02:15:45] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:16:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:16:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2306.codfw.wmnet with OS bookworm
[02:16:16] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685431 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2306.codfw.wmnet with OS...
[02:18:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2308.codfw.wmnet with OS bookworm
[02:18:49] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685432 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2308.codfw.wmnet with...
[02:19:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2309.codfw.wmnet with OS bookworm
[02:19:11] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685433 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2309.codfw.wmnet with...
[02:19:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2310.codfw.wmnet with OS bookworm
[02:19:21] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685434 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2310.codfw.wmnet with...
[02:30:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2308.codfw.wmnet with reason: host reimage
[02:30:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2309.codfw.wmnet with reason: host reimage
[02:30:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2310.codfw.wmnet with reason: host reimage
[02:32:59] ops-magru: OutboundInterfaceErrors - https://phabricator.wikimedia.org/T390258 (phaultfinder) NEW
[02:33:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2308.codfw.wmnet with reason: host reimage
[02:37:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2309.codfw.wmnet with reason: host reimage
[02:39:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2310.codfw.wmnet with reason: host reimage
[02:48:50] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:52:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:52:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2308.codfw.wmnet with OS bookworm
[02:52:14] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685445 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2308.codfw.wmnet with OS...
[02:52:15] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:52:25] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:54:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[02:57:48] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[03:00:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[03:00:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2309.codfw.wmnet with OS bookworm
[03:00:14] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685449 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2309.codfw.wmnet with OS...
[03:01:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[03:01:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2310.codfw.wmnet with OS bookworm
[03:01:25] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685458 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2310.codfw.wmnet with OS...
[03:01:59] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10685460 (Jhancock.wm)
[03:19:27] FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:27:25] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:30:02] ops-eqiad, SRE, DC-Ops: OutboundInterfaceErrors - https://phabricator.wikimedia.org/T389992#10685487 (phaultfinder)
[04:42:01] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[05:02:01] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:07:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-eqiad and Hurricane Electric (2001:470:0:1c0::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[05:14:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685512 (phaultfinder)
[05:17:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-eqiad and Hurricane Electric (2001:470:0:1c0::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250328T0600)
[06:19:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685526 (phaultfinder)
[06:34:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685529 (phaultfinder)
[06:49:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685547 (phaultfinder)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250328T0700)
[07:17:58] (Abandoned) Muehlenhoff: apt::package_from_bpo: Fail if used on Bullseye [puppet] - https://gerrit.wikimedia.org/r/1130084 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[07:19:27] FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:20:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685553 (phaultfinder)
[07:20:50] (PS1) Marostegui: check_depooled: Remove x2, add ms[123] [software] - https://gerrit.wikimedia.org/r/1131888 (https://phabricator.wikimedia.org/T387332)
[07:22:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2243.codfw.wmnet with reason: Maintenance
[07:27:25] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:30:42] (CR) Marostegui: [C:+2] check_depooled: Remove x2, add ms[123] [software] - https://gerrit.wikimedia.org/r/1131888 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[07:31:13] (Merged) jenkins-bot: check_depooled: Remove x2, add ms[123] [software] - https://gerrit.wikimedia.org/r/1131888 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[07:34:38] (CR) Slyngshede: [V:+1 C:+2] D:apereo_cas::service do not excluded unfiltered attributes [puppet] - https://gerrit.wikimedia.org/r/1131730 (owner: Slyngshede)
[07:35:00] (PS9) Filippo Giunchedi: search-grafana-dashboards: format results as markdown, and add --json [software] - https://gerrit.wikimedia.org/r/1129242
[07:35:10] ops-codfw, SRE, DBA, DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10685569 (Marostegui) I've rebooted the host and now the RAID is being rebuilt. ` root@db2243:/home/marostegui# ./storcli64 /c0 /e252 /s4 show CLI Version = 007.3205.0000.0000...
[07:35:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685570 (phaultfinder)
[07:35:53] (CR) Filippo Giunchedi: "> Per Filippo: The queries sublists are useful to a certain use case, so I propose we remove the trimmed version and instead under a flag " [software] - https://gerrit.wikimedia.org/r/1129242 (owner: Filippo Giunchedi)
[07:39:27] FIRING: [4x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:40:31] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1121638 (owner: Majavah)
[07:40:53] (CR) Muehlenhoff: "Ack. I've +1d your patch, will abandon this one when merged." [puppet] - https://gerrit.wikimedia.org/r/1115316 (owner: Muehlenhoff)
[07:41:37] (CR) Muehlenhoff: "The underlying firewall rule was dropped two months ago, safe to remove for good." [puppet] - https://gerrit.wikimedia.org/r/1112228 (https://phabricator.wikimedia.org/T367315) (owner: Muehlenhoff)
[07:42:06] (CR) Muehlenhoff: [C:+2] Remove rsync from archiva [puppet] - https://gerrit.wikimedia.org/r/1112228 (https://phabricator.wikimedia.org/T367315) (owner: Muehlenhoff)
[07:48:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 31 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131335 (https://phabricator.wikimedia.org/T389971) (owner: DCausse)
[07:53:08] (PS1) Muehlenhoff: Create insetup role for SRE o11y with nftables and rename existing one [puppet] - https://gerrit.wikimedia.org/r/1131930 (https://phabricator.wikimedia.org/T389825)
[07:54:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685582 (phaultfinder)
[07:57:28] (CR) Majavah: [C:+2] P:wmcs: Drop unused postgres class [puppet] - https://gerrit.wikimedia.org/r/1121638 (owner: Majavah)
[08:01:49] (Abandoned) Muehlenhoff: wmcs::services::postgres::primary: Use wmflib::debian_postgresql_version() [puppet] - https://gerrit.wikimedia.org/r/1115316 (owner: Muehlenhoff)
[08:02:39] (PS2) Muehlenhoff: Remove use of openstack-db repository component [puppet] - https://gerrit.wikimedia.org/r/1117838
[08:04:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685590 (phaultfinder)
[08:04:53] (PS1) Muehlenhoff: archiva: Remove rsync even harder [puppet] - https://gerrit.wikimedia.org/r/1131932
[08:05:39] (CR) Majavah: [C:+1] Remove use of openstack-db repository component [puppet] - https://gerrit.wikimedia.org/r/1117838 (owner: Muehlenhoff)
[08:09:55] (CR) Muehlenhoff: [C:+2] archiva: Remove rsync even harder [puppet] - https://gerrit.wikimedia.org/r/1131932 (owner: Muehlenhoff)
[08:11:03] (CR) Slyngshede: [V:+2 C:+2] Upgrade CAS to version 7.1.4 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1111636 (owner: Slyngshede)
[08:19:41] (CR) Federico Ceratto: [C:+1] clone.py: Add logic to handle hosts unknown to dbctl [cookbooks] - https://gerrit.wikimedia.org/r/1131711 (https://phabricator.wikimedia.org/T388383) (owner: Federico Ceratto)
[08:19:42] (CR) Federico Ceratto: [C:+2] clone.py: Add logic to handle hosts unknown to dbctl [cookbooks] - https://gerrit.wikimedia.org/r/1131711 (https://phabricator.wikimedia.org/T388383) (owner: Federico Ceratto)
[08:21:32] (CR) Marostegui: [C:+1] clone.py: Fix depooling source vs target [cookbooks] - https://gerrit.wikimedia.org/r/1131673 (https://phabricator.wikimedia.org/T388383) (owner: Federico Ceratto)
[08:24:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685598 (phaultfinder)
[08:27:44] ops-codfw, SRE, DBA, DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10685613 (elukey) @Marostegui in the other similar hot-swap testing task we found out that only a controller restart would trigger the disk to be recognized again as JBOD, and...
[08:34:53] (CR) Filippo Giunchedi: [C:+1] Create insetup role for SRE o11y with nftables and rename existing one [puppet] - https://gerrit.wikimedia.org/r/1131930 (https://phabricator.wikimedia.org/T389825) (owner: Muehlenhoff)
[08:35:02] ops-codfw, SRE, DBA, DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10685624 (Marostegui) >>! In T388684#10685613, @elukey wrote: > @Marostegui in the other similar hot-swap testing task we found out that only a controller restart would trigger...
[08:36:27] (03CR) 10DCausse: "Should be ready to go" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124486 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [08:37:06] (03CR) 10Daniel Kinzler: [C:03+1] "let's do it!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131384 (https://phabricator.wikimedia.org/T389407) (owner: 10BPirkle) [08:37:10] !log elukey@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on centrallog2002.codfw.wmnet with reason: Test stopping benthos webrequest-live [08:38:23] !log stop benthos-webrequest_live on centrallog2002.codfw.wmnet to test handling load/traffic with one instance - T390029 [08:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:28] T390029: Migrate Benthos `webrequest_sampled_live` to feed from HAProxy data - https://phabricator.wikimedia.org/T390029 [08:39:47] !log bounce mtail on centrallog1002 - stuck on cpu [08:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:58] (03PS1) 10Muehlenhoff: Add cumin1003 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1131933 (https://phabricator.wikimedia.org/T389380) [08:44:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685634 (10phaultfinder) [08:47:01] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [08:48:54] (03CR) 10Federico Ceratto: [C:03+1] clone.py: Fix depooling source vs target [cookbooks] - 10https://gerrit.wikimedia.org/r/1131673 (https://phabricator.wikimedia.org/T388383) (owner: 10Federico Ceratto) [08:48:55] (03CR) 10Federico Ceratto: [C:03+2] clone.py: Fix depooling source vs target [cookbooks] - 10https://gerrit.wikimedia.org/r/1131673 (https://phabricator.wikimedia.org/T388383) (owner: 10Federico Ceratto) [08:51:37] (03Abandoned) 10Elukey: benthos: bump webrequest_live instance's thread to 48 [puppet] - 
10https://gerrit.wikimedia.org/r/1131747 (https://phabricator.wikimedia.org/T390029) (owner: 10Elukey) [08:54:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685654 (10phaultfinder) [08:55:49] (03PS1) 10Muehlenhoff: No longer refer to a Phabricator task in the access request caption [software/bitu] - 10https://gerrit.wikimedia.org/r/1131934 [09:06:11] (03PS1) 10Filippo Giunchedi: pontoon: write config.yaml at stack creation time [puppet] - 10https://gerrit.wikimedia.org/r/1131935 [09:06:11] (03PS1) 10Filippo Giunchedi: git-sync-upstream: set user name [puppet] - 10https://gerrit.wikimedia.org/r/1131936 [09:06:11] (03PS1) 10Filippo Giunchedi: pontoon: retry ssh commands [puppet] - 10https://gerrit.wikimedia.org/r/1131937 [09:06:12] (03PS1) 10Filippo Giunchedi: pontoon: misc cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1131938 [09:07:01] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [09:11:08] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1131934 (owner: 10Muehlenhoff) [09:12:10] (03CR) 10Marostegui: [C:03+1] clone.py: Add basic functional test [cookbooks] - 10https://gerrit.wikimedia.org/r/1131750 (https://phabricator.wikimedia.org/T388383) (owner: 10Federico Ceratto) [09:16:10] (03CR) 10Filippo Giunchedi: [C:03+1] netbox-hiera: adding pdu type [puppet] - 10https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [09:18:00] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5169/co" [puppet] - 10https://gerrit.wikimedia.org/r/1125098 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [09:19:27] FIRING: [4x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:20:34] !log fabfur@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on cp4047.ulsfo.wmnet with reason: HW errors [09:20:41] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10685760 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=41518621-9d6e-490d-85ad-6878a0e78166) set by fabfur@cumin1002 for 3 days, 0:00:00 on 1 host(s) and their services with reason:... [09:21:32] !log restarting blazegraph on wdqs1020 (deadlocked) [09:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:02] (03CR) 10Volans: "I'm not familiar with this script, so ignore my comment if I'm missing context." [puppet] - 10https://gerrit.wikimedia.org/r/1131937 (owner: 10Filippo Giunchedi) [09:23:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:24:27] RESOLVED: [4x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:24:44] (03CR) 10Muehlenhoff: [C:03+2] No longer refer to a Phabricator task in the access request caption [software/bitu] - 10https://gerrit.wikimedia.org/r/1131934 (owner: 10Muehlenhoff) [09:27:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1020:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:33:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:33:59] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for acooper - https://phabricator.wikimedia.org/T389924#10685806 (10MoritzMuehlenhoff) @acooper @Scott_French FYI, the ticket field is no longer displayed on idm-test.wikimedia.org and the change will also soon end up in the next Bitu release/p... [09:36:39] ElevatedMaxLagWDQS should not have fired, looking [09:37:09] !log repooling wdqs1020 [09:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1020:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:43:28] (03PS1) 10DCausse: wdqs: fix monitoring user-agents [puppet] - 10https://gerrit.wikimedia.org/r/1131940 [09:43:50] (03PS1) 10Muehlenhoff: Fix the UID for chuckonwumelu [puppet] - 10https://gerrit.wikimedia.org/r/1131941 (https://phabricator.wikimedia.org/T389817) [09:48:59] (03CR) 10Muehlenhoff: [C:03+2] Fix the UID for chuckonwumelu [puppet] - 10https://gerrit.wikimedia.org/r/1131941 (https://phabricator.wikimedia.org/T389817) (owner: 10Muehlenhoff) [09:50:00] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817#10685812 (10MoritzMuehlenhoff) >>! 
In T389817#10683990, @Scott_French wrote: > I also see that the `uid` in that patch is not LDAP `uidNumber` for conwumelu-ctr@... [09:50:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685813 (10phaultfinder) [09:52:54] !log cgoubert@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:53:05] !log benthos-webrequest_live back working with two instances [09:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:54] !log cgoubert@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:54:02] !log cgoubert@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:54:34] !log cgoubert@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:54:41] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [09:55:19] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:55:34] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:55:47] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:56:18] (03Abandoned) 10Elukey: mapnik: upgrade to upstream 4.0.6 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131388 (https://phabricator.wikimedia.org/T389776) (owner: 10Elukey) [09:59:21] (03CR) 10Filippo Giunchedi: pontoon: retry ssh commands (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131937 (owner: 10Filippo Giunchedi) [10:01:42] (03PS2) 10Filippo Giunchedi: hieradata: move profile::acme_chief::certificates to profile [puppet] - 10https://gerrit.wikimedia.org/r/1131270 [10:06:53] (03CR) 10Jelto: [V:03+1 C:03+2] deployment_server: add puppetdb rsync to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1125098 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [10:09:38] (03CR) 10FNegri: "What issue are you seeing? I know Andrew did some work on this so I'd wait for him to +1." 
[puppet] - 10https://gerrit.wikimedia.org/r/1131936 (owner: 10Filippo Giunchedi) [10:10:49] (03PS1) 10Elukey: knative: update build control file to reflect reality [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131945 [10:12:08] (03CR) 10Ilias Sarantopoulos: [C:03+1] knative: update build control file to reflect reality [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131945 (owner: 10Elukey) [10:12:45] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: write config.yaml at stack creation time [puppet] - 10https://gerrit.wikimedia.org/r/1131935 (owner: 10Filippo Giunchedi) [10:12:52] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: misc cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1131938 (owner: 10Filippo Giunchedi) [10:14:24] (03PS2) 10Filippo Giunchedi: pontoon: retry ssh commands [puppet] - 10https://gerrit.wikimedia.org/r/1131937 [10:14:35] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] pontoon: retry ssh commands [puppet] - 10https://gerrit.wikimedia.org/r/1131937 (owner: 10Filippo Giunchedi) [10:14:52] (03PS2) 10Filippo Giunchedi: pontoon: misc cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1131938 [10:14:59] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] pontoon: misc cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1131938 (owner: 10Filippo Giunchedi) [10:20:06] (03CR) 10Elukey: [V:03+2 C:03+2] knative: update build control file to reflect reality [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131945 (owner: 10Elukey) [10:20:48] (03CR) 10Filippo Giunchedi: "Not an issue per-se, more of a cosmetic change to have git-sync-upstream mentioned as the author in the commits as opposed to "Puppet conf" [puppet] - 10https://gerrit.wikimedia.org/r/1131936 (owner: 10Filippo Giunchedi) [10:23:02] (03PS2) 10Filippo Giunchedi: git-sync-upstream: set user name [puppet] - 10https://gerrit.wikimedia.org/r/1131936 [10:23:51] (03PS2) 10Elukey: Release version 4.0.4 [docker-images/docker-pkg] - 
10https://gerrit.wikimedia.org/r/1131666 [10:24:14] (03CR) 10CI reject: [V:04-1] Release version 4.0.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 (owner: 10Elukey) [10:24:22] (03PS3) 10Elukey: Release version 4.0.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 [10:24:38] (03CR) 10CI reject: [V:04-1] Release version 4.0.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 (owner: 10Elukey) [10:24:57] (03CR) 10MVernon: gitlab: rename thanos object storage parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [10:25:10] (03PS2) 10Elukey: Remove support for Python 3.7 and 3.8 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131665 [10:25:10] (03PS4) 10Elukey: Release version 4.0.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 [10:25:25] (03CR) 10CI reject: [V:04-1] Remove support for Python 3.7 and 3.8 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131665 (owner: 10Elukey) [10:25:27] (03CR) 10CI reject: [V:04-1] Release version 4.0.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 (owner: 10Elukey) [10:27:37] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10685898 (10BTullis) p:05Triage→03Medium [10:29:34] (03PS1) 10Filippo Giunchedi: pontoon: default to not force host enroll [puppet] - 10https://gerrit.wikimedia.org/r/1131946 [10:31:24] (03PS1) 10Btullis: Exclude an-worker group1 for hard drive replacement [puppet] - 10https://gerrit.wikimedia.org/r/1131947 (https://phabricator.wikimedia.org/T390168) [10:31:39] (03CR) 10Jelto: [V:03+1] gitlab: rename thanos object storage parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) 
[10:32:25] (03PS3) 10Elukey: Remove support for Python 3.7 and 3.8 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131665 [10:32:25] (03PS5) 10Elukey: Release version 4.0.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 [10:32:39] (03CR) 10CI reject: [V:04-1] Remove support for Python 3.7 and 3.8 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131665 (owner: 10Elukey) [10:32:43] (03CR) 10CI reject: [V:04-1] Release version 4.0.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 (owner: 10Elukey) [10:33:57] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5170/co" [puppet] - 10https://gerrit.wikimedia.org/r/1131947 (https://phabricator.wikimedia.org/T390168) (owner: 10Btullis) [10:34:15] (03PS1) 10Elukey: WIP - test CI [cookbooks] - 10https://gerrit.wikimedia.org/r/1131948 [10:34:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10685911 (10BTullis) p:05Triage→03Medium [10:34:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10685912 (10BTullis) p:05Triage→03Medium [10:34:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10685913 (10BTullis) p:05Triage→03Medium [10:34:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10685915 (10BTullis) p:05Triage→03Medium [10:34:43] 10ops-eqiad, 06SRE, 06DC-Ops, 
10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10685914 (10BTullis) p:05Triage→03Medium [10:34:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10685916 (10BTullis) p:05Triage→03Medium [10:34:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10685918 (10BTullis) p:05Triage→03Medium [10:35:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10685917 (10BTullis) p:05Triage→03Medium [10:35:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10685919 (10BTullis) p:05Triage→03Medium [10:35:41] !log jelto@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:36:27] (03CR) 10Btullis: [V:03+1 C:03+2] Exclude an-worker group1 for hard drive replacement [puppet] - 10https://gerrit.wikimedia.org/r/1131947 (https://phabricator.wikimedia.org/T390168) (owner: 10Btullis) [10:36:33] !log jelto@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:38:57] !log jelto@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [10:39:28] !log jelto@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[10:41:03] !log apply admin_ng external-services to add puppetdb endpoints - T350794 [10:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:08] T350794: move os-reports.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T350794 [10:41:31] !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:41:48] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:43:13] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [10:43:44] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:50:08] (03PS4) 10DCausse: cirrus-streaming-updater: stop consuming from legacy streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124486 (https://phabricator.wikimedia.org/T375821) [10:53:11] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10685944 (10BTullis) [10:56:30] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10685947 (10BTullis) Confirmed from the HDFS [[https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Ha... [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250328T0700) [11:00:05] jelto, arnoldokoth, and mutante: I, the Bot under the Fountain, call upon thee, The Deployer, to do GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250328T1100). [11:02:04] (03CR) 10Elukey: "I am still not getting why sdist-make runs on python3.7, when the pyenvs specified are 3.9+." 
[docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131665 (owner: 10Elukey) [11:05:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685954 (10phaultfinder) [11:10:47] (03PS1) 10Jelto: miscweb: os-report: use puppetdb from external_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131952 (https://phabricator.wikimedia.org/T350794) [11:11:48] !log jnuche@deploy1003 Installing scap version "4.147.0" for 2 host(s) [11:12:50] !log jnuche@deploy1003 Installation of scap version "4.147.0" completed for 2 hosts [11:19:30] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [11:19:49] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [11:19:59] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [11:20:54] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [11:20:55] (03CR) 10Sergio Gimeno: [C:03+1] "I thought his change depends on the flag removal from Ie728f6be159e8e5747560cf2fdc263c39ecc60e5, but since the extension.json defaults are" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131696 (https://phabricator.wikimedia.org/T379566) (owner: 10Cyndywikime) [11:22:10] (03PS1) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) [11:24:24] (03CR) 10Jelto: [C:03+1] "The correct environments are `-e aux-k8s-eqiad` and `-e aux-k8s-codfw` btw." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [11:27:25] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:28:35] (03PS1) 10Jelto: miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) [11:29:51] (03CR) 10CI reject: [V:04-1] sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [11:30:20] (03CR) 10CI reject: [V:04-1] miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [11:36:07] (03CR) 10FNegri: [C:03+1] "Ah right I see, I misread this as being about another "root vs gitpuppet" problem, but it's only a cosmetic change in the git commit. I'm " [puppet] - 10https://gerrit.wikimedia.org/r/1131936 (owner: 10Filippo Giunchedi) [11:49:57] (03CR) 10Elukey: "This works with skipsdist=True in tox.ini, it correctly uses the right pyenv and everything works. 
Why sdists uses another pyenv version (" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131665 (owner: 10Elukey) [11:54:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10686174 (10phaultfinder) [11:58:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [12:02:59] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@a233b0d]: bump SEAL to v0.5.0 [12:03:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [12:04:03] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@a233b0d]: bump SEAL to v0.5.0 (duration: 01m 14s) [12:07:13] (03CR) 10Filippo Giunchedi: [C:03+2] git-sync-upstream: set user name [puppet] - 10https://gerrit.wikimedia.org/r/1131936 (owner: 10Filippo Giunchedi) [12:07:19] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: default to not force host enroll [puppet] - 10https://gerrit.wikimedia.org/r/1131946 (owner: 10Filippo Giunchedi) [12:22:36] FIRING: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [12:27:36] RESOLVED: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad - 
https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [12:29:25] (03PS1) 10Filippo Giunchedi: team-o11y: test disk space prediction for logstash/opensearch [alerts] - 10https://gerrit.wikimedia.org/r/1131964 [12:32:27] FIRING: [2x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:33:11] (03PS1) 10Btullis: Add a cleanup timer for old dumps webrequest logs [puppet] - 10https://gerrit.wikimedia.org/r/1131965 (https://phabricator.wikimedia.org/T390123) [12:34:59] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5171/co" [puppet] - 10https://gerrit.wikimedia.org/r/1131965 (https://phabricator.wikimedia.org/T390123) (owner: 10Btullis) [12:37:27] RESOLVED: [2x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:47:36] FIRING: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - 
https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [12:52:01] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [12:57:36] RESOLVED: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [13:02:08] (03PS1) 10Hnowlan: api-gateway: add networkpolicy entry for rdb1011 AAAA record [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131967 [13:12:01] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [13:12:31] (03CR) 10DCausse: [C:03+1] "lgtm (please feel free to merge, I don't have +2 perms on this repo)" [software/elasticsearch/madvise] - 10https://gerrit.wikimedia.org/r/1131796 (https://phabricator.wikimedia.org/T390118) (owner: 10Ebernhardson) [13:14:08] (03PS2) 10Jelto: miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) [13:17:45] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:17:45] (03PS1) 10Kosta Harlan: IPReputation: Disable CAPTCHA on Special:UserLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131968 (https://phabricator.wikimedia.org/T379178) [13:19:56] jouncebot: nowandnext [13:19:56] For the next 17 hour(s) and 40 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250328T0700) [13:19:56] In 17 hour(s) and 40 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250329T0700) [13:20:08] Reedy and I need to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1131968 [13:22:13] let me restart the CI Jenkins :) [13:22:23] I am waiting for a job in gate and submit to complete [13:22:33] (03PS2) 10Jelto: miscweb: os-report: use puppetdb from external_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131952 (https://phabricator.wikimedia.org/T350794) [13:22:39] hashar: thanks [13:22:41] (03PS3) 10Jelto: miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) [13:22:45] RESOLVED: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:22:56] kostajh: Is that, reedy needs to deploy it? ;) [13:23:16] I can do it, but I am also going to interview someone in ~35 minutes so probably best if it's not me [13:23:53] !log Restarted CI Jenkins [13:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:01] (03PS4) 10Jelto: miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) [13:28:42] (03CR) 10CI reject: [V:04-1] miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [13:32:40] hashar: are you able to deploy it? 
[13:33:00] (03PS5) 10Jelto: miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) [13:33:20] I can deploy it, if hashar tells me when java has restarted [13:33:31] oh CI is back yes [13:33:53] I am not sure what the config patch is doing then it looks straightforward :) [13:34:01] I am around if needed [13:34:01] (03CR) 10Reedy: [C:03+2] IPReputation: Disable CAPTCHA on Special:UserLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131968 (https://phabricator.wikimedia.org/T379178) (owner: 10Kosta Harlan) [13:34:51] (03Merged) 10jenkins-bot: IPReputation: Disable CAPTCHA on Special:UserLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131968 (https://phabricator.wikimedia.org/T379178) (owner: 10Kosta Harlan) [13:39:04] 10ops-codfw, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390008#10686400 (10phaultfinder) [13:39:15] Reedy: thanks [13:46:04] (03CR) 10CI reject: [V:04-1] miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [13:48:48] !log reedy@deploy1003 Synchronized wmf-config/CommonSettings.php: Disable CAPTCHA on Special:UserLogin (duration: 11m 52s) [13:48:57] (03PS6) 10Jelto: miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) [13:50:45] (03CR) 10CI reject: [V:04-1] miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [13:52:48] Reedy: I can test it when it's on mwdebug [13:54:54] (03CR) 10Jelto: "@kamila do you have an idea why linting fails for `aux-k8s-codfw`? `aux-k8s-eqiad` is fine but as soon as I add codfw, linting fails. 
The " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [14:13:25] (03CR) 10Clément Goubert: [C:03+1] api-gateway: add networkpolicy entry for rdb1011 AAAA record [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131967 (owner: 10Hnowlan) [14:16:13] (03Abandoned) 10Elukey: WIP - test CI [cookbooks] - 10https://gerrit.wikimedia.org/r/1131948 (owner: 10Elukey) [14:30:52] (03CR) 10Kamila Součková: "@jwodstrcil@wikimedia.org Looks like I left miscweb out of the hotfix for T388969, sorry about that. I hadn't realised there were more cha" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [14:34:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10686597 (10phaultfinder) [14:36:08] (03PS1) 10Ahmon Dancy: P:idp Limit groups sent from CAS to Spiderpig (redo) [puppet] - 10https://gerrit.wikimedia.org/r/1131975 (https://phabricator.wikimedia.org/T389869) [14:36:45] (03CR) 10Ahmon Dancy: "Retry" [puppet] - 10https://gerrit.wikimedia.org/r/1131975 (https://phabricator.wikimedia.org/T389869) (owner: 10Ahmon Dancy) [14:40:00] (03PS1) 10SCherukuwada: Add hibp txt verification record. [dns] - 10https://gerrit.wikimedia.org/r/1131976 [14:40:31] (03CR) 10CI reject: [V:04-1] Add hibp txt verification record. [dns] - 10https://gerrit.wikimedia.org/r/1131976 (owner: 10SCherukuwada) [14:41:58] (03PS2) 10Filippo Giunchedi: team-o11y: test disk space prediction for logstash/opensearch [alerts] - 10https://gerrit.wikimedia.org/r/1131964 [14:42:12] (03PS2) 10SCherukuwada: Add DNS verification records for HaveIBeenPwned.com. [dns] - 10https://gerrit.wikimedia.org/r/1131976 (https://phabricator.wikimedia.org/T389727) [14:42:47] (03CR) 10CI reject: [V:04-1] Add DNS verification records for HaveIBeenPwned.com. 
[dns] - 10https://gerrit.wikimedia.org/r/1131976 (https://phabricator.wikimedia.org/T389727) (owner: 10SCherukuwada) [14:44:48] (03CR) 10Kamila Součková: "Sorry, that does not appear to be true, I'm not quite sure what's going on. Looking into it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [14:45:52] Reedy: is it live? [14:46:28] (03PS1) 10SCherukuwada: Fixed tabs to spaces. [dns] - 10https://gerrit.wikimedia.org/r/1131978 [14:48:22] (03PS1) 10Scott French: admin: clarify UID requirements for new production users [puppet] - 10https://gerrit.wikimedia.org/r/1131979 (https://phabricator.wikimedia.org/T389817) [14:51:33] (03CR) 10Fabfur: [C:03+1] Fixed tabs to spaces. [dns] - 10https://gerrit.wikimedia.org/r/1131978 (owner: 10SCherukuwada) [14:53:37] (03CR) 10Muehlenhoff: "Thanks for updating the notes, some suggestions inline" [puppet] - 10https://gerrit.wikimedia.org/r/1131979 (https://phabricator.wikimedia.org/T389817) (owner: 10Scott French) [14:58:10] !log fab@deploy1003 Started deploy [airflow-dags/research@b5ce354]: (no justification provided) [14:58:45] !log fab@deploy1003 Finished deploy [airflow-dags/research@b5ce354]: (no justification provided) (duration: 00m 39s) [15:00:35] (03PS1) 10SCherukuwada: Add DNS verification records for HaveIBeenPwned.com. [dns] - 10https://gerrit.wikimedia.org/r/1131981 (https://phabricator.wikimedia.org/T389727) [15:00:50] 06SRE: docker-registry.wikimedia.org is storing a bad blob - https://phabricator.wikimedia.org/T390251#10686637 (10taavi) [15:01:48] (03CR) 10Fabfur: [C:03+1] "lgtm!" 
[dns] - 10https://gerrit.wikimedia.org/r/1131981 (https://phabricator.wikimedia.org/T389727) (owner: 10SCherukuwada) [15:04:04] !log fabfur@dns1004 START - running authdns-update [15:06:08] (03PS2) 10Scott French: admin: clarify UID requirements for new production users [puppet] - 10https://gerrit.wikimedia.org/r/1131979 (https://phabricator.wikimedia.org/T389817) [15:06:13] !log fabfur@dns1004 END - running authdns-update [15:06:38] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1131979 (https://phabricator.wikimedia.org/T389817) (owner: 10Scott French) [15:07:04] (03CR) 10Fabfur: [C:03+2] Add DNS verification records for HaveIBeenPwned.com. [dns] - 10https://gerrit.wikimedia.org/r/1131981 (https://phabricator.wikimedia.org/T389727) (owner: 10SCherukuwada) [15:07:07] kostajh: Has been for ~80 mins [15:07:13] !log fabfur@dns1004 START - running authdns-update [15:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:24] !log fabfur@dns1004 END - running authdns-update [15:10:35] (03CR) 10Muehlenhoff: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1131979 (https://phabricator.wikimedia.org/T389817) (owner: 10Scott French) [15:13:20] (03PS1) 10SCherukuwada: Remove the HaveIBeenPwned TXT verification entry. [dns] - 10https://gerrit.wikimedia.org/r/1131982 [15:13:36] (03PS2) 10SCherukuwada: Remove the HaveIBeenPwned TXT verification entry. [dns] - 10https://gerrit.wikimedia.org/r/1131982 [15:13:57] (03CR) 10Hnowlan: [C:03+2] api-gateway: add networkpolicy entry for rdb1011 AAAA record [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131967 (owner: 10Hnowlan) [15:14:22] (03CR) 10Fabfur: [C:03+2] Remove the HaveIBeenPwned TXT verification entry. 
[dns] - 10https://gerrit.wikimedia.org/r/1131982 (owner: 10SCherukuwada) [15:14:50] !log fabfur@dns1004 START - running authdns-update [15:15:25] (03Merged) 10jenkins-bot: api-gateway: add networkpolicy entry for rdb1011 AAAA record [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131967 (owner: 10Hnowlan) [15:17:02] !log fabfur@dns1004 END - running authdns-update [15:18:11] !log uploaded wmf-laptop 1.0.1 to apt.wikimedia.org [15:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:48] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:19:50] (03PS1) 10Fabfur: HIBP verification code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131983 (https://phabricator.wikimedia.org/T389727) [15:19:54] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:22:28] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [15:22:34] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [15:23:03] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [15:23:14] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [15:23:18] (03CR) 10Kamila Součková: "I believe some bits of CI are missing k8s-aux-codfw, I will see about adding it and rebasing this on top." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [15:24:51] (03CR) 10Giuseppe Lavagetto: [C:03+1] HIBP verification code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131983 (https://phabricator.wikimedia.org/T389727) (owner: 10Fabfur) [15:26:43] <_joe_> jouncebot: nowandnext [15:26:43] For the next 15 hour(s) and 33 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250328T0700) [15:26:43] In 15 hour(s) and 33 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250329T0700) [15:26:51] (03CR) 10SCherukuwada: [C:03+2] HIBP verification code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131983 (https://phabricator.wikimedia.org/T389727) (owner: 10Fabfur) [15:27:25] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:27:44] (03Merged) 10jenkins-bot: HIBP verification code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131983 (https://phabricator.wikimedia.org/T389727) (owner: 10Fabfur) [15:28:38] !log oblivian@deploy1003 Started scap sync-world: Backport for [[gerrit:1131983|HIBP verification code (T389727)]] [15:29:29] !log bking@apt1002 publish wmf-opensearch-search-plugins-1.3.20-3 to component/opensearch13 bullseye-wikimedia T390100 [15:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:34] T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100 [15:31:26] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1131979 (https://phabricator.wikimedia.org/T389817) (owner: 10Scott French) [15:31:28] (03CR) 10Scott French: [C:03+2] admin: clarify UID requirements for new production users [puppet] - 10https://gerrit.wikimedia.org/r/1131979 (https://phabricator.wikimedia.org/T389817) (owner: 10Scott French) [15:33:40] !log oblivian@deploy1003 oblivian, fabfur: Backport for [[gerrit:1131983|HIBP verification code (T389727)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:34:07] !log oblivian@deploy1003 oblivian, fabfur: Continuing with sync [15:37:06] (03PS2) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) [15:37:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:22] (03CR) 10Federico Ceratto: "Basic test done with dry-run. I'll finish the functional test." [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [15:40:52] !log oblivian@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131983|HIBP verification code (T389727)]] (duration: 12m 14s) [15:41:52] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186#10686744 (10Scott_French) 05Open→03Resolved @AStein-WMF - I'm going to re-resolve this. Please re-open if you have iss... 
[15:42:03] (03CR) 10Elukey: "I printed the environment right when sdist runs, there are the PYENV vars:" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131665 (owner: 10Elukey) [15:43:48] (03CR) 10CI reject: [V:04-1] sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [15:45:30] (03PS3) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) [15:46:49] (03PS1) 10Nik Gkountas: SpecialTranslationTargetLanguages: Use cxserver-supported language codes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131990 (https://phabricator.wikimedia.org/T390300) [16:01:39] (03PS1) 10Hashar: Fix removal of Gerrit json prefix [software/bitu] - 10https://gerrit.wikimedia.org/r/1131991 [16:13:47] (03PS4) 10Hashar: Simplify invocation of clients integrations [software/bitu] - 10https://gerrit.wikimedia.org/r/1131460 [16:14:15] (03CR) 10Ebernhardson: "updated per review, but also realized this is no longer the correct repo. 
Will abandon this patch and continue at https://gitlab.wikimedia" [software/elasticsearch/madvise] - 10https://gerrit.wikimedia.org/r/1131796 (https://phabricator.wikimedia.org/T390118) (owner: 10Ebernhardson) [16:14:26] (03Abandoned) 10Ebernhardson: Accept data path as a cli arg [software/elasticsearch/madvise] - 10https://gerrit.wikimedia.org/r/1131796 (https://phabricator.wikimedia.org/T390118) (owner: 10Ebernhardson) [16:24:29] (03CR) 10Jdlrobson: "I think I'm doing something wrong in testing this but i'm getting:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia) [16:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10686872 (10phaultfinder) [16:27:12] 06SRE: Trouble reaching Microsoft email domains - https://phabricator.wikimedia.org/T390307 (10nisrael) 03NEW [16:36:43] (03CR) 10Ebernhardson: [C:03+1] "looks like we are ready to deploy this next week" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124486 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [16:37:16] (03PS1) 10Reedy: OATHAuth: Mark interface-admin as requiring 2FA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131997 [16:44:32] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: apply updated master config - bking@cumin2002 - T390100 [16:44:33] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: apply updated master config - bking@cumin2002 - T390100 [16:44:37] T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100 [16:47:11] (03CR) 10Ladsgroup: [C:03+1] OATHAuth: Mark interface-admin as requiring 2FA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131997 (owner: 10Reedy) 
[16:47:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131384 (https://phabricator.wikimedia.org/T389407) (owner: 10BPirkle) [16:57:01] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [16:57:10] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10687015 (10Marostegui) The RAID got back to Optimal. ` VD LIST : ======= -------------------------------------------------------------- DG/VD TYPE State Access Consist Cache... [17:12:31] 06SRE: docker-registry.wikimedia.org is storing a bad blob - https://phabricator.wikimedia.org/T390251#10687068 (10Scott_French) Adding more complete logs today from the codfw (primary) registry hosts, before they rotate out of the journal. This is the result of searching for logs involving the `e7b2287766dc2a9... 
[17:17:01] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:33:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131997 (owner: 10Reedy) [17:36:48] (03Merged) 10jenkins-bot: OATHAuth: Mark interface-admin as requiring 2FA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131997 (owner: 10Reedy) [17:37:00] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1131997|OATHAuth: Mark interface-admin as requiring 2FA]] [17:41:43] !log ladsgroup@deploy1003 ladsgroup, reedy: Backport for [[gerrit:1131997|OATHAuth: Mark interface-admin as requiring 2FA]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:43:41] !log ladsgroup@deploy1003 ladsgroup, reedy: Continuing with sync [17:50:44] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131997|OATHAuth: Mark interface-admin as requiring 2FA]] (duration: 13m 43s) [18:30:52] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10687257 (10Andrew) I've worked on this a bit more; here's what we have today: **The good** * http://ec2-52-23-161-9.compute-1.amazonaws.com * Tha... 
[18:37:59] (03CR) 10Federico Ceratto: [C:03+1] clone.py: Add basic functional test [cookbooks] - 10https://gerrit.wikimedia.org/r/1131750 (https://phabricator.wikimedia.org/T388383) (owner: 10Federico Ceratto) [18:38:01] (03CR) 10Federico Ceratto: [C:03+2] clone.py: Add basic functional test [cookbooks] - 10https://gerrit.wikimedia.org/r/1131750 (https://phabricator.wikimedia.org/T388383) (owner: 10Federico Ceratto) [18:45:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 31 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131483 (https://phabricator.wikimedia.org/T387155) (owner: 10Jdlrobson) [18:45:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 31 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131484 (https://phabricator.wikimedia.org/T390112) (owner: 10Jdlrobson) [18:49:27] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1128779 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [19:03:59] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Q4: install ssds into ganeti105[34] - https://phabricator.wikimedia.org/T390319 (10RobH) 03NEW [19:04:08] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Q4: install ssds into ganeti105[34] - https://phabricator.wikimedia.org/T390319#10687359 (10RobH) [19:05:52] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Q4: install ssds into ganeti105[34] - https://phabricator.wikimedia.org/T390319#10687368 (10RobH) @MoritzMuehlenhoff, Please advise if these hosts are under user load or not currently pooled? I ask because if they are under load, and this is our first ex... 
[19:07:28] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Q4: install ssds into ganeti20[45-50] - https://phabricator.wikimedia.org/T390320 (10RobH) 03NEW [19:07:35] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Q4: install ssds into ganeti20[45-50] - https://phabricator.wikimedia.org/T390320#10687391 (10RobH) [19:08:02] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Q4: install ssds into ganeti20[45-50] - https://phabricator.wikimedia.org/T390320#10687396 (10RobH) @MoritzMuehlenhoff, Please advise if these hosts are under user load or not currently pooled? I ask because if they are under load, and this is our first... [19:12:22] (03PS5) 10Hashar: Simplify invocation of clients integrations [software/bitu] - 10https://gerrit.wikimedia.org/r/1131460 [19:12:22] (03PS1) 10Hashar: Add a basic test for user_block in LDAP [software/bitu] - 10https://gerrit.wikimedia.org/r/1132019 [19:14:10] (03PS2) 10Hashar: Fix handling of status code in Gerrit integration [software/bitu] - 10https://gerrit.wikimedia.org/r/1131471 [19:14:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131990 (https://phabricator.wikimedia.org/T390300) (owner: 10Nik Gkountas) [19:15:12] !log fab@deploy1003 Started deploy [airflow-dags/research@b5ce354]: (no justification provided) [19:15:54] !log fab@deploy1003 Finished deploy [airflow-dags/research@b5ce354]: (no justification provided) (duration: 00m 43s) [19:18:09] (03PS1) 10Gergő Tisza: Enable SUL3 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132020 (https://phabricator.wikimedia.org/T384220) [19:18:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1132020 (https://phabricator.wikimedia.org/T384220) (owner: 10Gergő Tisza) [19:27:26] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:27:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2311.codfw.wmnet with OS bookworm [19:27:48] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687479 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2311.codfw.wmnet with... [19:27:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2312.codfw.wmnet with OS bookworm [19:28:01] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687480 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2312.codfw.wmnet with... [19:28:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2313.codfw.wmnet with OS bookworm [19:28:13] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687481 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2313.codfw.wmnet with... 
[19:28:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2314.codfw.wmnet with OS bookworm [19:28:22] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687482 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2314.codfw.wmnet with... [19:28:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2315.codfw.wmnet with OS bookworm [19:28:41] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687484 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2315.codfw.wmnet with... [19:31:41] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: apply updated master config - bking@cumin2002 - T390100 [19:31:41] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: apply updated master config - bking@cumin2002 - T390100 [19:31:45] T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100 [19:31:59] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4: install ssds into ganeti20[45-50] - https://phabricator.wikimedia.org/T390320#10687506 (10MoritzMuehlenhoff) You can do it anytime, these are not yet in production and when the new SSDs are in, they will be reimaged. 
[19:32:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4: install ssds into ganeti105[34] - https://phabricator.wikimedia.org/T390319#10687507 (10MoritzMuehlenhoff) You can do it anytime, these are not yet in production and when the new SSDs are in, they will be reimaged. [19:36:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4: install ssds into ganeti105[34] - https://phabricator.wikimedia.org/T390319#10687515 (10RobH) [19:36:12] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4: install ssds into ganeti20[45-50] - https://phabricator.wikimedia.org/T390320#10687516 (10RobH) [19:39:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2311.codfw.wmnet with reason: host reimage [19:39:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2312.codfw.wmnet with reason: host reimage [19:39:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2313.codfw.wmnet with reason: host reimage [19:39:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2314.codfw.wmnet with reason: host reimage [19:40:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2315.codfw.wmnet with reason: host reimage [19:42:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2311.codfw.wmnet with reason: host reimage [19:45:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2314.codfw.wmnet with reason: host reimage [19:48:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2315.codfw.wmnet with reason: host reimage [19:52:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2312.codfw.wmnet with reason: host 
reimage [19:56:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2313.codfw.wmnet with reason: host reimage [19:57:14] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:02:45] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:06:26] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:06:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2311.codfw.wmnet with OS bookworm [20:06:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:06:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2314.codfw.wmnet with OS bookworm [20:06:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687577 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2311.codfw.wmnet with OS... [20:06:39] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2314.codfw.wmnet with OS... 
[20:06:44] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:07:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:07:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2315.codfw.wmnet with OS bookworm [20:07:18] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2315.codfw.wmnet with OS... [20:09:25] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:09:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:09:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2312.codfw.wmnet with OS bookworm [20:09:54] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687595 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2312.codfw.wmnet with OS... 
[20:12:34] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:17:22] (03PS1) 10Bking: relforge: enable rack awareness [puppet] - 10https://gerrit.wikimedia.org/r/1132024 (https://phabricator.wikimedia.org/T383811) [20:17:52] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132024 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking) [20:19:56] (03CR) 10Bking: [C:03+2] relforge: enable rack awareness [puppet] - 10https://gerrit.wikimedia.org/r/1132024 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking) [20:20:20] (03CR) 10Bking: [C:03+2] "Self-merging, as this does not touch a production environment." [puppet] - 10https://gerrit.wikimedia.org/r/1132024 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking) [20:20:36] (03PS1) 10Kosta Harlan: Introduce configuration to deny logins from unknown systems [extensions/LoginNotify] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132025 (https://phabricator.wikimedia.org/T390315) [20:21:24] (03CR) 10Máté Szabó: [C:03+1] Introduce configuration to deny logins from unknown systems [extensions/LoginNotify] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132025 (https://phabricator.wikimedia.org/T390315) (owner: 10Kosta Harlan) [20:27:27] 06SRE, 06Data-Engineering, 10Observability-Logging, 10Wikimedia-Logstash, 10Event-Platform: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T291645#10687659 (10Ottomata) [20:29:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10687666 (10phaultfinder) [20:30:03] (03PS1) 10Máté Szabó: Configure LoginNotify deny functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132027 (https://phabricator.wikimedia.org/T390315) [20:32:59] (03CR) 
10Gergő Tisza: Configure LoginNotify deny functionality (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132027 (https://phabricator.wikimedia.org/T390315) (owner: 10Máté Szabó) [20:34:11] (03PS2) 10Máté Szabó: Configure LoginNotify deny functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132027 (https://phabricator.wikimedia.org/T390315) [20:34:30] (03CR) 10Máté Szabó: Configure LoginNotify deny functionality (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132027 (https://phabricator.wikimedia.org/T390315) (owner: 10Máté Szabó) [20:38:47] (03CR) 10Kosta Harlan: [C:03+1] Configure LoginNotify deny functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132027 (https://phabricator.wikimedia.org/T390315) (owner: 10Máté Szabó) [20:38:57] jouncebot: nowandnext [20:38:57] For the next 10 hour(s) and 21 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250328T0700) [20:38:58] In 10 hour(s) and 21 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250329T0700) [20:39:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10687703 (10phaultfinder) [20:40:20] (03PS1) 10Gergő Tisza: Add LoginNotify to disallowed local providers [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132029 [20:41:00] (03CR) 10Máté Szabó: [C:03+1] Add LoginNotify to disallowed local providers [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132029 (owner: 10Gergő Tisza) [20:47:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:47:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2313.codfw.wmnet with OS bookworm [20:47:48] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687709 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2313.codfw.wmnet with OS... [20:48:11] (03CR) 10Máté Szabó: "wmgMonologChannels sets LoginNotify to `info`, so no changes there should be needed to pick up logs from the PAP." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132027 (https://phabricator.wikimedia.org/T390315) (owner: 10Máté Szabó) [20:48:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2316.codfw.wmnet with OS bookworm [20:48:48] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687711 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2316.codfw.wmnet with... 
[20:49:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2317.codfw.wmnet with OS bookworm [20:49:10] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2317.codfw.wmnet with... [20:49:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2318.codfw.wmnet with OS bookworm [20:49:18] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2318.codfw.wmnet with... [20:49:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2319.codfw.wmnet with OS bookworm [20:50:00] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687715 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2319.codfw.wmnet with... [20:50:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2320.codfw.wmnet with OS bookworm [20:50:09] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687716 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2320.codfw.wmnet with... 
[20:59:51] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10687730 (10BTullis) It's looking like about 12 hours to evacuate 3 Hadoop workers. {F58937048} That's not bad at all. [21:00:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2316.codfw.wmnet with reason: host reimage [21:00:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2317.codfw.wmnet with reason: host reimage [21:00:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2318.codfw.wmnet with reason: host reimage [21:01:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2319.codfw.wmnet with reason: host reimage [21:01:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2320.codfw.wmnet with reason: host reimage [21:02:01] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [21:03:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2316.codfw.wmnet with reason: host reimage [21:05:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2317.codfw.wmnet with reason: host reimage [21:08:04] Hey all - for the ongoing sec incident, we’d like to get a new feature for CentralAuth deployed. It will be default-disabled but we’d like it to be available to enable if we see any attack traffic over the weekend. [21:08:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2320.codfw.wmnet with reason: host reimage [21:08:54] thcipriani: wanna give sbassett the ok? 
^ [21:09:46] main patch (LoginNotify): https://gerrit.wikimedia.org/r/1132025 [21:10:03] config patch: https://gerrit.wikimedia.org/r/1132027 [21:10:26] and maybe this CA change: https://gerrit.wikimedia.org/r/1132029 [21:12:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2319.codfw.wmnet with reason: host reimage [21:16:26] sbassett: Go ahead [21:16:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2318.codfw.wmnet with reason: host reimage [21:17:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132027 (https://phabricator.wikimedia.org/T390315) (owner: 10Máté Szabó) [21:17:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132029 (owner: 10Gergő Tisza) [21:17:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy1003 using scap backport" [extensions/LoginNotify] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132025 (https://phabricator.wikimedia.org/T390315) (owner: 10Kosta Harlan) [21:18:02] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:18:18] (03Merged) 10jenkins-bot: Configure LoginNotify deny functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132027 (https://phabricator.wikimedia.org/T390315) (owner: 10Máté Szabó) [21:20:04] (03CR) 10CI reject: [V:04-1] Add LoginNotify to disallowed local providers [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132029 (owner: 10Gergő Tisza) [21:20:23] (03CR) 10SBassett: "recheck" [extensions/LoginNotify] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132025 
(https://phabricator.wikimedia.org/T390315) (owner: 10Kosta Harlan) [21:20:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:20:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2316.codfw.wmnet with OS bookworm [21:20:34] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:20:38] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687760 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2316.codfw.wmnet with OS... [21:20:40] (03CR) 10SBassett: "recheck" [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132029 (owner: 10Gergő Tisza) [21:21:48] (03CR) 10CI reject: [V:04-1] Introduce configuration to deny logins from unknown systems [extensions/LoginNotify] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132025 (https://phabricator.wikimedia.org/T390315) (owner: 10Kosta Harlan) [21:22:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [21:24:12] (03CR) 10SBassett: "recheck" [extensions/LoginNotify] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132025 (https://phabricator.wikimedia.org/T390315) (owner: 10Kosta Harlan) [21:25:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:25:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2317.codfw.wmnet with OS bookworm 
[21:25:45] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:25:51] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2317.codfw.wmnet with OS... [21:26:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:26:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2320.codfw.wmnet with OS bookworm [21:26:21] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687796 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2320.codfw.wmnet with OS... 
[21:27:59] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:29:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10687798 (10phaultfinder) [21:30:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:30:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2319.codfw.wmnet with OS bookworm [21:30:40] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687805 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2319.codfw.wmnet with OS... [21:32:18] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:36:31] Just FYI - we’re seeing some strange wmf.22 test failures for CentralAuth tests for quibble-php74-noselenium. If we can debug those quickly, I’d like to resume my aforementioned security deploy. [21:37:20] (03CR) 10Brennen Bearnes: "At a glance, these look like actual test assertion failures rather than CI flakiness..." 
[extensions/LoginNotify] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132025 (https://phabricator.wikimedia.org/T390315) (owner: 10Kosta Harlan) [21:39:30] (03CR) 10Krinkle: [C:03+2] Add LoginNotify to disallowed local providers [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132029 (owner: 10Gergő Tisza) [21:40:45] (03Merged) 10jenkins-bot: Add LoginNotify to disallowed local providers [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132029 (owner: 10Gergő Tisza) [21:43:18] (03CR) 10Kosta Harlan: "recheck" [extensions/LoginNotify] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132025 (https://phabricator.wikimedia.org/T390315) (owner: 10Kosta Harlan) [21:44:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [21:44:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2318.codfw.wmnet with OS bookworm [21:44:17] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687817 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2318.codfw.wmnet with OS... [21:45:27] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687820 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2321.codfw.wmnet with... 
[21:45:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2322.codfw.wmnet with OS bookworm [21:45:43] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687821 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2322.codfw.wmnet with... [21:45:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2323.codfw.wmnet with OS bookworm [21:45:52] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687822 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2323.codfw.wmnet with... [21:45:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2324.codfw.wmnet with OS bookworm [21:46:05] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687823 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2324.codfw.wmnet with... [21:46:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2325.codfw.wmnet with OS bookworm [21:46:17] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687824 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2325.codfw.wmnet with... 
[21:46:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy1003 using scap backport" [extensions/LoginNotify] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132025 (https://phabricator.wikimedia.org/T390315) (owner: 10Kosta Harlan) [21:53:56] (03CR) 10Jdlrobson: [C:03+1] Stream registration for article summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: 10Kimberly Sarabia) [21:54:49] (03Merged) 10jenkins-bot: Introduce configuration to deny logins from unknown systems [extensions/LoginNotify] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132025 (https://phabricator.wikimedia.org/T390315) (owner: 10Kosta Harlan) [21:55:51] !log sbassett@deploy1003 Started scap sync-world: Backport for [[gerrit:1132029|Add LoginNotify to disallowed local providers]], [[gerrit:1132025|Introduce configuration to deny logins from unknown systems (T390315)]], [[gerrit:1132027|Configure LoginNotify deny functionality (T390315)]] [21:55:55] T390315: LoginNotify: Provide configuration to deny login if attempt is made from new IP - https://phabricator.wikimedia.org/T390315 [21:57:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2321.codfw.wmnet with reason: host reimage [21:57:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2322.codfw.wmnet with reason: host reimage [21:57:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2323.codfw.wmnet with reason: host reimage [21:57:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2325.codfw.wmnet with reason: host reimage [22:01:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2324.codfw.wmnet with reason: host reimage [22:02:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
wikikube-worker2321.codfw.wmnet with reason: host reimage [22:05:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2324.codfw.wmnet with reason: host reimage [22:08:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2325.codfw.wmnet with reason: host reimage [22:10:29] !log sbassett@deploy1003 sbassett, tgr, mszabo, kharlan: Backport for [[gerrit:1132029|Add LoginNotify to disallowed local providers]], [[gerrit:1132025|Introduce configuration to deny logins from unknown systems (T390315)]], [[gerrit:1132027|Configure LoginNotify deny functionality (T390315)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:10:34] T390315: LoginNotify: Provide configuration to deny login if attempt is made from new IP - https://phabricator.wikimedia.org/T390315 [22:12:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2322.codfw.wmnet with reason: host reimage [22:12:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:12:57] !log sbassett@deploy1003 sbassett, tgr, mszabo, kharlan: Continuing with sync [22:16:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2323.codfw.wmnet with reason: host reimage [22:17:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - 
https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [22:17:55] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:18:12] (03CR) 10Cwhite: [C:03+2] prometheus: remove unused rate2m recording rules for edit count [puppet] - 10https://gerrit.wikimedia.org/r/1131295 (owner: 10Filippo Giunchedi) [22:19:35] (03CR) 10Cwhite: [C:03+1] hieradata: move k8s prometheus1005 -> 1007 [puppet] - 10https://gerrit.wikimedia.org/r/1131301 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [22:23:23] !log sbassett@deploy1003 Finished scap sync-world: Backport for [[gerrit:1132029|Add LoginNotify to disallowed local providers]], [[gerrit:1132025|Introduce configuration to deny logins from unknown systems (T390315)]], [[gerrit:1132027|Configure LoginNotify deny functionality (T390315)]] (duration: 27m 32s) [22:23:28] T390315: LoginNotify: Provide configuration to deny login if attempt is made from new IP - https://phabricator.wikimedia.org/T390315 [22:23:30] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:26:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:26:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2321.codfw.wmnet with OS bookworm [22:27:01] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687911 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2321.codfw.wmnet with OS... 
[22:27:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:27:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2324.codfw.wmnet with OS bookworm [22:27:31] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:27:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687913 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2324.codfw.wmnet with OS... [22:28:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:28:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2322.codfw.wmnet with OS bookworm [22:28:12] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687914 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2322.codfw.wmnet with OS... 
[22:28:24] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:28:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:28:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2325.codfw.wmnet with OS bookworm [22:28:42] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687915 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2325.codfw.wmnet with OS... [22:32:08] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:32:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:32:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2323.codfw.wmnet with OS bookworm [22:33:07] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687932 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2323.codfw.wmnet with OS... 
[22:34:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2326.codfw.wmnet with OS bookworm [22:34:47] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2326.codfw.wmnet with... [22:34:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2327.codfw.wmnet with OS bookworm [22:34:58] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2327.codfw.wmnet with... [22:35:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2328.codfw.wmnet with OS bookworm [22:35:09] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2328.codfw.wmnet with... [22:35:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2329.codfw.wmnet with OS bookworm [22:35:20] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2329.codfw.wmnet with... 
[22:35:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2330.codfw.wmnet with OS bookworm [22:35:41] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10687940 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2330.codfw.wmnet with... [22:42:41] debugging, please do not use scap for mediawiki [22:45:15] (03PS2) 10Cwhite: logstash: stringify 'assignments' from eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/1131672 (https://phabricator.wikimedia.org/T390140) (owner: 10Filippo Giunchedi) [22:46:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2327.codfw.wmnet with reason: host reimage [22:46:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2326.codfw.wmnet with reason: host reimage [22:46:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2328.codfw.wmnet with reason: host reimage [22:46:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2329.codfw.wmnet with reason: host reimage [22:47:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2330.codfw.wmnet with reason: host reimage [22:49:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2327.codfw.wmnet with reason: host reimage [22:50:11] (03CR) 10Cwhite: [C:03+2] logstash: stringify 'assignments' from eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/1131672 (https://phabricator.wikimedia.org/T390140) (owner: 10Filippo Giunchedi) [22:51:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2330.codfw.wmnet with reason: host reimage [22:54:15] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2329.codfw.wmnet with reason: host reimage [22:57:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2328.codfw.wmnet with reason: host reimage [23:00:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2326.codfw.wmnet with reason: host reimage [23:05:15] (done) [23:05:23] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:07:04] (03CR) 10Cwhite: [C:03+1] prometheus: add function to replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128779 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [23:10:37] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:15:39] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:20:45] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:22:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:22:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2326.codfw.wmnet with OS bookworm [23:23:03] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10688015 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2326.codfw.wmnet with OS... [23:23:44] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:23:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [23:23:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2329.codfw.wmnet with OS bookworm [23:23:59] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10688017 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2329.codfw.wmnet with OS... 
[23:24:47] (PS2) Aaron Schulz: services: update codfw changeprop Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588)
[23:27:22] (PS2) Aaron Schulz: services: update codfw changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588)
[23:27:26] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:27:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:27:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2327.codfw.wmnet with OS bookworm
[23:27:46] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:27:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2328.codfw.wmnet with OS bookworm
[23:27:48] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[23:27:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2330.codfw.wmnet with OS bookworm
[23:27:55] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10688022 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2327.codfw.wmnet with OS...
[23:27:56] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10688023 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2328.codfw.wmnet with OS...
[23:27:59] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10688024 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2330.codfw.wmnet with OS...
[23:29:32] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10688025 (Jhancock.wm)
[23:32:26] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10688038 (Jhancock.wm) @Clement_Goubert i finished all but one server (2331). Luca is trying to find a solution to our latest cookbook hurdle with...
[23:36:39] (PS2) Aaron Schulz: services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588)
[23:36:40] (PS1) Aaron Schulz: services: update eqiad changeprop Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1132039 (https://phabricator.wikimedia.org/T381588)
[23:47:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 6.875% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[23:52:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 18.84% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy