[00:02:03] PROBLEM - statsv Varnishkafka log producer on cp7007 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[00:02:13] PROBLEM - statsv Varnishkafka log producer on cp4037 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[00:02:13] PROBLEM - Webrequests Varnishkafka log producer on cp4038 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[00:02:14] PROBLEM - Webrequests Varnishkafka log producer on cp4052 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[00:02:16] PROBLEM - statsv Varnishkafka log producer on cp4038 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[00:02:17] PROBLEM - statsv Varnishkafka log producer on cp7005 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[00:02:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2277.codfw.wmnet with reason: host reimage
[00:03:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2284.codfw.wmnet with reason: host reimage
[00:03:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2285.codfw.wmnet with reason: host reimage
[00:03:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2292.codfw.wmnet with
reason: host reimage
[00:04:53] (CR) Ssingh: [C:+1] upgrade cp5032 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1130746 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[00:06:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2277.codfw.wmnet with reason: host reimage
[00:09:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2292.codfw.wmnet with reason: host reimage
[00:13:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2284.codfw.wmnet with reason: host reimage
[00:16:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2285.codfw.wmnet with reason: host reimage
[00:20:52] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:21:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:21:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2277.codfw.wmnet with OS bookworm
[00:21:30] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676643 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2277.codfw.wmnet with OS...
[00:24:53] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:26:17] (CR) Ladsgroup: "Ping!"
[puppet] - https://gerrit.wikimedia.org/r/1123022 (https://phabricator.wikimedia.org/T363581) (owner: Bvibber)
[00:28:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:28:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2292.codfw.wmnet with OS bookworm
[00:29:02] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676666 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2292.codfw.wmnet with OS...
[00:29:12] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:30:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:30:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2284.codfw.wmnet with OS bookworm
[00:30:56] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2284.codfw.wmnet with OS...
[00:31:25] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:33:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:33:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2285.codfw.wmnet with OS bookworm
[00:33:54] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676670 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2285.codfw.wmnet with OS...
[00:36:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2279.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:37:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2279.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:37:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2280.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:37:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2282.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:38:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2280.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:38:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2283.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:38:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2282.mgmt.codfw.wmnet with chassis set policy
FORCE_RESTART
[00:38:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2283.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[00:38:44] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1131113
[00:38:45] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1131113 (owner: TrainBranchBot)
[00:40:48] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2279.codfw.wmnet with OS bookworm
[00:40:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2280.codfw.wmnet with OS bookworm
[00:41:00] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2279.codfw.wmnet with...
[00:41:03] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2280.codfw.wmnet with...
[00:41:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2282.codfw.wmnet with OS bookworm
[00:41:12] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2282.codfw.wmnet with...
[00:41:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2283.codfw.wmnet with OS bookworm
[00:41:22] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676689 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2283.codfw.wmnet with...
[00:50:34] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1131113 (owner: TrainBranchBot)
[00:52:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2280.codfw.wmnet with reason: host reimage
[00:52:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2282.codfw.wmnet with reason: host reimage
[00:52:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2279.codfw.wmnet with reason: host reimage
[00:52:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2283.codfw.wmnet with reason: host reimage
[00:55:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10676698 (phaultfinder)
[00:56:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2280.codfw.wmnet with reason: host reimage
[00:59:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2283.codfw.wmnet with reason: host reimage
[01:01:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2282.codfw.wmnet with reason: host reimage
[01:04:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2279.codfw.wmnet with reason: host reimage
[01:08:31] (PS3) Bvibber: Add JsonConfig's globaljsonlinks tables to catalog [puppet] - https://gerrit.wikimedia.org/r/1123022 (https://phabricator.wikimedia.org/T363581)
[01:08:37] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1131116
[01:08:37] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1131116 (owner: TrainBranchBot)
[01:10:47] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:10:52] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:10:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2280.codfw.wmnet with OS bookworm
[01:11:02] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676711 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2280.codfw.wmnet with OS...
[01:11:55] !log zabe@mwmaint1002:~$ cat group1.dblist | xargs -I{} bash -c "echo {}; mwscript extensions/AbuseFilter/maintenance/MigrateESRefToAflTable.php {} --deletedump /home/zabe/afl_text_table_deletedump/{} --dump /home/zabe/afl_text_table_dump/{} --sleep 0.3" # T381599
[01:11:56] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676712 (Jhancock.wm)
[01:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:00] T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599
[01:13:15] (PS4) Bvibber: Add JsonConfig's globaljsonlinks tables to catalog [puppet] - https://gerrit.wikimedia.org/r/1123022 (https://phabricator.wikimedia.org/T363581)
[01:13:24] (CR) Bvibber: "confirmed, done in upcoming patchset" [puppet] - https://gerrit.wikimedia.org/r/1123022 (https://phabricator.wikimedia.org/T363581) (owner: Bvibber)
[01:13:28] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:14:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:14:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2283.codfw.wmnet with OS bookworm
[01:14:12] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2283.codfw.wmnet with OS...
[01:15:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10676715 (phaultfinder)
[01:16:19] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:16:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:16:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2282.codfw.wmnet with OS bookworm
[01:16:50] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676716 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2282.codfw.wmnet with OS...
[01:18:38] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:18:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:18:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2279.codfw.wmnet with OS bookworm
[01:18:57] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676719 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2279.codfw.wmnet with OS...
[01:20:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2286.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[01:20:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2287.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[01:20:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2288.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[01:20:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2286.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[01:20:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2289.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[01:20:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2287.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[01:20:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2288.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[01:21:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2289.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[01:23:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2286.codfw.wmnet with OS bookworm
[01:23:58] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676720 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2286.codfw.wmnet with...
[01:24:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2287.codfw.wmnet with OS bookworm
[01:24:07] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676721 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2287.codfw.wmnet with...
[01:24:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2288.codfw.wmnet with OS bookworm
[01:24:21] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676722 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2288.codfw.wmnet with...
[01:24:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2289.codfw.wmnet with OS bookworm
[01:24:38] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676723 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2289.codfw.wmnet with...
[01:28:00] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1131116 (owner: TrainBranchBot)
[01:35:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2287.codfw.wmnet with reason: host reimage
[01:35:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2288.codfw.wmnet with reason: host reimage
[01:35:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2286.codfw.wmnet with reason: host reimage
[01:36:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2289.codfw.wmnet with reason: host reimage
[01:38:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2287.codfw.wmnet with reason: host reimage
[01:41:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2288.codfw.wmnet with reason: host reimage
[01:45:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2286.codfw.wmnet with reason: host reimage
[01:49:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2289.codfw.wmnet with reason: host reimage
[01:52:31] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:53:38] SRE, WMF-General-or-Unknown, Performance Issue, Wikimedia-production-error: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" errors - https://phabricator.wikimedia.org/T389734#10676730 (Krinkle) The Village pump thread continues to have more reports, and certainly conf...
[01:57:33] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:00:44] (PS2) Robertsky: updating wikimaniawiki namespace configurations: add 2027/28 and associated talk namespaces. Enable subpages, visual editor, for years namespaces, update wgContentNamespaces with years namespaces, update default site search to 2025 namespace. [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729)
[02:02:38] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:02:39] (PS3) Robertsky: updating wikimaniawiki namespace configurations: add 2027/28 and associated talk namespaces. Enable subpages, visual editor, for years namespaces, update wgContentNamespaces with years namespaces, update default site search to 2025 namespace.
[mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729)
[02:03:44] (PS4) Robertsky: updating wikimaniawiki namespace configurations: [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729)
[02:07:54] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:11:05] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:11:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2287.codfw.wmnet with OS bookworm
[02:11:07] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:11:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2288.codfw.wmnet with OS bookworm
[02:11:09] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:11:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:11:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2289.codfw.wmnet with OS bookworm
[02:11:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2286.codfw.wmnet with OS bookworm
[02:11:17] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331,
wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676741 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2287.codfw.wmnet with OS...
[02:11:19] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676742 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2288.codfw.wmnet with OS...
[02:11:21] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676743 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2289.codfw.wmnet with OS...
[02:11:23] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676744 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2286.codfw.wmnet with OS...
[02:12:43] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10676745 (Jhancock.wm)
[02:21:05] (PS1) Robertsky: update wikimaniawiki perms configurations: [mediawiki-config] - https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729)
[02:43:53] (CR) Robertsky: updating wikimaniawiki namespace configurations: (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[02:50:22] (CR) Robertsky: update wikimaniawiki perms configurations: (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[02:50:42] (CR) Robertsky: update wikimaniawiki perms configurations: (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[03:36:58] ops-ulsfo, SRE, DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389884#10676863 (phaultfinder)
[03:44:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10676883 (phaultfinder)
[03:45:33] ops-ulsfo, SRE, DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389884#10676884 (phaultfinder)
[03:46:36] ops-codfw, SRE, DC-Ops: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#10676886 (Papaul) patch panel detail connection diagram {F58922230}
[03:54:46] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T389973#10676888 (phaultfinder)
[03:56:41] (PS1) KartikMistry: Update recommendation-api to 2025-03-25-091801-production [deployment-charts] - https://gerrit.wikimedia.org/r/1131128 (https://phabricator.wikimedia.org/T306508)
[04:03:51] FIRING:
CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:08:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:18:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:59:01] (CR) KartikMistry: [C:+2] Update recommendation-api to 2025-03-25-091801-production [deployment-charts] - https://gerrit.wikimedia.org/r/1131128 (https://phabricator.wikimedia.org/T306508) (owner: KartikMistry)
[05:00:42] (Merged) jenkins-bot: Update recommendation-api to 2025-03-25-091801-production [deployment-charts] - https://gerrit.wikimedia.org/r/1131128 (https://phabricator.wikimedia.org/T306508) (owner: KartikMistry)
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:07:54] !log kartik@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for
release 'main' .
[05:09:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10676963 (phaultfinder)
[05:17:17] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131050 (https://phabricator.wikimedia.org/T387821) (owner: Nik Gkountas)
[05:29:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677028 (phaultfinder)
[05:34:40] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:39:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677034 (phaultfinder)
[05:54:05] ops-codfw, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T389973#10677040 (phaultfinder)
[05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T0600)
[06:10:29] (PS1) Marostegui: db1152,db2142: Disable sync_binlog [puppet] - https://gerrit.wikimedia.org/r/1131131 (https://phabricator.wikimedia.org/T387332)
[06:10:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677045 (phaultfinder)
[06:11:15] (CR) Marostegui: "I've made this change live."
[puppet] - https://gerrit.wikimedia.org/r/1131131 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[06:11:18] (CR) Marostegui: [C:+2] db1152,db2142: Disable sync_binlog [puppet] - https://gerrit.wikimedia.org/r/1131131 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[06:13:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2220', diff saved to https://phabricator.wikimedia.org/P74419 and previous config saved to /var/cache/conftool/dbconfig/20250326-061320-marostegui.json
[06:13:57] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2220.codfw.wmnet with reason: Lagging
[06:20:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74420 and previous config saved to /var/cache/conftool/dbconfig/20250326-062011-root.json
[06:22:29] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:24:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677069 (phaultfinder)
[06:32:29] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74421 and previous config saved to /var/cache/conftool/dbconfig/20250326-063517-root.json
[06:44:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit -
https://phabricator.wikimedia.org/T388236#10677077 (10phaultfinder) [06:49:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677083 (10phaultfinder) [06:50:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74422 and previous config saved to /var/cache/conftool/dbconfig/20250326-065022-root.json [06:50:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2181 T381475', diff saved to https://phabricator.wikimedia.org/P74423 and previous config saved to /var/cache/conftool/dbconfig/20250326-065037-marostegui.json [06:50:42] T381475: Productionize x3 hosts - https://phabricator.wikimedia.org/T381475 [06:52:35] (03PS1) 10Marostegui: db2241: Productionize [puppet] - 10https://gerrit.wikimedia.org/r/1131219 (https://phabricator.wikimedia.org/T381475) [06:53:36] (03CR) 10Marostegui: [C:03+2] db2241: Productionize [puppet] - 10https://gerrit.wikimedia.org/r/1131219 (https://phabricator.wikimedia.org/T381475) (owner: 10Marostegui) [06:55:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677096 (10phaultfinder) [07:02:49] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2181.codfw.wmnet onto db2241.codfw.wmnet [07:05:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74426 and previous config saved to /var/cache/conftool/dbconfig/20250326-070527-root.json [07:08:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:14:37] 
10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677139 (10phaultfinder) [07:19:40] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:20:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2220 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74427 and previous config saved to /var/cache/conftool/dbconfig/20250326-072033-root.json [07:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677153 (10phaultfinder) [07:27:53] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#10677156 (10Aklapper) p:05Medium→03High * `PHP Warni... [07:50:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677181 (10phaultfinder) [07:52:20] (03CR) 10Slyngshede: [C:03+2] Alert when mirrors become out of date [alerts] - 10https://gerrit.wikimedia.org/r/1130964 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [07:53:58] (03Merged) 10jenkins-bot: Alert when mirrors become out of date [alerts] - 10https://gerrit.wikimedia.org/r/1130964 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [07:55:00] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T389973#10677195 (10phaultfinder) [08:00:05] Amir1, Urbanecm, and awight: gettimeofday() says it's time for UTC morning backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T0800) [08:00:05] zip and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:54] morning! [08:00:59] I knew I'd forgotten something.... [08:01:26] here [08:01:48] zip: go ahead with your patch and ping me when done. [08:01:48] mine's a script change on a script that nobody except my team should be running, so my test process is gonna be running it with `--dry-run` later today and then resurfacing for another backport window if we really fucked it up [08:02:06] oh uh, I've not really run one of these before [08:02:44] but if you don't mind hanging around to help me double check my moves I can go ahead [08:03:50] Not so familiar with the current script either :/ [08:04:55] Just do --dry-run as script is supposed to run manually, right? [08:05:12] (03PS1) 10Slyngshede: P:systemd::timesyncd remove deprecated Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1131259 (https://phabricator.wikimedia.org/T350694) [08:06:33] alright, I'll get started [08:06:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zoe@deploy1003 using scap backport" [extensions/Flow] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1130989 (https://phabricator.wikimedia.org/T380911) (owner: 10Zoe) [08:07:45] (03CR) 10Tiziano Fogli: [C:03+1] "@ayounsi@wikimedia.org you're welcome" [alerts] - 10https://gerrit.wikimedia.org/r/1130625 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [08:13:43] this is not a fast process, is it? 
[08:14:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677208 (10phaultfinder) [08:15:18] (03CR) 10Tiziano Fogli: [C:03+1] karma: strip sre-irc receiver if duplicated [puppet] - 10https://gerrit.wikimedia.org/r/1131028 (https://phabricator.wikimedia.org/T353457) (owner: 10Filippo Giunchedi) [08:15:48] (03Merged) 10jenkins-bot: Archive user talk pages even if the userpage doesn't exist [extensions/Flow] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1130989 (https://phabricator.wikimedia.org/T380911) (owner: 10Zoe) [08:15:54] zip: CI on extensions can be slow, you can check progress here https://integration.wikimedia.org/zuul/ under gate-and-submit-wmf [08:16:19] ah it just finished :) [08:16:26] !log zoe@deploy1003 Started scap sync-world: Backport for [[gerrit:1130989|Archive user talk pages even if the userpage doesn't exist (T380911)]] [08:16:30] T380911: Run Flow migration script at *Phase 2b* wikis - https://phabricator.wikimedia.org/T380911 [08:16:49] (03PS1) 10Muehlenhoff: Fix tracking entry for user [puppet] - 10https://gerrit.wikimedia.org/r/1131262 [08:16:56] (03CR) 10Nikerabbit: "Should this be tagged with T389920 instead?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131050 (https://phabricator.wikimedia.org/T387821) (owner: 10Nik Gkountas) [08:16:58] I should probably add "zoe@" to my highlight list, along with "zoe)" [08:20:42] (03CR) 10Volans: [C:03+2] interactive: add NullHandler to the notify logger [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1131032 (owner: 10Volans) [08:23:17] !log zoe@deploy1003 zoe: Backport for [[gerrit:1130989|Archive user talk pages even if the userpage doesn't exist (T380911)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:23:21] T380911: Run Flow migration script at *Phase 2b* wikis - https://phabricator.wikimedia.org/T380911 [08:23:22] !log zoe@deploy1003 zoe: Continuing with sync [08:25:39] (03Merged) 10jenkins-bot: interactive: add NullHandler to the notify logger [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1131032 (owner: 10Volans) [08:28:21] (03CR) 10Muehlenhoff: [C:03+2] Fix tracking entry for user [puppet] - 10https://gerrit.wikimedia.org/r/1131262 (owner: 10Muehlenhoff) [08:28:48] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.3.1 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1131263 [08:28:58] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v1.3.1 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1131263 (owner: 10Volans) [08:30:36] !log zoe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130989|Archive user talk pages even if the userpage doesn't exist (T380911)]] (duration: 14m 10s) [08:30:41] T380911: Run Flow migration script at *Phase 2b* wikis - https://phabricator.wikimedia.org/T380911 [08:31:14] great! kart_ I'm all done [08:32:00] cool. 
[08:33:30] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.3.1 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1131263 (owner: 10Volans) [08:34:55] (03PS1) 10Volans: Upstream release v1.3.1 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1131264 [08:35:01] (03CR) 10Volans: [C:03+2] Upstream release v1.3.1 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1131264 (owner: 10Volans) [08:37:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131050 (https://phabricator.wikimedia.org/T387821) (owner: 10Nik Gkountas) [08:37:34] (03PS1) 10Muehlenhoff: partman: ganeti5-efi.cfg: Fix RAID offsets [puppet] - 10https://gerrit.wikimedia.org/r/1131265 [08:38:07] (03Merged) 10jenkins-bot: Add all language codes to SectionTranslationTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131050 (https://phabricator.wikimedia.org/T387821) (owner: 10Nik Gkountas) [08:38:30] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1131050|Add all language codes to SectionTranslationTargetLanguages (T387821)]] [08:38:35] T387821: Deploy unified dashboard on more wikis (phase 3) - https://phabricator.wikimedia.org/T387821 [08:39:38] (03Merged) 10jenkins-bot: Upstream release v1.3.1 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/1131264 (owner: 10Volans) [08:41:28] (03CR) 10Seanleong-wmde: "Ahh, I'm not sure if I should run it in the beta cluster first or just go straight to the main one. What do you suggest?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (https://phabricator.wikimedia.org/T384455) (owner: 10Seanleong-wmde) [08:43:39] 10ops-drmrs: Port with no description on access switch - https://phabricator.wikimedia.org/T390028 (10phaultfinder) 03NEW [08:45:17] !log kartik@deploy1003 kartik, ngkountas: Backport for [[gerrit:1131050|Add all language codes to SectionTranslationTargetLanguages (T387821)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:45:22] T387821: Deploy unified dashboard on more wikis (phase 3) - https://phabricator.wikimedia.org/T387821 [08:45:35] (03CR) 10Muehlenhoff: [C:03+2] partman: ganeti5-efi.cfg: Fix RAID offsets [puppet] - 10https://gerrit.wikimedia.org/r/1131265 (owner: 10Muehlenhoff) [08:47:14] (03PS1) 10Federico Ceratto: clone.py: Retry fetching remote host [cookbooks] - 10https://gerrit.wikimedia.org/r/1131266 (https://phabricator.wikimedia.org/T390025) [08:47:30] !log kartik@deploy1003 kartik, ngkountas: Continuing with sync [08:47:36] !log uploaded python3-wmflib_1.3.1 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia,bookworm-wikimedia [08:47:38] !log installing dnsmasq security updates [08:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:15] (03CR) 10Filippo Giunchedi: [C:03+2] karma: strip sre-irc receiver if duplicated [puppet] - 10https://gerrit.wikimedia.org/r/1131028 (https://phabricator.wikimedia.org/T353457) (owner: 10Filippo Giunchedi) [08:49:39] (03CR) 10Filippo Giunchedi: alertmanager: Add mediawiki-platform-task (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131025 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [08:49:45] (03CR) 10Thiemo Kreuz (WMDE): "Oh, that's fine! I suggest to update the commit message then to make it more obvious that this is the intention of this patch. E.g. 
"Incre" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (https://phabricator.wikimedia.org/T384455) (owner: 10Seanleong-wmde) [08:50:44] (03CR) 10Filippo Giunchedi: [C:03+1] "\o/ \o/ \o/" [puppet] - 10https://gerrit.wikimedia.org/r/1131259 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [08:50:57] (03CR) 10Marostegui: [C:03+1] "Let's see if this fixes the immediate issues" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131266 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [08:51:03] (03CR) 10Slyngshede: [C:03+2] P:systemd::timesyncd remove deprecated Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1131259 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [08:52:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int releases routed via main (k8s) 1.654s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:53:26] (03PS5) 10Gehel: style(query_service): extract common alerting configuration [puppet] - 10https://gerrit.wikimedia.org/r/1130631 [08:54:59] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131050|Add all language codes to SectionTranslationTargetLanguages (T387821)]] (duration: 16m 28s) [08:55:03] T387821: Deploy unified dashboard on more wikis (phase 3) - https://phabricator.wikimedia.org/T387821 [08:55:41] !log Deployed: Add all language codes to SectionTranslationTargetLanguages (T389920) [08:55:42] (03CR) 10Gehel: [C:03+2] style(query_service): extract common alerting configuration [puppet] - 10https://gerrit.wikimedia.org/r/1130631 (owner: 10Gehel) [08:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:46] T389920: CX 
Language selector entrypoint: TypeError: cxLanguageMatches is null - https://phabricator.wikimedia.org/T389920 [08:55:49] Manually logged ^ [08:57:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int releases routed via main (k8s) 1.654s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:58:10] (03CR) 10Volans: clone.py: Retry fetching remote host (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131266 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [09:00:08] andre and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T0900). 
[09:01:31] (03CR) 10Marostegui: [C:03+1] clone.py: Retry fetching remote host (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131266 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [09:01:41] (03PS1) 10KartikMistry: recommendation-api: Fix typo in version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131268 [09:01:49] (03CR) 10CI reject: [V:04-1] recommendation-api: Fix typo in version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131268 (owner: 10KartikMistry) [09:02:01] (03PS2) 10KartikMistry: recommendation-api: Fix typo in version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131268 [09:02:38] (03CR) 10Marostegui: [C:03+1] clone.py: Retry fetching remote host (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131266 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [09:07:15] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131269 (https://phabricator.wikimedia.org/T386217) [09:07:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2001.codfw.wmnet with OS bookworm [09:07:17] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131269 (https://phabricator.wikimedia.org/T386217) (owner: 10TrainBranchBot) [09:07:53] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10677360 (10ayounsi) [09:08:07] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131269 (https://phabricator.wikimedia.org/T386217) (owner: 10TrainBranchBot) [09:11:48] (03PS1) 10Federico Ceratto: clone.py: Ask for depooling only when needed [cookbooks] - 10https://gerrit.wikimedia.org/r/1131272 (https://phabricator.wikimedia.org/T390025) [09:12:49] (03CR) 10Federico Ceratto: clone.py: Retry fetching remote host (031 comment) 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1131266 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [09:14:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677383 (10phaultfinder) [09:17:47] 06SRE, 06Infrastructure-Foundations, 10netops: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10677390 (10aborrero) >>! In T389958#10674585, @cmooney wrote: > @aborrero as discussed we can possibly arrange a window for Thurs Mar 27th to carry out the remaining st... [09:19:52] (03PS5) 10Seanleong-wmde: Increase entityAccessLimit on the beta cluster from 400 to 500 for all wikis except Commons. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (https://phabricator.wikimedia.org/T384455) [09:20:00] (03CR) 10CI reject: [V:04-1] Increase entityAccessLimit on the beta cluster from 400 to 500 for all wikis except Commons. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (https://phabricator.wikimedia.org/T384455) (owner: 10Seanleong-wmde) [09:20:24] (03CR) 10Seanleong-wmde: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (https://phabricator.wikimedia.org/T384455) (owner: 10Seanleong-wmde) [09:22:12] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.22 refs T386217 [09:22:17] T386217: 1.44.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T386217 [09:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677416 (10phaultfinder) [09:25:24] (03PS1) 10Kevin Bazira: ml-services: update article-country image and weighted_tags env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131274 (https://phabricator.wikimedia.org/T389768) [09:25:43] 06SRE, 06Infrastructure-Foundations, 10netops, 07IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10677419 (10taavi) 
[09:27:02] 06SRE, 06Infrastructure-Foundations, 10netops, 07IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10677421 (10taavi) [09:29:26] (03PS1) 10Slyngshede: apereo_cas::service exclude tools and servicegroups by default [puppet] - 10https://gerrit.wikimedia.org/r/1131275 [09:30:28] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-test2001.codfw.wmnet with reason: host reimage [09:32:22] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2181.codfw.wmnet onto db2241.codfw.wmnet [09:33:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-test2001.codfw.wmnet with reason: host reimage [09:36:43] (03PS1) 10Marostegui: mariadb: Productionize db2242 [puppet] - 10https://gerrit.wikimedia.org/r/1131278 (https://phabricator.wikimedia.org/T381475) [09:37:11] (03CR) 10Volans: clone.py: Retry fetching remote host (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131266 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [09:37:29] FIRING: ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:39:00] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db2242 [puppet] - 10https://gerrit.wikimedia.org/r/1131278 (https://phabricator.wikimedia.org/T381475) (owner: 10Marostegui) [09:39:40] RESOLVED: ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:40:53] (03PS1) 10Slyngshede: apereo_cas::service: remove unused service entry [labs/private] - 10https://gerrit.wikimedia.org/r/1131280 [09:42:56] (03CR) 10Filippo Giunchedi: prometheus: add recording rules for use by histogram_quantile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1130689 (https://phabricator.wikimedia.org/T383963) (owner: 10Cwhite) [09:44:11] (03PS1) 10Elukey: role::ml_k8s::master: move to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131282 (https://phabricator.wikimedia.org/T387854) [09:44:20] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2181.codfw.wmnet onto db2242.codfw.wmnet [09:44:35] (03CR) 10CI reject: [V:04-1] role::ml_k8s::master: move to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131282 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [09:44:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677465 (10phaultfinder) [09:45:24] (03PS2) 10Elukey: role::ml_k8s::master: move to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131282 (https://phabricator.wikimedia.org/T387854) [09:48:28] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5152/" [puppet] - 10https://gerrit.wikimedia.org/r/1131282 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [09:49:18] (03CR) 10Elukey: role::ml_k8s::master: move to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131282 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [09:50:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-test2001.codfw.wmnet with OS bookworm [09:58:02] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (DIFF 1 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5154/" 
[puppet] - 10https://gerrit.wikimedia.org/r/1131282 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [09:58:39] (03CR) 10Klausman: [V:03+2 C:03+2] role::ml_k8s::master: move to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131282 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [09:58:43] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T389973#10677536 (10phaultfinder) [09:59:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677542 (10phaultfinder) [10:00:04] andre and jnuche: May I have your attention please! MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T0900) [10:00:04] elukey, claime, and fabfur: MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T1000). Please do the needful. [10:00:32] Train is already deployed to group1. [10:00:45] we'll do the infra window later since e.lukey is in a meeting [10:01:17] (03PS1) 10Slyngshede: Permission log: Remove user filter [software/bitu] - 10https://gerrit.wikimedia.org/r/1131285 [10:01:26] (03CR) 10Joely Rooke WMDE: "LGTM : )" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (https://phabricator.wikimedia.org/T384455) (owner: 10Seanleong-wmde) [10:02:51] (03PS1) 10DCausse: wdqs: enable hive/hdfs ingestion for rdf update streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131286 (https://phabricator.wikimedia.org/T389812) [10:05:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [10:08:34] (03CR) 10Clément Goubert: [C:03+1] httpbb: Test /view/fr/Z1 case-insensitively [puppet] - 10https://gerrit.wikimedia.org/r/1131101 (https://phabricator.wikimedia.org/T383032) (owner: 10RLazarus) [10:09:05] (03PS2) 10DCausse: wdqs: enable hive/hdfs ingestion for rdf update streams [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1131286 (https://phabricator.wikimedia.org/T388372) [10:09:32] 06SRE, 06Infrastructure-Foundations, 10netops, 07IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10677602 (10aborrero) announcement: https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/LX6KDZMQHEL3NZ3DMWQERI2O3YVSDDKM/ [10:10:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677606 (10phaultfinder) [10:11:01] (03PS1) 10Slyngshede: CAS Docker: Remove keystore file [software/bitu] - 10https://gerrit.wikimedia.org/r/1131288 [10:12:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [10:13:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2001.codfw.wmnet to cluster codfw_test and group A-test [10:14:05] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti-test2001.codfw.wmnet to cluster codfw_test and group A-test [10:18:32] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [10:20:36] !log joal@deploy1003 Started deploy [analytics/refinery@2364d83]: Analytics webrequest_frontend update [analytics/refinery@2364d83c] [10:21:32] !log klausman@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:21:42] !log klausman@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:22:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [10:22:13] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2002.codfw.wmnet [10:22:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [10:22:38] !log joal@deploy1003 Finished deploy 
[analytics/refinery@2364d83]: Analytics webrequest_frontend update [analytics/refinery@2364d83c] (duration: 02m 01s) [10:22:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet [10:23:01] jouncebot: nowandnext [10:23:01] For the next 0 hour(s) and 36 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T0900) [10:23:01] For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T1000) [10:23:02] In 0 hour(s) and 36 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T1100) [10:23:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [10:23:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:24:29] !log joal@deploy1003 Started deploy [analytics/refinery@2364d83] (thin): Analytics webrequest_frontend update THIN [analytics/refinery@2364d83c] [10:24:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677676 (10phaultfinder) [10:24:44] (03PS1) 10Ladsgroup: Bump thumbnail steps to 45% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131290 (https://phabricator.wikimedia.org/T360589) [10:25:29] !log joal@deploy1003 Finished deploy [analytics/refinery@2364d83] (thin): Analytics webrequest_frontend update THIN [analytics/refinery@2364d83c] (duration: 00m 59s) [10:27:31] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by 
includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#10677686 (10hashar) There were ~ 9300 of them yesterday:... [10:27:47] (03CR) 10Slyngshede: [C:03+2] CAS Docker: Remove keystore file [software/bitu] - 10https://gerrit.wikimedia.org/r/1131288 (owner: 10Slyngshede) [10:28:01] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515#10677687 (10MoritzMuehlenhoff) [10:29:06] elukey, claime, fabfur (as you are all mentioned for overlapping "MediaWiki infrastructure (UTC mid-day)" on https://wikitech.wikimedia.org/wiki/Deployments#Wednesday,_March_26 [10:29:15] I need to roll back the train for https://phabricator.wikimedia.org/T390032#10677674 [10:29:19] (03CR) 10Ladsgroup: [C:03+2] Bump thumbnail steps to 45% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131290 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:29:34] and our windows overlap, so just a heads-up that I'll do so now [10:29:35] andre: go ahead, we won't start until 1100 because of an overlapping meeting [10:29:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677692 (10phaultfinder) [10:29:47] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [10:29:51] Amir1: you jouncebot'd, heads up [10:30:11] (03Merged) 10jenkins-bot: Bump thumbnail steps to 45% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131290 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:30:19] I cancelled my patch [10:30:25] andre: feel free to go ahead [10:30:28] (03Merged) 10jenkins-bot: CAS Docker: Remove keystore file [software/bitu] - 10https://gerrit.wikimedia.org/r/1131288 (owner: 10Slyngshede) [10:30:31] thanks [10:30:51] you can deploy my patch too at the same time (if it's not possible to deploy separately) [10:31:07] it's very straightforward. 
It won't break anything [10:31:22] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131291 (https://phabricator.wikimedia.org/T386217) [10:31:24] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131291 (https://phabricator.wikimedia.org/T386217) (owner: 10TrainBranchBot) [10:31:47] Moving back from .22 to old .21 on group1 now [10:32:02] 🥳 [10:32:10] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131291 (https://phabricator.wikimedia.org/T386217) (owner: 10TrainBranchBot) [10:33:18] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [10:33:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-eqdfw:ae0 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:34:40] FIRING: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:36:40] (03CR) 10Muehlenhoff: [C:03+1] "It used to exist, but was recently decommissioned, so let's link https://phabricator.wikimedia.org/T389172 to the commit message?" 
[labs/private] - 10https://gerrit.wikimedia.org/r/1131280 (owner: 10Slyngshede) [10:38:34] (03PS2) 10Federico Ceratto: clone.py: Retry fetching remote host [cookbooks] - 10https://gerrit.wikimedia.org/r/1131266 (https://phabricator.wikimedia.org/T390025) [10:38:34] (03PS2) 10Federico Ceratto: clone.py: Ask for depooling only when needed [cookbooks] - 10https://gerrit.wikimedia.org/r/1131272 (https://phabricator.wikimedia.org/T390025) [10:41:54] (03CR) 10David Caro: [C:03+1] openstack: rename lan-flat-cloudinstances2b to VLAN/legacy [puppet] - 10https://gerrit.wikimedia.org/r/1131013 (https://phabricator.wikimedia.org/T389942) (owner: 10Arturo Borrero Gonzalez) [10:42:07] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: rename lan-flat-cloudinstances2b to VLAN/legacy [puppet] - 10https://gerrit.wikimedia.org/r/1131013 (https://phabricator.wikimedia.org/T389942) (owner: 10Arturo Borrero Gonzalez) [10:44:08] (03PS1) 10Muehlenhoff: Switch new ganeti servers to use EFI [puppet] - 10https://gerrit.wikimedia.org/r/1131293 (https://phabricator.wikimedia.org/T384838) [10:45:15] (03PS2) 10Slyngshede: apereo_cas::service: remove unused service entry [labs/private] - 10https://gerrit.wikimedia.org/r/1131280 [10:45:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677735 (10phaultfinder) [10:45:55] (03PS2) 10Muehlenhoff: Switch new ganeti servers to use EFI [puppet] - 10https://gerrit.wikimedia.org/r/1131293 (https://phabricator.wikimedia.org/T384838) [10:47:07] (03CR) 10Muehlenhoff: apereo_cas::service: remove unused service entry [labs/private] - 10https://gerrit.wikimedia.org/r/1131280 (owner: 10Slyngshede) [10:48:00] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.22 refs T386217 [10:48:06] T386217: 1.44.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T386217 [10:48:34] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
sretest2001.codfw.wmnet with OS bookworm [10:49:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet [10:50:34] FYI Train rollback finished [10:50:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677749 (10phaultfinder) [10:50:56] I deploy my change then [10:51:07] Amir1: go ahead :) [10:51:13] Thanks <3 [10:51:14] thanks for the waiting and sorry for the interruption [10:51:26] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1131290|Bump thumbnail steps to 45% (T360589)]] [10:51:30] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:51:33] no worries. Sorry for this mess. I have to do the deployment 5% every day. It's sooo boring [10:51:36] (03CR) 10Federico Ceratto: "Partially "tested" using dry-run" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131272 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [10:53:03] oh yeah, that task :-/ [10:55:58] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [10:56:38] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1131290|Bump thumbnail steps to 45% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:56:42] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:57:24] (03CR) 10Hnowlan: "recheck" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1117896 (owner: 10Muehlenhoff) [10:58:33] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [10:58:50] (03PS5) 10Bvibber: Add JsonConfig's globaljsonlinks tables to catalog [puppet] - 10https://gerrit.wikimedia.org/r/1123022 (https://phabricator.wikimedia.org/T363581) [10:58:53] (03CR) 10Ladsgroup: [V:03+2 C:03+2] Add JsonConfig's globaljsonlinks tables to catalog [puppet] - 
10https://gerrit.wikimedia.org/r/1123022 (https://phabricator.wikimedia.org/T363581) (owner: 10Bvibber) [11:00:05] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T1100). nyaa~ [11:00:21] claime: ready whenever the rollback finishes [11:00:52] I see there's a couple trains going out – I'd like to (dry-run) my scripts, can I do so after this deploy? [11:03:08] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for acooper - https://phabricator.wikimedia.org/T389924#10677774 (10SLyngshede-WMF) We talked it over and believe that the correct approach is to remove the field for the Phabricator ticket. It's already optional and if people need to refer to... [11:03:22] zip: How long would the runs be? [11:04:26] (03CR) 10Federico Ceratto: "I "tested" it with dry-run together with https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1131272" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131266 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [11:04:35] claime: we should probably move our patch then, wdyt? [11:04:38] we will redeploy all of mw-on-k8s for https://phabricator.wikimedia.org/T318285 and this time we need to test it out fully so it may take a bit [11:04:41] there is no rush, we can try tomorrow [11:04:48] (03CR) 10Marostegui: [C:03+1] clone.py: Ask for depooling only when needed [cookbooks] - 10https://gerrit.wikimedia.org/r/1131272 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [11:05:12] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for acooper - https://phabricator.wikimedia.org/T389924#10677778 (10MoritzMuehlenhoff) >>! In T389924#10674483, @Scott_French wrote: > Followed up with @acooper out of band: > > In short, the guidance in the self-service access request flow in... 
[11:05:19] (03CR) 10Marostegui: [C:03+1] "Let's merge and discuss further actions, if we need more on the task" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131266 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [11:05:38] (03CR) 10Gkyziridis: [C:03+1] "Thnx Kevin!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131274 (https://phabricator.wikimedia.org/T389768) (owner: 10Kevin Bazira) [11:05:50] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131290|Bump thumbnail steps to 45% (T360589)]] (duration: 14m 23s) [11:05:52] claime: oh like 2 minutes each, the main wait is for the pod to start [11:05:54] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:05:58] zip: ok send it [11:06:03] yeeting immediately [11:06:05] and we'll do our deploy afterwards [11:06:18] elukey: we should be fine to do it today, I'd really like to have it out of the way [11:06:41] I could probably streamline this by not following the logs and gathering them afterwards but it'd take me longer to work out how to do that sensibly [11:06:58] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1131285 (owner: 10Slyngshede) [11:07:02] it's fine zip, dwai [11:07:15] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [11:07:26] oh you know what, I can just ssh a second time [11:07:59] or use tmux and open a second buffer :p [11:09:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2002.codfw.wmnet with OS bookworm [11:09:14] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515#10677786 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti-test2002.codfw.wmnet with OS bookworm [11:09:27] !log elukey@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [11:10:27] 06SRE, 06Infrastructure-Foundations, 10netops, 07IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10677790 (10cmooney) [11:10:34] all done [11:11:04] I'll wanna run these for real later but for now I'll be updating tickets and reading the output to double check that it looks sensible [11:11:10] cool [11:11:12] elukey: ready? [11:11:25] ready yes! [11:11:37] do you want to drive, should I do it? [11:11:48] if you do it I am fine, easier and more precise :D [11:11:50] !log Disabling puppet on deploy1003 - T318285 [11:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:55] T318285: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285 [11:11:56] I'll do it, you test ? [11:12:00] sure [11:12:20] (03CR) 10Clément Goubert: [C:03+2] www.wikipedia.org: fix "search" URL parameter [puppet] - 10https://gerrit.wikimedia.org/r/1123622 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [11:13:00] Pre deploy mwdebug httpbb tests are green [11:13:48] cc: fabfur: --^ [11:13:51] we are starting [11:14:08] !log Enabling puppet on deploy1003 - T318285 [11:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:27] !log Running puppet on deploy1003 - T318285 [11:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:07] !log joal@deploy1003 Started deploy [analytics/refinery@2364d83] (hadoop-test): Analytics webrequest_frontend update TEST [analytics/refinery@2364d83c] [11:16:27] puppet runs are so long [11:16:55] !log joal@deploy1003 Finished deploy [analytics/refinery@2364d83] (hadoop-test): Analytics webrequest_frontend update TEST [analytics/refinery@2364d83c] (duration: 00m 47s) [11:16:59] this is needed to allow you to chat with 
your colleagues [11:17:02] been loading facts for 2 minutes [11:17:08] otherwise you are always deep into coding [11:17:14] lol [11:17:26] zonebreaking as a service [11:17:43] :D [11:17:57] I have the mwdebug extension ON in my browser for www.wikipedia.org [11:18:10] pointing to mw-debug-k8s yeah? [11:18:14] yep [11:18:19] cool [11:18:48] I am checking the task to see the various use cases [11:19:07] !log Running httpbb tests on deploy1003 before deploying apache change, should fail - T318285 [11:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:11] T318285: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285 [11:19:21] (03PS4) 10Filippo Giunchedi: prometheus: add recording rules for use by histogram_quantile [puppet] - 10https://gerrit.wikimedia.org/r/1130689 (https://phabricator.wikimedia.org/T383963) (owner: 10Cwhite) [11:19:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677817 (10phaultfinder) [11:19:37] https://www.wikipedia.org/?search=foobar (/srv/deployment/httpbb-tests/appserver/test_wwwportals.yaml:27) [11:19:39] Status code: expected 200, got 301. [11:19:41] https://www.wikipedia.org/?search=foobar&uselang=de (/srv/deployment/httpbb-tests/appserver/test_wwwportals.yaml:29) [11:19:43] Status code: expected 200, got 301.
[11:19:46] fails as expected, deploying [11:20:21] (03PS1) 10Filippo Giunchedi: prometheus: remove unused rate2m recording rules for edit count [puppet] - 10https://gerrit.wikimedia.org/r/1131295 [11:21:10] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus: add recording rules for use by histogram_quantile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1130689 (https://phabricator.wikimedia.org/T383963) (owner: 10Cwhite) [11:21:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int releases routed via main (k8s) 1.19s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:21:27] ^ not a problem, codfw depooled [11:22:08] !log cgoubert@deploy1003 Started scap sync-world: T318285 - 1123622 - www.wikipedia.org: fix search URL parameter [11:23:04] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti-test2002.codfw.wmnet with OS bookworm [11:23:08] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515#10677822 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti-test2002.codfw.wmnet with OS bookworm executed with errors: - ganeti-... 
[11:23:27] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti-test2002.codfw.wmnet'] [11:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677837 (10phaultfinder) [11:24:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm [11:25:08] (03CR) 10Hnowlan: [C:03+1] Allow dot in revision title [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130728 (https://phabricator.wikimedia.org/T389628) (owner: 10Arlolra) [11:25:33] I'm doing some learning the hard way. I've backported my change to .22, so my script change is where I need it on one wiki, but not the others that are on .21.... [11:25:33] (03PS1) 10Filippo Giunchedi: pontoon: add license header to config [puppet] - 10https://gerrit.wikimedia.org/r/1131277 [11:25:35] (03PS3) 10Filippo Giunchedi: pontoon: add integration tests for rolegroup bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1128331 [11:25:49] my question then is: will those wikis be on .22 soon or do I need to backport to .21 [11:25:57] !log cgoubert@deploy1003 helmfile [staging-codfw] START helmfile.d/services/mw-debug: apply [11:26:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int releases routed via main (k8s) 1.37s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:26:47] !log cgoubert@deploy1003 cgoubert: T318285 - 1123622 - www.wikipedia.org: fix search URL parameter synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:26:50] !log Running httpbb tests on deploy1003 before deploying apache change, should pass - T318285 [11:26:51] 
T318285: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285 [11:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:05] elukey: you can start external testing [11:27:14] ack [11:27:16] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129831 (owner: 10PipelineBot) [11:27:19] PASS: 151 requests sent to mwdebug.discovery.wmnet. All assertions passed. [11:27:23] httpbb testing green [11:28:08] (03PS1) 10Volans: debian: update signing key [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/1131297 [11:28:46] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129831 (owner: 10PipelineBot) [11:30:19] (03PS2) 10Volans: debian: update signing key [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/1131297 [11:30:58] mmm something doesn't work with my extension, trying curl [11:31:08] elukey: what's not working? [11:32:47] claime: for some reason I get the 301, when trying https://www.wikipedia.org/?search=Does-not-exist%20&uselang=de - the extension is on and pointing to mw-k8s-debug [11:33:07] with curl everything works [11:33:08] I get a 200 [11:33:27] what do you use, k8s-mwdebug? [11:33:31] yeah [11:33:44] and looking at dev tools with Disable Cache toggled [11:34:34] okok that's it, all good [11:34:47] ok to proceed with the rest of prod then? [11:35:03] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:35:04] +1 [11:35:12] then purge cache for these pages, and test again yeah? 
[11:35:17] !log cgoubert@deploy1003 cgoubert: Continuing with sync [11:35:25] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:36:00] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:36:10] (03CR) 10Elukey: [C:03+1] debian: update signing key [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/1131297 (owner: 10Volans) [11:36:44] claime: +1 yes [11:36:46] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add license header to config [puppet] - 10https://gerrit.wikimedia.org/r/1131277 (owner: 10Filippo Giunchedi) [11:36:50] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add integration tests for rolegroup bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1128331 (owner: 10Filippo Giunchedi) [11:38:05] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:38:26] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:38:49] aha, I'm caught up [11:39:03] !log Running puppet on cumin1002 to deploy new httpbb tests - T318285 [11:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:07] T318285: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285 [11:39:43] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:40:52] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129829 (owner: 10PipelineBot) [11:40:58] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129828 (owner: 10PipelineBot) [11:41:29] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129830 (owner: 10PipelineBot) [11:41:48] !log cgoubert@deploy1003 Finished scap sync-world: T318285 - 1123622 - www.wikipedia.org: fix search URL parameter (duration: 20m 34s) [11:42:34] 
elukey: go ahead for purge and test in prod [11:42:47] !log Running new httpbb tests on cumin1002 - T318285 [11:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:20] !log filippo@cumin1002 conftool action : set/weight=10; selector: name=prometheus1007.eqiad.wmnet [11:44:25] !log filippo@cumin1002 conftool action : set/weight=10; selector: name=prometheus1008.eqiad.wmnet [11:45:27] FTR those hosts are being prepped, they are not pooled in pybal [11:45:49] !log New httpbb tests on cumin1002 green - T318285 [11:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:53] (03PS1) 10Elukey: benthos: update the webrequest_live instance [puppet] - 10https://gerrit.wikimedia.org/r/1131300 (https://phabricator.wikimedia.org/T390029) [11:45:54] T318285: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285 [11:46:15] (03PS1) 10Filippo Giunchedi: hieradata: move k8s prometheus1005 -> 1007 [puppet] - 10https://gerrit.wikimedia.org/r/1131301 (https://phabricator.wikimedia.org/T383232) [11:46:16] (03PS1) 10Filippo Giunchedi: hieradata: move k8s prometheus1006 -> 1008 [puppet] - 10https://gerrit.wikimedia.org/r/1131302 (https://phabricator.wikimedia.org/T383232) [11:46:29] (03CR) 10Volans: [C:03+2] debian: update signing key [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/1131297 (owner: 10Volans) [11:47:24] Hmm should uselang=de change the dropdown language to DE? [11:47:26] claime: the only thing that I know is to run `echo 'https://www.wikipedia.org' | mwscript purgeList.php` from mwmaint1002 [11:47:48] Let me run that with mwscript-k8s from deployu [11:47:53] I wondered the same, but didn't find it in the task's descr [11:49:39] elukey: cache purged for https://www.wikipedia.org and https://www.wikipedia.org/?search [11:49:50] nice :) [11:49:51] looks like it works for me [11:50:01] yep, success! 
[11:50:33] !log Cache purged for https://www.wikipedia.org and https://www.wikipedia.org/?search - T318285 [11:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:46] !log Deployment done - T318285 [11:50:50] 👍 [11:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:20] 06SRE, 06Discovery-Search, 06serviceops, 10Wikimedia-Apache-configuration, and 3 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10677935 (10Clement_Goubert) The search box now populates correctly. Question left... [11:53:38] 06SRE, 06Discovery-Search, 06serviceops, 10Wikimedia-Apache-configuration, and 3 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10677936 (10Clement_Goubert) 05Open→03Resolved [11:55:19] (03PS2) 10Clément Goubert: alertmanager: Add mediawiki-platform-task [puppet] - 10https://gerrit.wikimedia.org/r/1131025 (https://phabricator.wikimedia.org/T385709) [11:56:04] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm [11:58:11] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5155/co" [puppet] - 10https://gerrit.wikimedia.org/r/1131300 (https://phabricator.wikimedia.org/T390029) (owner: 10Elukey) [11:59:43] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T389973#10677942 (10phaultfinder) [12:00:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10677949 (10phaultfinder) [12:00:59] (03CR) 10Slyngshede: [C:03+2] Permission log: Remove user filter [software/bitu] - 10https://gerrit.wikimedia.org/r/1131285 (owner: 10Slyngshede) [12:02:25] 06SRE, 06Infrastructure-Foundations, 
10netops: Classify ceph traffic flows for network prioritization - https://phabricator.wikimedia.org/T390044 (10cmooney) 03NEW p:05Triage→03Low [12:02:51] (03CR) 10Slyngshede: [V:03+2 C:03+2] apereo_cas::service: remove unused service entry [labs/private] - 10https://gerrit.wikimedia.org/r/1131280 (owner: 10Slyngshede) [12:03:30] 06SRE, 06Infrastructure-Foundations, 10netops: Classify ceph traffic flows for network prioritization - https://phabricator.wikimedia.org/T390044#10677969 (10cmooney) [12:03:31] (03Merged) 10jenkins-bot: Permission log: Remove user filter [software/bitu] - 10https://gerrit.wikimedia.org/r/1131285 (owner: 10Slyngshede) [12:03:49] (03CR) 10Elukey: "Tried to find an easy solution, lemme know!" [puppet] - 10https://gerrit.wikimedia.org/r/1131300 (https://phabricator.wikimedia.org/T390029) (owner: 10Elukey) [12:04:10] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5157/co" [puppet] - 10https://gerrit.wikimedia.org/r/1131275 (owner: 10Slyngshede) [12:04:53] (03CR) 10Clément Goubert: alertmanager: Add mediawiki-platform-task (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131025 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [12:05:21] (03Merged) 10jenkins-bot: debian: update signing key [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/1131297 (owner: 10Volans) [12:07:00] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [12:07:16] (03PS1) 10Cory Massaro: wikifunctions: Update orchestrator from 2025-03-19-203723 to 2025-03-25-145119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131305 (https://phabricator.wikimedia.org/T386426) [12:10:15] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage [12:13:16] (03CR) 10Hnowlan: 
"I think this might not be an effective change - the `timeout` paramter is a protobuf Duration object, which is expressed with the s suffix" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130574 (owner: 10Elukey) [12:14:34] (03CR) 10Kevin Bazira: [C:03+2] "ευχαριστώ :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131274 (https://phabricator.wikimedia.org/T389768) (owner: 10Kevin Bazira) [12:14:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10678047 (10phaultfinder) [12:16:05] (03Merged) 10jenkins-bot: ml-services: update article-country image and weighted_tags env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131274 (https://phabricator.wikimedia.org/T389768) (owner: 10Kevin Bazira) [12:19:13] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [12:19:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti-test2002.codfw.wmnet'] [12:21:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2002.codfw.wmnet with OS bookworm [12:21:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515#10678061 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti-test2002.codfw.wmnet with OS bookworm [12:23:15] (03CR) 10Slyngshede: [C:03+2] Netbox alerting: Add remaining netbox alert. [alerts] - 10https://gerrit.wikimedia.org/r/1092779 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:24:29] (03Merged) 10jenkins-bot: Netbox alerting: Add remaining netbox alert. 
[alerts] - 10https://gerrit.wikimedia.org/r/1092779 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10678088 (10phaultfinder) [12:24:59] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm [12:25:55] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . [12:28:28] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [12:29:43] (03CR) 10KartikMistry: [C:03+2] recommendation-api: Fix typo in version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131268 (owner: 10KartikMistry) [12:31:04] (03Merged) 10jenkins-bot: recommendation-api: Fix typo in version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131268 (owner: 10KartikMistry) [12:31:23] (03CR) 10Michael Große: [C:03+1] Growth: Remove unused PHP config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128828 (https://phabricator.wikimedia.org/T388787) (owner: 10Cyndywikime) [12:32:29] RESOLVED: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:36:42] !log btullis@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM dse-k8s-ctrl1002.eqiad.wmnet [12:37:18] !log kartik@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[12:37:30] (03CR) 10Muehlenhoff: [C:03+2] Enable maps-test2003 to maps-test2006 as additional maps bookworm replicas [puppet] - 10https://gerrit.wikimedia.org/r/1115863 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:37:58] (03PS1) 10Ayounsi: gNMIc: subscribe to alerts states [puppet] - 10https://gerrit.wikimedia.org/r/1131306 (https://phabricator.wikimedia.org/T388641) [12:39:26] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-test2002.codfw.wmnet with reason: host reimage [12:40:01] (03CR) 10Thiemo Kreuz (WMDE): [C:03+1] Increase entityAccessLimit on the beta cluster from 400 to 500 for all wikis except Commons. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (https://phabricator.wikimedia.org/T384455) (owner: 10Seanleong-wmde) [12:40:20] (03PS2) 10Ayounsi: gNMIc: subscribe to alerts states [puppet] - 10https://gerrit.wikimedia.org/r/1131306 (https://phabricator.wikimedia.org/T388641) [12:41:08] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131306 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:41:36] !log btullis@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM dse-k8s-ctrl1002.eqiad.wmnet [12:42:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-test2002.codfw.wmnet with reason: host reimage [12:43:12] (03CR) 10FNegri: [C:03+2] Failover all dumps traffic to clouddumps1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131051 (https://phabricator.wikimedia.org/T383723) (owner: 10FNegri) [12:43:42] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, 06serviceops: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10678176 (10thcipriani) →14Duplicate dup:03T387833 [12:45:49] (03PS1) 10Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner to 17.8 [puppet] - 
10https://gerrit.wikimedia.org/r/1131309 (https://phabricator.wikimedia.org/T390049) [12:46:45] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1131309 (https://phabricator.wikimedia.org/T390049) (owner: 10Jelto) [12:47:03] (03PS1) 10Gergő Tisza: Enable SUL3 login for 50% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131310 (https://phabricator.wikimedia.org/T384219) [12:47:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131310 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [12:47:44] (03CR) 10Jelto: [C:03+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 17.8 [puppet] - 10https://gerrit.wikimedia.org/r/1131309 (https://phabricator.wikimedia.org/T390049) (owner: 10Jelto) [12:53:38] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10678214 (10MoritzMuehlenhoff) [12:56:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10678229 (10phaultfinder) [12:57:03] (03PS2) 10Slyngshede: C:raid:perccli do not error out if controller is no in use [puppet] - 10https://gerrit.wikimedia.org/r/1126542 [12:57:26] (03CR) 10CI reject: [V:04-1] C:raid:perccli do not error out if controller is no in use [puppet] - 10https://gerrit.wikimedia.org/r/1126542 (owner: 10Slyngshede) [13:00:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-test2002.codfw.wmnet with OS bookworm [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T1300). 
[13:00:05] phuedx and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515#10678233 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti-test2002.codfw.wmnet with OS bookworm completed: - ganeti-test2002 (*... [13:00:48] o/ [13:04:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [13:04:53] (03PS1) 10Kevin Bazira: ml-services: update rrla staging image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131315 (https://phabricator.wikimedia.org/T326179) [13:07:45] !log running sendBulkEmail.php as per T389064#10676651 [13:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:50] T389064: Notify WebAuthn users about SUL3 changes - https://phabricator.wikimedia.org/T389064 [13:07:53] I'll deploy [13:09:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129270 (owner: 10Phuedx) [13:09:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131310 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [13:10:38] (03Merged) 10jenkins-bot: ext-EventStreamConfig: Reduce product_metrics.web_base data collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129270 (owner: 10Phuedx) [13:10:40] (03Merged) 10jenkins-bot: Enable SUL3 login for 50% of group 2 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131310 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [13:11:03] !log tgr@deploy1003 Started scap sync-world: Backport 
for [[gerrit:1129270|ext-EventStreamConfig: Reduce product_metrics.web_base data collection]], [[gerrit:1131310|Enable SUL3 login for 50% of group 2 users (T384219)]] [13:11:08] T384219: SUL3 Phase 4: Staged rollout for all existing users - https://phabricator.wikimedia.org/T384219 [13:12:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [13:16:58] !log tgr@deploy1003 phuedx, tgr: Backport for [[gerrit:1129270|ext-EventStreamConfig: Reduce product_metrics.web_base data collection]], [[gerrit:1131310|Enable SUL3 login for 50% of group 2 users (T384219)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:17:02] T384219: SUL3 Phase 4: Staged rollout for all existing users - https://phabricator.wikimedia.org/T384219 [13:18:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2002.codfw.wmnet to cluster codfw_test and group A-test [13:18:08] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti-test2002.codfw.wmnet to cluster codfw_test and group A-test [13:18:19] tgr_: LGTM [13:18:26] !log tgr@deploy1003 phuedx, tgr: Continuing with sync [13:18:30] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.8 [13:18:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2002.codfw.wmnet to cluster codfw_test and group A-test [13:18:59] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti-test2002.codfw.wmnet to cluster codfw_test and group A-test [13:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10678308 (10phaultfinder) [13:25:37] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1129270|ext-EventStreamConfig: Reduce product_metrics.web_base data collection]], [[gerrit:1131310|Enable 
SUL3 login for 50% of group 2 users (T384219)]] (duration: 14m 33s) [13:25:42] T384219: SUL3 Phase 4: Staged rollout for all existing users - https://phabricator.wikimedia.org/T384219 [13:26:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.8 [13:27:44] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 17.8 [13:28:28] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#10678323 (10matmarex) >>! In T369186#9995966, @matmarex... [13:28:29] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10678324 (10Papaul) @Marostegui we will to do this today. 
thanks [13:29:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10678335 (10phaultfinder) [13:30:04] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10678339 (10Papaul) [13:30:25] !log UTC afternoon deploys done [13:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:52] (03PS4) 10Bking: cirrussearch: create a puppet plan for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1131098 (https://phabricator.wikimedia.org/T389971) [13:31:15] (03CR) 10CI reject: [V:04-1] cirrussearch: create a puppet plan for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1131098 (https://phabricator.wikimedia.org/T389971) (owner: 10Bking) [13:33:18] jouncebot: nowandnext [13:33:18] For the next 0 hour(s) and 26 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T1300) [13:33:18] In 0 hour(s) and 26 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T1400) [13:34:14] 06SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817#10678347 (10joanna_borun) Approved [13:34:24] (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-03-19-203723 to 2025-03-25-145119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131318 (https://phabricator.wikimedia.org/T386426) [13:34:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [13:34:43] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti-test2002.codfw.wmnet [13:34:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet [13:36:34] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti 
test cluster to Bookworm - https://phabricator.wikimedia.org/T382515#10678363 (10ops-monitoring-bot) Draining ganeti-test2003.codfw.wmnet of running VMs [13:36:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 17.8 [13:36:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet [13:37:37] (03PS1) 10Ayounsi: gNMIc: collect BFD stats [puppet] - 10https://gerrit.wikimedia.org/r/1131320 (https://phabricator.wikimedia.org/T388641) [13:37:48] (03PS5) 10Bking: cirrussearch: create a puppet plan for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1131098 (https://phabricator.wikimedia.org/T389971) [13:38:46] (03PS2) 10Ayounsi: gNMIc: collect BFD states [puppet] - 10https://gerrit.wikimedia.org/r/1131320 (https://phabricator.wikimedia.org/T388641) [13:38:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet [13:39:00] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to 17.8 [13:39:08] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515#10678394 (10ops-monitoring-bot) Draining ganeti-test2003.codfw.wmnet of running VMs [13:39:17] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#10678395 (10MatthewVernon) I don't think those are swift... 
[13:41:10] (03PS3) 10Ayounsi: gNMIc: collect BFD states [puppet] - 10https://gerrit.wikimedia.org/r/1131320 (https://phabricator.wikimedia.org/T388641) [13:41:36] (03CR) 10Filippo Giunchedi: [C:03+1] alertmanager: Add mediawiki-platform-task [puppet] - 10https://gerrit.wikimedia.org/r/1131025 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [13:42:24] !log btullis@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM dse-k8s-ctrl1001.eqiad.wmnet [13:43:23] (03CR) 10Btullis: [C:03+2] component: puppet dumps web enterprise page update [puppet] - 10https://gerrit.wikimedia.org/r/1130196 (owner: 10Creynolds) [13:43:37] (03CR) 10Filippo Giunchedi: "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/1131300 (https://phabricator.wikimedia.org/T390029) (owner: 10Elukey) [13:44:05] Thanks tgr_ [13:46:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.407s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:47:16] !log btullis@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM dse-k8s-ctrl1001.eqiad.wmnet [13:49:06] (03CR) 10AikoChou: [C:03+1] ml-services: update rrla staging image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131315 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [13:49:46] (03PS1) 10Ilias Sarantopoulos: ml-services: change the edit check staging deployment model name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131323 (https://phabricator.wikimedia.org/T388269) [13:50:56] (03CR) 10Bking: [C:03+2] cirrussearch: create a puppet plan for beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1131098 (https://phabricator.wikimedia.org/T389971) 
(owner: 10Bking) [13:51:12] (03CR) 10Bking: [C:03+2] "self-merging, as this does not affect a production environment" [puppet] - 10https://gerrit.wikimedia.org/r/1131098 (https://phabricator.wikimedia.org/T389971) (owner: 10Bking) [13:51:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.407s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:51:55] (03PS1) 10Slyngshede: Revert "Netbox alerting: Add remaining netbox alert." [alerts] - 10https://gerrit.wikimedia.org/r/1131326 [13:53:00] (03CR) 10Kevin Bazira: [C:03+2] "谢谢 :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131315 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [13:53:45] (03PS1) 10Ilias Sarantopoulos: admin_ng: increase pod/container limitranges fo revision models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131327 (https://phabricator.wikimedia.org/T387019) [13:54:33] (03PS2) 10Ilias Sarantopoulos: api-gateway: change the edit check staging deployment model name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131323 (https://phabricator.wikimedia.org/T388269) [13:54:34] (03Merged) 10jenkins-bot: ml-services: update rrla staging image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131315 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [13:54:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10678516 (10phaultfinder) [13:54:40] (03PS3) 10Ilias Sarantopoulos: api-gateway: change the edit check staging deployment model name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131323 (https://phabricator.wikimedia.org/T388269) [13:55:39] (03CR) 
10Ayounsi: [C:03+1] Revert "Netbox alerting: Add remaining netbox alert." [alerts] - 10https://gerrit.wikimedia.org/r/1131326 (owner: 10Slyngshede) [13:56:35] (03CR) 10Slyngshede: [C:03+2] Revert "Netbox alerting: Add remaining netbox alert." [alerts] - 10https://gerrit.wikimedia.org/r/1131326 (owner: 10Slyngshede) [13:59:33] (03Merged) 10jenkins-bot: Revert "Netbox alerting: Add remaining netbox alert." [alerts] - 10https://gerrit.wikimedia.org/r/1131326 (owner: 10Slyngshede) [14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T1400) [14:00:15] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T389973#10678589 (10phaultfinder) [14:02:38] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:03:09] James_F: do you plan on using this wf services window? [14:03:16] hnowlan: Yes. [14:03:20] ack [14:03:25] hnowlan: But if you're going on something different it shouldn't conflict. [14:04:16] James_F: I'd like to repool codfw - shouldn't have much impact on you [14:04:20] Ack. [14:05:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet [14:05:54] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10678669 (10phaultfinder) [14:06:00] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10678670 (10fnegri) > @dcaro is concerned about active NFS mounts from pods, those might require a restart of the NFS server (unless Puppet is doin... 
[14:06:08] (03PS23) 10Tiziano Fogli: sre.puppet.sync-netbox-hiera: add data::pdus to exports [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) [14:06:24] !log hnowlan@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site codfw [reason: Datacentre switchover repool, T385155] [14:06:29] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 [14:06:32] (03Abandoned) 10Jforrester: wikifunctions: Update orchestrator from 2025-03-19-203723 to 2025-03-25-145119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131318 (https://phabricator.wikimedia.org/T386426) (owner: 10Jforrester) [14:06:32] elukey: looking at the CR shortly [14:06:36] (03CR) 10Daphne Smit: [C:03+2] wikifunctions: Update orchestrator from 2025-03-19-203723 to 2025-03-25-145119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131305 (https://phabricator.wikimedia.org/T386426) (owner: 10Cory Massaro) [14:06:40] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site codfw [reason: Datacentre switchover repool, T385155] [14:07:44] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-03-19-203723 to 2025-03-25-145119 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131305 (https://phabricator.wikimedia.org/T386426) (owner: 10Cory Massaro) [14:08:01] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti-test2003.codfw.wmnet [14:08:12] !log hnowlan@cumin1002 START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: Datacentre switchover repool - T385155 [14:08:41] (03PS4) 10Brouberol: global_config: export the IPs of the mariadb es servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1131328 (https://phabricator.wikimedia.org/T390059) [14:09:58] !log daphnesmit@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:10:58] !log 
daphnesmit@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:12:15] (03CR) 10Brouberol: [C:03+1] [airflow] - Increase the limit on the maximum number of mapped tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131036 (https://phabricator.wikimedia.org/T389773) (owner: 10Btullis) [14:12:38] (03CR) 10Btullis: [C:03+2] [airflow] - Increase the limit on the maximum number of mapped tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131036 (https://phabricator.wikimedia.org/T389773) (owner: 10Btullis) [14:12:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2281.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:12:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2290.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:13:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2281.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:13:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2278.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:13:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2290.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:13:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2278.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:13:29] !log daphnesmit@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:14:15] (03Merged) 10jenkins-bot: [airflow] - Increase the limit on the maximum number of mapped tasks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131036 (https://phabricator.wikimedia.org/T389773) (owner: 10Btullis) [14:14:22] !log daphnesmit@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:14:36] 
10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10678734 (10phaultfinder) [14:14:50] !log daphnesmit@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:15:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2278.codfw.wmnet with OS bookworm [14:15:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10678739 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2278.codfw.wmnet with... [14:15:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2281.codfw.wmnet with OS bookworm [14:15:37] !log daphnesmit@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:15:42] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10678741 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2281.codfw.wmnet with... [14:15:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2290.codfw.wmnet with OS bookworm [14:15:57] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10678744 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2290.codfw.wmnet with... [14:18:46] All done at our end. [14:19:19] grand, thanks [14:19:35] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10678758 (10dcaro) >>! 
In T383723#10678670, @fnegri wrote: >> @dcaro is concerned about active NFS mounts from pods, those might require a restart... [14:22:50] (03CR) 10Btullis: global_config: export the IPs of the mariadb es servers in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131328 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [14:23:48] (03PS1) 10DCausse: cirrus: add extra opensearch cluster in the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131333 (https://phabricator.wikimedia.org/T389971) [14:23:49] (03PS1) 10DCausse: cirrus: allow writing to eqiad-opensearch in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131334 (https://phabricator.wikimedia.org/T389971) [14:23:51] (03PS1) 10DCausse: cirrus: use only deployment-cirrussearch*.deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131335 (https://phabricator.wikimedia.org/T389971) [14:24:35] (03CR) 10CI reject: [V:04-1] cirrus: add extra opensearch cluster in the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131333 (https://phabricator.wikimedia.org/T389971) (owner: 10DCausse) [14:24:46] (03CR) 10CI reject: [V:04-1] cirrus: allow writing to eqiad-opensearch in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131334 (https://phabricator.wikimedia.org/T389971) (owner: 10DCausse) [14:26:05] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T389973#10678795 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated power cord on pdu side. alert cleared. 
[14:27:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2278.codfw.wmnet with reason: host reimage [14:27:00] 10ops-codfw, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390008#10678807 (10phaultfinder) [14:27:03] 10ops-codfw, 06DC-Ops: OutboundInterfaceErrors - https://phabricator.wikimedia.org/T390062 (10phaultfinder) 03NEW [14:27:13] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10678812 (10fnegri) > It's not that the service not need restarting, but if there's any processes (ex. pods) that have a file open from before the... [14:27:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2290.codfw.wmnet with reason: host reimage [14:27:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2281.codfw.wmnet with reason: host reimage [14:28:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:28:55] 10ops-eqiad, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390064 (10phaultfinder) 03NEW [14:30:22] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in codfw: Datacentre switchover repool - T385155 [14:30:26] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 [14:30:49] (03CR) 10Federico Ceratto: [C:03+1] "Ok, merging." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1131266 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [14:30:50] (03CR) 10Federico Ceratto: [C:03+2] clone.py: Retry fetching remote host [cookbooks] - 10https://gerrit.wikimedia.org/r/1131266 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [14:31:07] (03CR) 10Federico Ceratto: [C:03+1] clone.py: Ask for depooling only when needed [cookbooks] - 10https://gerrit.wikimedia.org/r/1131272 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [14:31:11] (03PS2) 10DCausse: cirrus: use search-psi to point to opensearch cluster in the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131333 (https://phabricator.wikimedia.org/T389971) [14:31:15] (03PS2) 10DCausse: cirrus: allow writing to eqiad-opensearch in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131334 (https://phabricator.wikimedia.org/T389971) [14:31:19] (03PS2) 10DCausse: cirrus: use only deployment-cirrussearch*.deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131335 (https://phabricator.wikimedia.org/T389971) [14:31:23] (03CR) 10Federico Ceratto: [C:03+2] clone.py: Ask for depooling only when needed [cookbooks] - 10https://gerrit.wikimedia.org/r/1131272 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [14:31:27] (03CR) 10Fabfur: [C:03+1] "all good on my side" [puppet] - 10https://gerrit.wikimedia.org/r/1131300 (https://phabricator.wikimedia.org/T390029) (owner: 10Elukey) [14:31:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2278.codfw.wmnet with reason: host reimage [14:31:57] (03PS1) 10Ahmon Dancy: data.yaml: Fix journalctl access to spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1131338 (https://phabricator.wikimedia.org/T383945) [14:32:02] (03CR) 10Federico Ceratto: [C:03+1] clone.py, clone_test.py: Check if the target host is 
known to dbctl [cookbooks] - 10https://gerrit.wikimedia.org/r/1127071 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [14:32:06] (03CR) 10Federico Ceratto: [C:03+2] clone.py, clone_test.py: Check if the target host is known to dbctl [cookbooks] - 10https://gerrit.wikimedia.org/r/1127071 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [14:32:59] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for acooper - https://phabricator.wikimedia.org/T389924#10678879 (10Scott_French) Great, thank you both! Dropping the ticket field sounds like a solid solution. [14:33:43] (03CR) 10Thcipriani: [C:03+1] "More context on deployment and testing here:" [puppet] - 10https://gerrit.wikimedia.org/r/1130715 (owner: 10Ahmon Dancy) [14:33:59] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:34:35] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:34:37] (03PS1) 10Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131339 [14:34:56] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, merging" [puppet] - 10https://gerrit.wikimedia.org/r/1131338 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [14:34:58] (03CR) 10Muehlenhoff: [C:03+2] data.yaml: Fix journalctl access to spiderpig services [puppet] - 10https://gerrit.wikimedia.org/r/1131338 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [14:35:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [14:35:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2290.codfw.wmnet with reason: host reimage [14:36:58] (03Merged) 10jenkins-bot: clone.py: Retry fetching remote host [cookbooks] - 10https://gerrit.wikimedia.org/r/1131266 
(https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [14:38:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2281.codfw.wmnet with reason: host reimage [14:39:53] (03PS1) 10Bking: deployment-prep: set correct opensearch version for new cirrussearch servers [puppet] - 10https://gerrit.wikimedia.org/r/1131340 (https://phabricator.wikimedia.org/T389971) [14:40:06] (03CR) 10Muehlenhoff: [C:03+1] "JFTR, we'll make this the default, but merging this as an interim fix." [puppet] - 10https://gerrit.wikimedia.org/r/1131026 (https://phabricator.wikimedia.org/T389869) (owner: 10Ahmon Dancy) [14:40:09] (03CR) 10Muehlenhoff: [C:03+2] P:idp Limit groups sent from CAS to Spiderpig [puppet] - 10https://gerrit.wikimedia.org/r/1131026 (https://phabricator.wikimedia.org/T389869) (owner: 10Ahmon Dancy) [14:40:35] (03CR) 10DCausse: [C:03+1] deployment-prep: set correct opensearch version for new cirrussearch servers [puppet] - 10https://gerrit.wikimedia.org/r/1131340 (https://phabricator.wikimedia.org/T389971) (owner: 10Bking) [14:41:15] (03CR) 10Bking: [C:03+2] deployment-prep: set correct opensearch version for new cirrussearch servers [puppet] - 10https://gerrit.wikimedia.org/r/1131340 (https://phabricator.wikimedia.org/T389971) (owner: 10Bking) [14:46:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [14:46:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti-test2003.codfw.wmnet [14:46:29] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:47:29] FIRING: [2x] ProbeDown: Service logstash1030:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1030:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:47:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2003.codfw.wmnet with OS bookworm [14:47:52] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515#10678952 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti-test2003.codfw.wmnet with OS bookworm [14:50:44] (03PS3) 10Federico Ceratto: clone.py: Ask for depooling only when needed [cookbooks] - 10https://gerrit.wikimedia.org/r/1131272 (https://phabricator.wikimedia.org/T390025) [14:51:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:53:24] (03CR) 10Hnowlan: [C:03+1] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131339 (owner: 10Muehlenhoff) [14:54:40] RESOLVED: [2x] ProbeDown: Service logstash1030:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1030:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:56:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 17.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:56:43] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:57:09] big spike of parsoidCachePrewarm jobs in codfw [14:58:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.79s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:01:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 10.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:03:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.79s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:04:22] p99 still a bit high, but the prewarm jobs are gradually improving [15:07:48] (03CR) 10Federico Ceratto: [C:03+2] clone.py: Ask for depooling only when needed [cookbooks] - 10https://gerrit.wikimedia.org/r/1131272 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [15:09:05] 06SRE, 10LDAP-Access-Requests: Disable BarryTheBrowserTestBot LDAP account - https://phabricator.wikimedia.org/T388662#10679073 (10hashar) 05Declined→03Open > For disabling an account and on checking internally with SRE, there is no formal process 
currently in place. I know the IDM can block a user, it is... [15:09:07] (03CR) 10Federico Ceratto: [V:03+2 C:03+2] clone.py: Ask for depooling only when needed [cookbooks] - 10https://gerrit.wikimedia.org/r/1131272 (https://phabricator.wikimedia.org/T390025) (owner: 10Federico Ceratto) [15:11:54] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-test2003.codfw.wmnet with reason: host reimage [15:15:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-test2003.codfw.wmnet with reason: host reimage [15:15:41] (03CR) 10Brouberol: [V:03+1] global_config: export the IPs of the mariadb es servers in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131328 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [15:16:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:16:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2281.codfw.wmnet with OS bookworm [15:16:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:16:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2278.codfw.wmnet with OS bookworm [15:16:22] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10679095 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2281.codfw.wmnet with OS... 
[15:16:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:16:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2290.codfw.wmnet with OS bookworm [15:16:24] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10679096 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2278.codfw.wmnet with OS... [15:16:27] (03PS5) 10Brouberol: global_config: export the IPs of the mariadb es servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1131328 (https://phabricator.wikimedia.org/T390059) [15:16:39] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10679098 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2290.codfw.wmnet with OS... [15:16:56] (03PS6) 10Brouberol: global_config: export the IPs of the mariadb es servers [puppet] - 10https://gerrit.wikimedia.org/r/1131328 (https://phabricator.wikimedia.org/T390059) [15:16:56] (03CR) 10Brouberol: global_config: export the IPs of the mariadb es servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131328 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [15:17:48] 06SRE, 10LDAP-Access-Requests: Disable BarryTheBrowserTestBot LDAP account - https://phabricator.wikimedia.org/T388662#10679108 (10Pppery) Anyone in the list of people at https://github.com/wikimedia/operations-puppet/blob/a8e90bc9358f3c0d53c567ae3ba62903a1b400f7/modules/idm/templates/idm-django-settings.erb#L... 
[15:19:57] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5161/co" [puppet] - 10https://gerrit.wikimedia.org/r/1131328 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [15:20:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2295.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:20:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2294.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:20:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2295.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:20:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2293.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:20:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2294.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:20:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2293.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:21:56] !log dancy@deploy1003 Installing scap version "4.144.2" for 2 host(s) [15:22:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2293.codfw.wmnet with OS bookworm [15:22:13] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10679133 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2293.codfw.wmnet with... 
[15:22:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2294.codfw.wmnet with OS bookworm [15:22:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2295.codfw.wmnet with OS bookworm [15:22:24] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10679134 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2294.codfw.wmnet with... [15:22:31] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10679136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2295.codfw.wmnet with... [15:23:24] (03PS7) 10Brouberol: global_config: export the IPs of the mariadb es servers [puppet] - 10https://gerrit.wikimedia.org/r/1131328 (https://phabricator.wikimedia.org/T390059) [15:23:42] !log dancy@deploy1003 Installation of scap version "4.144.2" completed for 2 hosts [15:24:19] !log installing Exim security updates [15:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:15] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5162/co" [puppet] - 10https://gerrit.wikimedia.org/r/1131328 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [15:27:19] (03CR) 10Joal: "Two questions :)" [puppet] - 10https://gerrit.wikimedia.org/r/1131300 (https://phabricator.wikimedia.org/T390029) (owner: 10Elukey) [15:29:29] (03CR) 10Klausman: [C:03+1] admin_ng: increase pod/container limitranges fo revision models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131327 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias 
Sarantopoulos) [15:31:21] (03PS2) 10Elukey: benthos: update the webrequest_live instance [puppet] - 10https://gerrit.wikimedia.org/r/1131300 (https://phabricator.wikimedia.org/T390029) [15:31:22] (03CR) 10Elukey: benthos: update the webrequest_live instance (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1131300 (https://phabricator.wikimedia.org/T390029) (owner: 10Elukey) [15:32:03] (03CR) 10Elukey: "Perfect, abandoning :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130574 (owner: 10Elukey) [15:32:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-test2003.codfw.wmnet with OS bookworm [15:32:05] (03Abandoned) 10Elukey: api-gateway: set the rate-limiter's timeout to ms [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130574 (owner: 10Elukey) [15:32:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515#10679202 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti-test2003.codfw.wmnet with OS bookworm completed: - ganeti-test2003 (*... 
[15:33:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2293.codfw.wmnet with reason: host reimage [15:33:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2294.codfw.wmnet with reason: host reimage [15:33:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2295.codfw.wmnet with reason: host reimage [15:35:42] (03PS1) 10Brouberol: mediawiki-dumps-legacy: enable egress to the maariadb eqiad external storage hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131347 (https://phabricator.wikimedia.org/T390059) [15:36:11] (03PS2) 10Jdlrobson: Web features should not be ambiguously configured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130771 (https://phabricator.wikimedia.org/T388445) [15:36:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2293.codfw.wmnet with reason: host reimage [15:37:36] (03PS1) 10Jforrester: Instead of calling deprecated parserOptions(), parse content ourselves [extensions/FundraiserLandingPage] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131348 (https://phabricator.wikimedia.org/T390032) [15:37:52] (03CR) 10Btullis: [C:03+1] global_config: export the IPs of the mariadb es servers [puppet] - 10https://gerrit.wikimedia.org/r/1131328 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [15:38:29] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve-ctrl2001.codfw.wmnet with OS bookworm [15:39:34] (03CR) 10Btullis: mediawiki-dumps-legacy: enable egress to the maariadb eqiad external storage hosts (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131347 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [15:40:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2294.codfw.wmnet with reason: host 
reimage [15:40:56] (03CR) 10RLazarus: [C:03+2] httpbb: Test /view/fr/Z1 case-insensitively [puppet] - 10https://gerrit.wikimedia.org/r/1131101 (https://phabricator.wikimedia.org/T383032) (owner: 10RLazarus) [15:42:14] (03PS1) 10Volans: redfish: wait few seconds in scp_dump [software/spicerack] - 10https://gerrit.wikimedia.org/r/1131352 [15:42:20] (03CR) 10Jforrester: [C:03+1] httpbb: Test /view/fr/Z1 case-insensitively [puppet] - 10https://gerrit.wikimedia.org/r/1131101 (https://phabricator.wikimedia.org/T383032) (owner: 10RLazarus) [15:43:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2295.codfw.wmnet with reason: host reimage [15:44:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2003.codfw.wmnet to cluster codfw_test and group A-test [15:44:35] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti-test2003.codfw.wmnet to cluster codfw_test and group A-test [15:45:12] (03CR) 10Elukey: [C:03+1] redfish: wait few seconds in scp_dump [software/spicerack] - 10https://gerrit.wikimedia.org/r/1131352 (owner: 10Volans) [15:47:55] (03PS2) 10Brouberol: mediawiki-dumps-legacy: enable egress to the maariadb eqiad external storage hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131347 (https://phabricator.wikimedia.org/T390059) [15:47:55] (03PS1) 10Brouberol: Duplicate external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131354 (https://phabricator.wikimedia.org/T390059) [15:47:57] (03PS1) 10Brouberol: modules/base/external-services-networkpolicy: allow the override of the whole selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131355 (https://phabricator.wikimedia.org/T390059) [15:48:16] (03PS1) 10Jakob: Configure virtual terms db for wikidata prod & test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131353 (https://phabricator.wikimedia.org/T389190) [15:48:31] 
(03CR) 10Elukey: [C:03+1] Switch new ganeti servers to use EFI [puppet] - 10https://gerrit.wikimedia.org/r/1131293 (https://phabricator.wikimedia.org/T384838) (owner: 10Muehlenhoff) [15:48:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [15:49:23] (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: export the IPs of the mariadb es servers [puppet] - 10https://gerrit.wikimedia.org/r/1131328 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [15:49:32] (03PS2) 10Brouberol: modules/base/external-services-networkpolicy: allow the override of the whole selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131355 (https://phabricator.wikimedia.org/T390059) [15:49:32] (03PS3) 10Brouberol: mediawiki-dumps-legacy: enable egress to the maariadb eqiad external storage hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131347 (https://phabricator.wikimedia.org/T390059) [15:49:46] !log kartik@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[15:51:50] (03PS4) 10Brouberol: mediawiki-dumps-legacy: enable egress to the maariadb eqiad external storage hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131347 (https://phabricator.wikimedia.org/T390059) [15:52:05] (03CR) 10Brouberol: "Thanks" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131347 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [15:52:18] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:52:30] (03PS2) 10Ilias Sarantopoulos: admin_ng: increase pod/container limitranges for revision models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131327 (https://phabricator.wikimedia.org/T387019) [15:53:17] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve-ctrl2001.codfw.wmnet with reason: host reimage [15:53:25] (03CR) 10Ilias Sarantopoulos: [C:03+2] admin_ng: increase pod/container limitranges for revision models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131327 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [15:53:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [15:54:06] (03CR) 10Btullis: [C:03+1] "Nice. Thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1131347 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [15:54:26] (03PS2) 10Clément Goubert: team-sre: Add mw-cron alerting [alerts] - 10https://gerrit.wikimedia.org/r/1131356 (https://phabricator.wikimedia.org/T385709) [15:54:31] 06SRE, 10Bitu, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Disable BarryTheBrowserTestBot LDAP account - https://phabricator.wikimedia.org/T388662#10679315 (10bd808) 05Open→03In progress a:03bd808 [15:56:26] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve-ctrl2001.codfw.wmnet with reason: host reimage [15:57:09] (03PS6) 10Bking: elastic/cirrussearch: begin production migration [puppet] - 10https://gerrit.wikimedia.org/r/1131087 (https://phabricator.wikimedia.org/T388610) [15:57:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2003.codfw.wmnet to cluster codfw_test and group A-test [15:57:50] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:58:00] (03CR) 10Bking: "PCC is failing, but it's only on hosts that have been inactive for a loooong time. So we can ignore this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1131087 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [15:58:07] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti-test2003.codfw.wmnet to cluster codfw_test and group A-test [15:58:21] (03CR) 10Brouberol: [C:03+2] Duplicate external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131354 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [15:58:24] (03CR) 10Brouberol: [C:03+2] modules/base/external-services-networkpolicy: allow the override of the whole selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131355 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [15:58:26] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: enable egress to the maariadb eqiad external storage hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131347 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [15:58:36] Hello SRE team - for some time now you've had 2 datasources in Druid to explore Webrequest: webrequest_sampled_128 (the old one) and webrequest_sampled_live (the new one). We're confident now that the new one does the job, and we're removing the old one. If you have dashboards/tools still using the `webrequest_sampled_128` datasource, please switch to using `webrequest_sampled_live`. Thank you! 
[15:58:42] (ticket for reference: https://phabricator.wikimedia.org/T385198) [15:59:02] (03PS2) 10Scott French: deployment_server: Default to PHP 8.1 in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1131351 (https://phabricator.wikimedia.org/T387917) [15:59:28] joal: o/ better to post to #wikimedia-sre, this chan is more noisy and people may not read the msg [15:59:35] (03Merged) 10jenkins-bot: admin_ng: increase pod/container limitranges for revision models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131327 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [15:59:58] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T390077 (10phaultfinder) 03NEW [16:00:15] (03CR) 10Scott French: "Thanks in advance for the review, Reuven, and for the idea to add the help messages. Take your time, as I clearly won't merge this until t" [puppet] - 10https://gerrit.wikimedia.org/r/1131351 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [16:00:27] (03PS3) 10Filippo Giunchedi: benthos: update the webrequest_live instance [puppet] - 10https://gerrit.wikimedia.org/r/1131300 (https://phabricator.wikimedia.org/T390029) (owner: 10Elukey) [16:00:36] (03CR) 10BCornwall: [C:03+2] upgrade cp5032 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1130746 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:00:41] (03CR) 10Muehlenhoff: [C:03+2] Switch new ganeti servers to use EFI [puppet] - 10https://gerrit.wikimedia.org/r/1131293 (https://phabricator.wikimedia.org/T384838) (owner: 10Muehlenhoff) [16:00:49] (03CR) 10Alexandros Kosiaris: [C:03+2] profile::keyholder::server::agents: Add deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1130715 (owner: 10Ahmon Dancy) [16:00:49] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, I did just a minor edit in PS3" [puppet] - 10https://gerrit.wikimedia.org/r/1131300 (https://phabricator.wikimedia.org/T390029) (owner: 10Elukey) 
[16:00:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:00:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2294.codfw.wmnet with OS bookworm [16:00:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:00:56] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10679398 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2294.codfw.wmnet with OS... [16:00:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2293.codfw.wmnet with OS bookworm [16:01:00] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:01:02] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10679399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2293.codfw.wmnet with OS... 
[16:01:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:01:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2295.codfw.wmnet with OS bookworm [16:01:24] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10679400 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2295.codfw.wmnet with OS... [16:02:07] thanks for the pointer elukey :) [16:02:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2296.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:02:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2297.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:02:44] (03PS1) 10Ebernhardson: Move cirrus traffic to eqiad for platform upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131359 [16:02:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2298.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:02:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2296.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:03:05] 06SRE, 10Bitu, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Disable BarryTheBrowserTestBot LDAP account - https://phabricator.wikimedia.org/T388662#10679406 (10bd808) >>! In T388662#10641905, @ssingh wrote: > For disabling an account and on checking internally with SRE, there is no formal process... 
[16:03:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2297.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:03:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 6.176% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:03:49] akosiaris: I'll merge your keyholder patch along, ok? [16:04:04] (03Merged) 10jenkins-bot: Duplicate external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131354 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [16:04:05] (03Merged) 10jenkins-bot: modules/base/external-services-networkpolicy: allow the override of the whole selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131355 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [16:04:10] moritzm: yes please [16:04:18] I was waiting for the lock to be released [16:04:21] (03PS2) 10Ebernhardson: Move cirrus traffic to eqiad for platform upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131359 (https://phabricator.wikimedia.org/T388610) [16:04:24] looks like you beat me to it [16:04:26] (03PS1) 10Tiziano Fogli: prometheus instances_cleanup: move assert that requires lvm vg0 [puppet] - 10https://gerrit.wikimedia.org/r/1131360 [16:04:31] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: enable egress to the maariadb eqiad external storage hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131347 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [16:04:40] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:05:09] !log kartik@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [16:05:15] FIRING: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:05:27] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10679420 (10Jhancock.wm) [16:05:47] we practically merged at the same time :-) both patches are puppet-merged now [16:05:49] (03PS1) 10DLynch: Edit check: in single action mode the fixed sidebar isn't allowed null offset [extensions/VisualEditor] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131361 (https://phabricator.wikimedia.org/T389906) [16:05:51] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus instances_cleanup: move assert that requires lvm vg0 [puppet] - 10https://gerrit.wikimedia.org/r/1131360 (owner: 10Tiziano Fogli) [16:06:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/VisualEditor] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131361 (https://phabricator.wikimedia.org/T389906) (owner: 10DLynch) [16:06:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 14.16s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:06:23] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10679423 (10MoritzMuehlenhoff) [16:06:34] !log brouberol@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [16:06:56] !log brouberol@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:07:26] (03CR) 10DCausse: [C:03+1] Move cirrus traffic to eqiad for platform upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131359 (https://phabricator.wikimedia.org/T388610) (owner: 10Ebernhardson) [16:07:29] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:08:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 13.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:08:15] (03PS1) 10Brouberol: Fix typo lost in rebase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131362 (https://phabricator.wikimedia.org/T390059) [16:08:20] !log Importing varnishkafka 1.2.0-1 into bullseye-wikimedia component/varnish-staging (T389978) [16:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:25] T389978: varnishkafka 1.1.0-5 exits on SIGHUP - https://phabricator.wikimedia.org/T389978 [16:08:32] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp5032.eqsin.wmnet} and A:cp [16:08:32] (03Abandoned) 
10Brouberol: Fix typo lost in rebase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131362 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [16:08:39] (03CR) 10Filippo Giunchedi: "LGTM, I'll let netbox maintainers/experts vote tho" [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [16:09:47] (03PS2) 10Volans: context managers: combine them when feasible [cookbooks] - 10https://gerrit.wikimedia.org/r/1130676 [16:09:47] (03PS3) 10Volans: sre.mysql.upgrade: wait to remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1130977 [16:09:47] (03PS1) 10Volans: sre.ganeti.addnode: confirm verify on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1131363 (https://phabricator.wikimedia.org/T309724) [16:09:59] (03PS2) 10Volans: sre.ganeti.addnode: confirm verify on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1131363 (https://phabricator.wikimedia.org/T309724) [16:10:10] (03PS4) 10Volans: sre.mysql.upgrade: wait to remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1130977 [16:10:15] RESOLVED: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:10:27] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:10:38] !log elukey@puppetserver1001 conftool action : set/pooled=no; selector: name=ml-serve-ctrl2001.codfw.wmnet,dc=codfw,service=ml-ctrl [16:10:41] (03CR) 10Volans: [C:03+2] redfish: wait few seconds in scp_dump [software/spicerack] - 10https://gerrit.wikimedia.org/r/1131352 (owner: 10Volans) [16:10:50] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[16:11:03] !log elukey@puppetserver1001 conftool action : set/pooled=no; selector: name=ml-serve-ctrl2001.codfw.wmnet [16:11:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 14.55s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:11:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [16:11:25] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [16:12:32] !log kartik@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [16:12:44] !log Rolling out varnishkafka 1.2.0-1 to esams, ulsfo, eqsin, and magru [16:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:51] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, neat!" [alerts] - 10https://gerrit.wikimedia.org/r/1131356 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [16:12:53] !log Rolling out varnishkafka 1.2.0-1 to esams, ulsfo, eqsin, and magru (T389978) [16:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2298.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:13:22] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve-ctrl2001.codfw.wmnet with OS bookworm [16:13:42] !log kartik@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[16:13:57] (03PS1) 10Kimberly Sarabia: Set wgMinervaDonateBanner to default base true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131365 (https://phabricator.wikimedia.org/T388438) [16:14:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131365 (https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [16:15:04] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp5032.eqsin.wmnet} and A:cp [16:17:15] (03CR) 10Stoyofuku-wmf: [C:03+1] "exciting!!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131365 (https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [16:17:26] (03PS1) 10BCornwall: upgrade cp6001 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131367 (https://phabricator.wikimedia.org/T378737) [16:17:27] (03PS1) 10BCornwall: upgrade cp6002 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131368 (https://phabricator.wikimedia.org/T378737) [16:17:28] (03PS1) 10BCornwall: upgrade cp6003 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131369 (https://phabricator.wikimedia.org/T378737) [16:17:30] (03PS1) 10BCornwall: upgrade cp6004 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131370 (https://phabricator.wikimedia.org/T378737) [16:17:31] (03PS1) 10BCornwall: upgrade cp6005 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131371 (https://phabricator.wikimedia.org/T378737) [16:17:33] (03PS1) 10BCornwall: upgrade cp6006 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131372 (https://phabricator.wikimedia.org/T378737) [16:17:37] (03PS1) 10BCornwall: upgrade cp6007 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131373 (https://phabricator.wikimedia.org/T378737) [16:17:41] (03PS1) 10BCornwall: 
upgrade cp6008 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131374 (https://phabricator.wikimedia.org/T378737) [16:17:45] (03PS1) 10BCornwall: upgrade cp6009 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131375 (https://phabricator.wikimedia.org/T378737) [16:17:47] FIRING: HelmReleaseBadStatus: Helm release recommendation-api-ng/main on k8s-mlserve@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlserve&var-namespace=recommendation-api-ng - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:17:49] (03PS1) 10BCornwall: upgrade cp6010 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131376 (https://phabricator.wikimedia.org/T378737) [16:17:54] (03PS1) 10BCornwall: upgrade cp6011 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131377 (https://phabricator.wikimedia.org/T378737) [16:17:57] (03PS1) 10BCornwall: upgrade cp6012 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131378 (https://phabricator.wikimedia.org/T378737) [16:18:02] (03PS1) 10BCornwall: upgrade cp6013 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131379 (https://phabricator.wikimedia.org/T378737) [16:18:06] (03PS1) 10BCornwall: upgrade cp6014 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131380 (https://phabricator.wikimedia.org/T378737) [16:18:10] (03PS1) 10BCornwall: upgrade cp6015 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131381 (https://phabricator.wikimedia.org/T378737) [16:18:14] (03PS1) 10BCornwall: upgrade cp6016 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131382 (https://phabricator.wikimedia.org/T378737) [16:18:48] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10679496 (10RobH) > We'll be installing the new optics into 
the original ports, and removing the old optics and patch. > > So please remove the optic patch D0100B and the op... [16:19:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515#10679499 (10MoritzMuehlenhoff) [16:19:05] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=ml-serve-ctrl2001.codfw.wmnet,dc=codfw,service=ml-ctrl [16:19:58] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515#10679500 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All done! As part of the process ganeti-test2001 was also switched to UEFI, so that I could test t... [16:20:21] (03PS1) 10BPirkle: REST: Enable REST Sandbox on an initial set of production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131384 (https://phabricator.wikimedia.org/T389407) [16:20:26] (03PS2) 10Tiziano Fogli: prometheus instances_cleanup: move assert that requires lvm vg0 [puppet] - 10https://gerrit.wikimedia.org/r/1131360 [16:20:32] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131363 (https://phabricator.wikimedia.org/T309724) (owner: 10Volans) [16:20:35] (03Merged) 10jenkins-bot: redfish: wait few seconds in scp_dump [software/spicerack] - 10https://gerrit.wikimedia.org/r/1131352 (owner: 10Volans) [16:23:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2298.codfw.wmnet with OS bookworm [16:23:17] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10679513 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2298.codfw.wmnet with... 
[16:23:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2297.codfw.wmnet with OS bookworm [16:23:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10679517 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2297.codfw.wmnet with... [16:23:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2296.codfw.wmnet with OS bookworm [16:23:51] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10679518 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2296.codfw.wmnet with... [16:23:57] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:24:11] looking [16:24:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 2.632% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:24:21] erm [16:24:33] o/ [16:24:35] There was a parsoidPrewarm surge earlier [16:25:15] same timing as -1day? 
[16:25:29] err no, sorry, -20 mins [16:25:32] yeah, also a shark-fin requests graph at mw-web at 16:11 which might or might not be connected https://grafana.wikimedia.org/goto/NxiB5coHg?orgId=1 [16:25:47] No, a lot earlier and in codfw [16:25:51] So I don't think it's that [16:26:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 21.85s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:26:22] mw-parsoid saw a latency spike at 16:00 which cleared at 16:06, then a worse one at 16:21 which is ongoing [16:26:32] (PS1) Muehlenhoff: Create insetup role for ML servers with nftables and rename existing one [puppet] - https://gerrit.wikimedia.org/r/1131385 (https://phabricator.wikimedia.org/T389825) [16:26:36] big spike in open connections to MySQL as well [16:27:17] (PS3) Volans: sre.ganeti.addnode: confirm verify on failure [cookbooks] - https://gerrit.wikimedia.org/r/1131363 (https://phabricator.wikimedia.org/T309724) [16:27:34] I don't think it's caused by MySQL though, more the other way around [16:27:47] RESOLVED: HelmReleaseBadStatus: Helm release recommendation-api-ng/main on k8s-mlserve@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlserve&var-namespace=recommendation-api-ng - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:27:53] SRE, Bitu, Infrastructure-Foundations, LDAP-Access-Requests: Disable BarryTheBrowserTestBot LDAP account - https://phabricator.wikimedia.org/T388662#10679537 (bd808) In progress→Resolved
https://idm.wikimedia.org/wikimedia/block/barrybrowsertestbot/log {F58926043, size=full} [16:28:05] SRE, WMF-General-or-Unknown, Performance Issue, Wikimedia-production-error: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" errors - https://phabricator.wikimedia.org/T389734#10679542 (Pppery) Random guess that came to mind: Is it possible that something starts a crit... [16:28:08] (CR) Muehlenhoff: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1131363 (https://phabricator.wikimedia.org/T309724) (owner: Volans) [16:28:29] it's not the cache prewarm [16:28:31] yeah this smells jobqueue-y to me, digging a little [16:28:40] !log kartik@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [16:28:42] no insertion spike [16:28:46] cacheprewarms are pretty high in codfw still [16:28:57] have been since ~14:20 [16:28:57] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:29:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 19.23% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:29:23] hnowlan: yeah just found the same [16:29:33] SRE, SRE-Access-Requests: Requesting deployment access for daphnesmit - https://phabricator.wikimedia.org/T388681#10679547 (Jdforrester-WMF) In progress→Resolved Confirmed with Daphne doing a deployment today that this is now fixed; thanks!
[16:29:53] although not that much higher than what looks like the normal rate in eqiad [16:30:09] they're not handled by parsoid in codfw though right [16:30:18] because mw-parsoid in codfw is doing nothing [16:30:49] yeah [16:31:03] (CR) Elukey: [C:+1] context managers: combine them when feasible [cookbooks] - https://gerrit.wikimedia.org/r/1130676 (owner: Volans) [16:31:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 13.85s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:31:20] numbers still look a bit off - we're still seeing pre-switchover levels in eqiad, but we're seeing 1.5x pre-switchover levels in codfw [16:31:27] hnowlan: is ~14:20 just the codfw repool?
[16:31:31] yeah [16:31:44] well, it's the trigger, but not an explanation of the numbers [16:31:47] yeah [16:32:11] (CR) Tiziano Fogli: [C:+2] prometheus instances_cleanup: move assert that requires lvm vg0 [puppet] - https://gerrit.wikimedia.org/r/1131360 (owner: Tiziano Fogli) [16:33:19] the p99 on parsoid is fishy [16:33:28] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=ml-serve-ctrl2001.codfw.wmnet [16:34:40] I know we're alerting on these p75 spikes, but a constant 25s+ p99 is unusual for this long [16:34:44] !log Updated recommendation-api-ng to 2025-03-25-091801-production (T306508) [16:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:48] T306508: ContentTranslation doesn't know that an article already exists in the Norwegian Bokmål Wikipedia - https://phabricator.wikimedia.org/T306508 [16:34:54] yeah, the latency heatmap's interesting, something definitely happened out at the tail when codfw repooled [16:35:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2296.codfw.wmnet with reason: host reimage [16:35:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2297.codfw.wmnet with reason: host reimage [16:38:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2296.codfw.wmnet with reason: host reimage [16:39:22] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10679643 (Krinkle) [16:40:45] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10679654 (phaultfinder) [16:42:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2297.codfw.wmnet with reason: host reimage [16:43:28]
10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main1001 / kafka-main1002 / kafka-main1003 / kafka-main1004 / kafka-main1005 - https://phabricator.wikimedia.org/T381593#10679661 (10VRiley-WMF) a:05VRiley-WMF→03jijiki [16:45:49] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10679669 (10phaultfinder) [16:46:53] (03CR) 10Jdlrobson: [C:04-1] Set wgMinervaDonateBanner to default base true (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131365 (https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [16:51:02] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [16:52:40] (03CR) 10Ssingh: [C:03+1] upgrade cp6001 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131367 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:52:42] (03CR) 10Ssingh: [C:03+1] upgrade cp6002 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131368 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:52:43] (03CR) 10Ssingh: [C:03+1] upgrade cp6003 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131369 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:52:45] (03CR) 10Ssingh: [C:03+1] upgrade cp6004 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131370 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:52:46] (03CR) 10Ssingh: [C:03+1] upgrade cp6005 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131371 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:52:51] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:52:54] (03CR) 10Ssingh: [C:03+1] upgrade cp6006 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131372 (https://phabricator.wikimedia.org/T378737) 
(owner: 10BCornwall) [16:52:58] (03CR) 10Ssingh: [C:03+1] upgrade cp6007 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131373 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:53:02] (03CR) 10Ssingh: [C:03+1] upgrade cp6008 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131374 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:53:06] (03CR) 10Ssingh: [C:03+1] upgrade cp6009 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131375 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:53:10] (03CR) 10Ssingh: [C:03+1] upgrade cp6010 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131376 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:53:14] (03CR) 10Ssingh: [C:03+1] upgrade cp6011 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131377 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:53:18] (03CR) 10Ssingh: [C:03+1] upgrade cp6012 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131378 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:53:22] (03CR) 10Ssingh: [C:03+1] upgrade cp6013 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131379 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:53:26] (03CR) 10Ssingh: [C:03+1] upgrade cp6014 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131380 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:53:30] (03CR) 10Ssingh: [C:03+1] upgrade cp6015 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131381 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:53:34] (03CR) 10Ssingh: [C:03+1] upgrade cp6016 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131382 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:55:36] (03PS1) 10Joal: Update analytics webrequest kafkatee [puppet] - 10https://gerrit.wikimedia.org/r/1131387 
(https://phabricator.wikimedia.org/T386177) [16:55:53] (03CR) 10BCornwall: [C:03+2] upgrade cp6001 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131367 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:56:08] (03PS2) 10BCornwall: upgrade cp6016 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131382 (https://phabricator.wikimedia.org/T378737) [16:56:12] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp6016 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131382 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:57:02] 07Puppet, 06SRE, 06Web-Team: Certain mobile devices including XiaoMi are not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10679754 (10Jdlrobson-WMF) p:05High→03Medium [16:57:25] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6016.drmrs.wmnet} and A:cp [16:57:26] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6001.drmrs.wmnet} and A:cp [16:58:04] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:59:47] (03PS1) 10Elukey: mapnik: upgrade to upstream 4.0.6 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131388 (https://phabricator.wikimedia.org/T389776) [17:00:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to 17.8 [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T1700) [17:02:51] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10679800 (10ayounsi) just got off the phone with the tech, I made a small mistake it was port 1 on cr1, 
so he called me to double check. He is going to do the patching, updat... [17:03:22] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6001.drmrs.wmnet} and A:cp [17:03:23] 07Puppet, 06SRE, 06Web-Team: Certain mobile devices are (possibly) not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10679816 (10Jdlrobson-WMF) 05In progress→03Stalled [17:03:58] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6016.drmrs.wmnet} and A:cp [17:04:35] 07Puppet, 06SRE, 06Web-Team: Certain mobile devices are (possibly) not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10679823 (10Jdlrobson-WMF) [17:04:52] 07Puppet, 06SRE, 06Web-Team: Certain mobile devices are (possibly) not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10679825 (10Jdlrobson-WMF) p:05Medium→03Low Lowering priority and stalling as there is nothing actionable here at this time and the numbers we saw do not... [17:05:23] 06SRE, 10Bitu, 06Infrastructure-Foundations, 10LDAP-Access-Requests: Disable BarryTheBrowserTestBot LDAP account - https://phabricator.wikimedia.org/T388662#10679836 (10hashar) Thank you @bd808 for the details and for the screenshot of the blocking logs! 
[17:05:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10679837 (10phaultfinder) [17:06:59] !log dancy@deploy1003 Installing scap version "4.144.3" for 2 host(s) [17:08:46] !log dancy@deploy1003 Installation of scap version "4.144.3" completed for 2 hosts [17:09:55] (03CR) 10Ottomata: Update analytics webrequest kafkatee (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131387 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [17:12:01] 06SRE, 10vm-requests: eqiad: 1 VMs requested for Data Persistence automation - https://phabricator.wikimedia.org/T390087 (10FCeratto-WMF) 03NEW [17:12:43] 06SRE, 10vm-requests: eqiad: 1 VMs requested for Data Persistence automation - https://phabricator.wikimedia.org/T390087#10679900 (10FCeratto-WMF) [17:12:48] !log restart keyholder-proxy.service on deploy1003, deploy2002 to pick up the spiderpig deployment group change [17:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:31] (03CR) 10Brouberol: [C:03+1] elastic/cirrussearch: begin production migration [puppet] - 10https://gerrit.wikimedia.org/r/1131087 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:13:36] 06SRE, 10vm-requests: eqiad: 1 VMs requested for Data Persistence automation - https://phabricator.wikimedia.org/T390087#10679901 (10FCeratto-WMF) [17:14:21] (03PS3) 10Jdlrobson: Web features should not be ambiguously configured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130771 (https://phabricator.wikimedia.org/T388445) [17:14:24] (03CR) 10Jdlrobson: Web features should not be ambiguously configured (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130771 (https://phabricator.wikimedia.org/T388445) (owner: 10Jdlrobson) [17:25:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10679947 (10phaultfinder) [17:25:55] (03CR) 10Volans: [C:03+2] sre.ganeti.addnode: confirm verify 
on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1131363 (https://phabricator.wikimedia.org/T309724) (owner: 10Volans) [17:27:37] !log dancy@deploy1003 Installing scap version "4.144.4" for 2 host(s) [17:29:24] !log dancy@deploy1003 Installation of scap version "4.144.4" completed for 2 hosts [17:32:14] (03CR) 10Btullis: elastic/cirrussearch: begin production migration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131087 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:33:58] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [17:34:02] (03Merged) 10jenkins-bot: sre.ganeti.addnode: confirm verify on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1131363 (https://phabricator.wikimedia.org/T309724) (owner: 10Volans) [17:34:44] (03CR) 10CI reject: [V:04-1] updating wikimaniawiki namespace configurations: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [17:34:51] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [17:35:36] (03CR) 10CI reject: [V:04-1] update wikimaniawiki perms configurations: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [17:36:47] (03CR) 10Ebernhardson: [C:03+1] cirrus: use search-psi to point to opensearch cluster in the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131333 (https://phabricator.wikimedia.org/T389971) (owner: 10DCausse) [17:36:53] I've got a script to run to move some pages on officewiki – is there anything important going on infrastructure-wise? Would it make sense to wait until 18:00 UTC? 
[17:37:03] it looks like mostly upgrades to scap from here, and I won't be running scap [17:42:06] (CR) BCornwall: [C:+2] upgrade cp6002 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1131368 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [17:42:24] (PS2) BCornwall: upgrade cp6015 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1131381 (https://phabricator.wikimedia.org/T378737) [17:42:30] (CR) BCornwall: [V:+2 C:+2] upgrade cp6015 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1131381 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [17:43:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2298.codfw.wmnet with OS bookworm [17:43:34] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10680037 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2298.codfw.wmnet with OS...
[17:44:46] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6015.drmrs.wmnet} and A:cp [17:44:47] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6002.drmrs.wmnet} and A:cp [17:48:28] (03PS1) 10Andrew Bogott: Update neutron policies [puppet] - 10https://gerrit.wikimedia.org/r/1131393 (https://phabricator.wikimedia.org/T389965) [17:49:01] (03CR) 10Andrew Bogott: [C:03+2] Update neutron policies [puppet] - 10https://gerrit.wikimedia.org/r/1131393 (https://phabricator.wikimedia.org/T389965) (owner: 10Andrew Bogott) [17:50:47] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6002.drmrs.wmnet} and A:cp [17:50:51] (03CR) 10Joal: Update analytics webrequest kafkatee (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131387 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [17:51:40] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6015.drmrs.wmnet} and A:cp [17:53:40] (03CR) 10Btullis: [C:03+2] Update analytics webrequest kafkatee [puppet] - 10https://gerrit.wikimedia.org/r/1131387 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [17:53:41] (03CR) 10Ottomata: Update analytics webrequest kafkatee (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131387 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [17:53:56] (03CR) 10Jgleeson: [C:03+1] "+2ed original patch. LGTM!" 
[extensions/FundraiserLandingPage] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131348 (https://phabricator.wikimedia.org/T390032) (owner: 10Jforrester) [17:55:15] zip: nothing major going on atm afaik [17:55:22] jouncebot: nowandnext [17:55:22] For the next 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T1700) [17:55:22] In 2 hour(s) and 4 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T2000) [17:55:35] in which case, "yolo", as the youths reportedly say [17:58:14] (03CR) 10BCornwall: "I've opened https://phabricator.wikimedia.org/T390094 for further discussion." [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 (owner: 10BCornwall) [17:59:40] FIRING: [2x] ProbeDown: Service logstash1030:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1030:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:07:29] RESOLVED: [2x] ProbeDown: Service logstash1030:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1030:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:12:59] 10ops-eqiad, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390064#10680205 (10phaultfinder) [18:14:46] (03PS1) 10DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131399 (https://phabricator.wikimedia.org/T381544) [18:16:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10680221 (10phaultfinder) [18:16:45] (03CR) 10DDesouza: [C:03+2] 
miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131399 (https://phabricator.wikimedia.org/T381544) (owner: 10DDesouza) [18:18:40] (03Merged) 10jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131399 (https://phabricator.wikimedia.org/T381544) (owner: 10DDesouza) [18:20:36] (03CR) 10Ebernhardson: [C:03+1] wdqs: enable hive/hdfs ingestion for rdf update streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131286 (https://phabricator.wikimedia.org/T388372) (owner: 10DCausse) [18:20:48] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [18:21:05] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [18:21:07] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [18:21:15] (03PS1) 10Btullis: Temporarily exclude an-worker1202 from HDFS and YARN [puppet] - 10https://gerrit.wikimedia.org/r/1131401 (https://phabricator.wikimedia.org/T390048) [18:21:27] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [18:21:28] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [18:21:46] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [18:24:21] (03PS2) 10Btullis: Temporarily exclude an-worker1202 from HDFS and YARN [puppet] - 10https://gerrit.wikimedia.org/r/1131401 (https://phabricator.wikimedia.org/T390048) [18:26:01] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1131401 (https://phabricator.wikimedia.org/T390048) (owner: 10Btullis) [18:28:36] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1131301 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [18:28:41] (03PS1) 10Joal: Update hadoop-test webrequest gobblin/purge jobs [puppet] - 10https://gerrit.wikimedia.org/r/1131405 (https://phabricator.wikimedia.org/T386177) [18:28:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:29:04] (03CR) 10CI reject: [V:04-1] Update hadoop-test webrequest gobblin/purge jobs [puppet] - 10https://gerrit.wikimedia.org/r/1131405 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [18:31:42] (03CR) 10BCornwall: [C:03+2] upgrade cp6003 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131369 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:31:53] (03PS2) 10BCornwall: upgrade cp6014 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131380 (https://phabricator.wikimedia.org/T378737) [18:32:01] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp6014 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131380 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:32:53] (03PS2) 10Joal: Update hadoop-test webrequest gobblin/purge jobs [puppet] - 10https://gerrit.wikimedia.org/r/1131405 (https://phabricator.wikimedia.org/T386177) [18:33:54] 10ops-eqiad, 06DC-Ops: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518#10680258 (10VRiley-WMF) 05In progress→03Resolved a:03VRiley-WMF This has been removed and completed [18:35:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:35:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2296.codfw.wmnet with OS bookworm [18:35:12] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [18:35:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2297.codfw.wmnet with OS bookworm [18:35:16] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10680263 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2296.codfw.wmnet with OS... [18:35:19] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10680264 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2297.codfw.wmnet with OS... [18:35:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2298.codfw.wmnet with OS bookworm [18:36:05] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10680265 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2298.codfw.wmnet with... 
[18:36:10] (03CR) 10Btullis: [C:03+1] elastic/cirrussearch: begin production migration [puppet] - 10https://gerrit.wikimedia.org/r/1131087 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [18:37:01] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6014.drmrs.wmnet} and A:cp [18:37:02] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6003.drmrs.wmnet} and A:cp [18:39:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2299.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:39:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2300.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:41:00] (03PS1) 10Zoe: Make officewiki readonly after moving flow pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131410 (https://phabricator.wikimedia.org/T380909) [18:43:03] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6003.drmrs.wmnet} and A:cp [18:43:44] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6014.drmrs.wmnet} and A:cp [18:46:12] (03CR) 10Stoyofuku-wmf: "Quick note about minerva features, then we should be ready!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130771 (https://phabricator.wikimedia.org/T388445) (owner: 10Jdlrobson) [18:48:55] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10680288 (10Krinkle) [18:49:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2299.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:49:28] (03CR) 10Stoyofuku-wmf: "my bad for missing that" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131365 (https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [18:49:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2300.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:02:02] _phew_! I am finally done moving Flow pages on officewiki [19:05:18] (03PS1) 10Ottomata: EventStreamConfig - keep geoip-* headers in eventgate-logging-external streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131413 (https://phabricator.wikimedia.org/T387908) [19:07:04] (03CR) 10Ebernhardson: [C:03+2] Add opensearch-knn [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1131068 (https://phabricator.wikimedia.org/T389812) (owner: 10DCausse) [19:08:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131413 (https://phabricator.wikimedia.org/T387908) (owner: 10Ottomata) [19:12:58] (03PS4) 10Jdlrobson: Web features should not be ambiguously configured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130771 (https://phabricator.wikimedia.org/T388445) [19:13:01] (03CR) 10Jdlrobson: Web features should not be ambiguously configured (032 comments) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130771 (https://phabricator.wikimedia.org/T388445) (owner: 10Jdlrobson) [19:17:09] (03PS1) 10Ottomata: eventgate-logging-external - upgrade to node20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131415 (https://phabricator.wikimedia.org/T383814) [19:21:36] (03PS1) 10Ahmon Dancy: Revert "P:idp Limit groups sent from CAS to Spiderpig" [puppet] - 10https://gerrit.wikimedia.org/r/1131419 [19:21:48] (03CR) 10RLazarus: [C:03+1] "LGTM when you're ready!" [puppet] - 10https://gerrit.wikimedia.org/r/1131351 (https://phabricator.wikimedia.org/T387917) (owner: 10Scott French) [19:22:29] (03PS2) 10Ahmon Dancy: Revert "P:idp Limit groups sent from CAS to Spiderpig" [puppet] - 10https://gerrit.wikimedia.org/r/1131419 (https://phabricator.wikimedia.org/T389869) [19:22:45] (03CR) 10BCornwall: [C:03+2] upgrade cp6004 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131370 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:22:57] (03PS2) 10BCornwall: upgrade cp6013 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131379 (https://phabricator.wikimedia.org/T378737) [19:23:01] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp6013 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131379 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:23:28] (03CR) 10DLynch: [C:03+1] Make officewiki readonly after moving flow pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131410 (https://phabricator.wikimedia.org/T380909) (owner: 10Zoe) [19:24:13] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6004.drmrs.wmnet} and A:cp [19:24:14] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6013.drmrs.wmnet} and A:cp [19:24:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2298.mgmt.codfw.wmnet with chassis set policy 
FORCE_RESTART [19:24:39] rzl: Would you be willing to +2 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131419 to help unbreak spiderpig? [19:24:49] looking [19:27:16] oh, I see [19:27:18] yeah, can do [19:27:29] Excellent. [19:27:31] sorry I can't give you any more nuanced help than that :P [19:27:39] (03CR) 10RLazarus: [C:03+2] Revert "P:idp Limit groups sent from CAS to Spiderpig" [puppet] - 10https://gerrit.wikimedia.org/r/1131419 (https://phabricator.wikimedia.org/T389869) (owner: 10Ahmon Dancy) [19:27:42] Moritz said he'd help out tomorrow. [19:27:53] cool [19:28:23] Thank you very much! [19:28:53] not merged at the puppetserver yet, one sec while I figure out why I can't ssh [19:29:50] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6004.drmrs.wmnet} and A:cp [19:30:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2298.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:30:16] (03PS3) 10Jdlrobson: Remove A/B test enrollment flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127930 (https://phabricator.wikimedia.org/T388905) [19:30:51] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6013.drmrs.wmnet} and A:cp [19:34:32] dancy: merged now, and ran puppet on the idp hosts [19:34:40] Ok. testing. [19:34:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 26 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131333 (https://phabricator.wikimedia.org/T389971) (owner: 10DCausse) [19:35:21] And we're back! Thanks rzl! 
[19:35:23] \o/ [19:46:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2299.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:47:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2300.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:47:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2298.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:47:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2298.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:57:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2299.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:57:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2300.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:58:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2299.codfw.wmnet with OS bookworm [19:58:18] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10680479 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2299.codfw.wmnet with... [19:58:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2300.codfw.wmnet with OS bookworm [19:58:37] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10680483 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2300.codfw.wmnet with... 
[19:58:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2298.codfw.wmnet with OS bookworm [19:58:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10680485 (10VRiley-WMF) Replaced the drives in an-worker1183, 1184, 1185 [19:59:01] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10680486 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2298.codfw.wmnet with... [19:59:19] (03PS7) 10Bking: elastic/cirrussearch: begin production migration [puppet] - 10https://gerrit.wikimedia.org/r/1131087 (https://phabricator.wikimedia.org/T388610) [20:00:08] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and thcipriani: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T2000) [20:00:08] kemayo, kimberly_sarabia, and inflatador: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:16] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131087 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:00:42] DIBS! I can deploy [20:01:23] o/ [20:01:54] Hello [20:02:21] I won’t be able to test mine — it’s for an a/b test on wikis that .22 hasn’t rolled out to yet. [20:02:36] alright, so let's do the config patches first and get CI for VE rolling in the mean time [20:02:58] so kimberly_sarabia that would mean you're up first [20:03:39] thcipriani: So sorry. I missed the -1. 
You can skip me for this window [20:03:46] My patch is only touching beta cluster, so no need to test [20:03:57] (Unless debug lets me see that. I’ve never quite been clear on that interaction.) [20:04:22] kimberly_sarabia: no problem, we'll move on to inflatador [20:05:06] inflatador: in that case yours should be quick [20:05:46] {◕ ◡ ◕} [20:06:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by spiderpig@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131333 (https://phabricator.wikimedia.org/T389971) (owner: 10DCausse) [20:06:57] (03Merged) 10jenkins-bot: cirrus: use search-psi to point to opensearch cluster in the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131333 (https://phabricator.wikimedia.org/T389971) (owner: 10DCausse) [20:08:27] inflatador: next run of https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/ should update beta cluster (happens every 10 minutes). you're done! Thanks for flying the UTC late window <3 [20:08:47] Kemayo: you're up [20:08:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by spiderpig@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131361 (https://phabricator.wikimedia.org/T389906) (owner: 10DLynch) [20:09:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2299.codfw.wmnet with reason: host reimage [20:10:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2298.codfw.wmnet with reason: host reimage [20:10:57] (03Merged) 10jenkins-bot: Edit check: in single action mode the fixed sidebar isn't allowed null offset [extensions/VisualEditor] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131361 (https://phabricator.wikimedia.org/T389906) (owner: 10DLynch) [20:11:22] !log spiderpig@deploy1003 Started scap sync-world: Backport for [[gerrit:1131361|Edit check: in single action mode the fixed sidebar 
isn't allowed null offset (T389906)]] [20:11:26] T389906: [Regression] Broken workflow after closing the "Add a citation" dialog prompted from a check and then trying to save the edit - https://phabricator.wikimedia.org/T389906 [20:11:28] (03CR) 10BCornwall: [C:03+2] upgrade cp6005 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131371 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [20:11:41] (03PS2) 10BCornwall: upgrade cp6012 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131378 (https://phabricator.wikimedia.org/T378737) [20:11:45] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp6012 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131378 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [20:12:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2299.codfw.wmnet with reason: host reimage [20:13:35] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6005.drmrs.wmnet} and A:cp [20:13:35] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6012.drmrs.wmnet} and A:cp [20:15:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2298.codfw.wmnet with reason: host reimage [20:16:05] !log spiderpig@deploy1003 spiderpig, kemayo: Backport for [[gerrit:1131361|Edit check: in single action mode the fixed sidebar isn't allowed null offset (T389906)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:16:30] (03PS2) 10Kimberly Sarabia: Set wgMinervaDonateBanner to default base true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131365 (https://phabricator.wikimedia.org/T388438) [20:17:00] Kemayo: .22 is on test wikis if it can be tested there; otherwise i can proceed [20:17:14] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install 
wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10680548 (10Jhancock.wm) [20:17:17] (03CR) 10Kimberly Sarabia: Set wgMinervaDonateBanner to default base true (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131365 (https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [20:17:25] brennen: Let me see whether it'll show up on the groups 22 isn't actually on yet. [20:18:46] brennen: It doesn't, so I cannot actually test this. Go ahead and roll it out. [20:19:14] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6005.drmrs.wmnet} and A:cp [20:19:17] Kemayo: kk, going ahead. [20:19:24] But I have now conclusively answered that question for myself about how the debug servers work in this situation. 🤩 [20:19:27] !log spiderpig@deploy1003 spiderpig, kemayo: Continuing with sync [20:19:49] I'd have been able to test it if the train hadn't been held back this morning, but alas... [20:20:06] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6012.drmrs.wmnet} and A:cp [20:24:34] (03CR) 10Stoyofuku-wmf: [C:03+1] "@jrobson@wikimedia.org I would not be offended if you double checked my work, but this looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131365 (https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [20:25:50] (03CR) 10Stoyofuku-wmf: [C:03+1] "Thank you for your patience!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130771 (https://phabricator.wikimedia.org/T388445) (owner: 10Jdlrobson) [20:26:25] !log spiderpig@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131361|Edit check: in single action mode the fixed sidebar isn't allowed null offset (T389906)]] (duration: 15m 03s) [20:26:33] T389906: [Regression] Broken workflow after closing the "Add a citation" dialog prompted from a check and then trying to save the edit - https://phabricator.wikimedia.org/T389906 [20:26:45] aaaand that's a wrap [20:26:51] !log end of UTC late backport & config window [20:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:03] !log first successful spiderpig backport window [20:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:40] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:29:24] !log dancy@deploy1003 Installing scap version "4.144.5" for 2 host(s) [20:31:11] !log dancy@deploy1003 Installation of scap version "4.144.5" completed for 2 hosts [20:32:42] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:32:45] Kemayo: I'm going to use your https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1131361 change to do another spiderpig test. 
You can ignore the messages [20:32:58] dancy: Thanks for the warning :D [20:33:37] !log spiderpig@deploy1003 Started scap sync-world: Backport for [[gerrit:1131361|Edit check: in single action mode the fixed sidebar isn't allowed null offset (T389906)]] [20:33:41] T389906: [Regression] Broken workflow after closing the "Add a citation" dialog prompted from a check and then trying to save the edit - https://phabricator.wikimedia.org/T389906 [20:34:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:34:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2299.codfw.wmnet with OS bookworm [20:34:38] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:34:38] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10680617 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2299.codfw.wmnet with OS... [20:34:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2298.codfw.wmnet with OS bookworm [20:34:45] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10680618 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2298.codfw.wmnet with OS... 
[20:37:03] (03PS1) 10Ebernhardson: beta cluster: Add eqiad-opensearch to cirrus writable clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131440 (https://phabricator.wikimedia.org/T389971) [20:37:28] Since backport finished early i'm going to run an extra patch now, it only affects beta cluster and is a no-op for prod [20:37:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2301.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:38:07] (03CR) 10Bking: [C:03+1] beta cluster: Add eqiad-opensearch to cirrus writable clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131440 (https://phabricator.wikimedia.org/T389971) (owner: 10Ebernhardson) [20:38:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2302.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:38:20] !log spiderpig@deploy1003 kemayo, spiderpig: Backport for [[gerrit:1131361|Edit check: in single action mode the fixed sidebar isn't allowed null offset (T389906)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:38:34] !log spiderpig@deploy1003 kemayo, spiderpig: Continuing with sync [20:40:12] ebernhardson: Feel free to jump in after mine completes [20:40:27] dancy: awesome, thanks [20:45:33] !log spiderpig@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131361|Edit check: in single action mode the fixed sidebar isn't allowed null offset (T389906)]] (duration: 11m 55s) [20:45:37] T389906: [Regression] Broken workflow after closing the "Add a citation" dialog prompted from a check and then trying to save the edit - https://phabricator.wikimedia.org/T389906 [20:45:49] ebernhardson: You're up [20:45:55] dancy: excellent, thanks [20:48:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2301.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:48:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131440 (https://phabricator.wikimedia.org/T389971) (owner: 10Ebernhardson) [20:48:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2302.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:49:18] (03Merged) 10jenkins-bot: beta cluster: Add eqiad-opensearch to cirrus writable clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131440 (https://phabricator.wikimedia.org/T389971) (owner: 10Ebernhardson) [20:50:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10680652 (10phaultfinder) [20:51:19] all done [20:59:18] (03PS1) 10Gergő Tisza: Fix badpass logging for locally nonexistent users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131444 [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T2100) [21:00:42] (03CR) 10Gergő Tisza: Fix badpass logging for locally nonexistent users (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131444 (owner: 10Gergő Tisza) [21:02:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2302.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:02:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2301.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:04:54] (03PS1) 10Bking: elasticsearch rolling-operation: add arguments for reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) [21:07:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2302.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:08:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
wikikube-worker2301.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:10:10] (03PS2) 10Ryan Kemper: elasticsearch rolling-operation: add arguments for reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:11:02] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2300.codfw.wmnet with OS bookworm [21:11:14] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10680733 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2300.codfw.wmnet with OS... [21:11:23] (03PS1) 10Bking: cirrussearch: use EFI for soon-to-be-reimaged Elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/1131447 (https://phabricator.wikimedia.org/T388610) [21:14:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2301.codfw.wmnet with OS bookworm [21:14:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2302.codfw.wmnet with OS bookworm [21:14:18] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10680765 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2301.codfw.wmnet with... 
[21:14:48] (03PS2) 10Ryan Kemper: cirrussearch: use EFI for soon-to-be-reimaged Elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/1131447 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:15:00] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131447 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:17:00] (03CR) 10CI reject: [V:04-1] elasticsearch rolling-operation: add arguments for reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:19:19] (03CR) 10Bking: [C:03+2] elastic/cirrussearch: begin production migration [puppet] - 10https://gerrit.wikimedia.org/r/1131087 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:19:35] (03CR) 10Bking: [C:03+2] elastic/cirrussearch: begin production migration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131087 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:22:41] (03PS3) 10Bking: elasticsearch rolling-operation: add arguments for reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) [21:23:55] (03PS1) 10Hashar: Adjust build.sh for other environments [software/bitu] - 10https://gerrit.wikimedia.org/r/1131452 [21:24:54] (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: use EFI for soon-to-be-reimaged Elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/1131447 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:25:01] (03CR) 10Bking: [C:03+2] cirrussearch: use EFI for soon-to-be-reimaged Elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/1131447 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:26:04] (03PS4) 10Bking: elasticsearch rolling-operation: add arguments for reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) [21:29:33] (03PS1) 10Hashar: tox: allow passing 
arguments to django/flake8 [software/bitu] - 10https://gerrit.wikimedia.org/r/1131453 [21:30:00] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:30:20] !log drain transport circuit CRT-008647 T389071 [21:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:25] T389071: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071 [21:30:51] (03CR) 10BCornwall: [C:03+2] upgrade cp6006 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131372 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [21:31:06] (03PS2) 10BCornwall: upgrade cp6011 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131377 (https://phabricator.wikimedia.org/T378737) [21:31:10] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp6011 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131377 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [21:32:06] (03PS1) 10Bking: relforge: bring new hosts online (again) [puppet] - 10https://gerrit.wikimedia.org/r/1131454 (https://phabricator.wikimedia.org/T389957) [21:32:20] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6011.drmrs.wmnet} and A:cp [21:32:21] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6006.drmrs.wmnet} and A:cp [21:34:01] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for elastic - jclark@cumin1002" [21:34:15] !log enable 'graceful shutdown' mode for bgp on cr1-drmrs T389071 [21:34:16] (03PS2) 10Bking: relforge: bring new hosts online (again) [puppet] - 10https://gerrit.wikimedia.org/r/1131454 (https://phabricator.wikimedia.org/T389957) [21:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for elastic - jclark@cumin1002" [21:34:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:34:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1123.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:34:42] (03PS3) 10Bking: relforge: bring new hosts online (again) [puppet] - 10https://gerrit.wikimedia.org/r/1131454 (https://phabricator.wikimedia.org/T389957) [21:35:18] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131454 (https://phabricator.wikimedia.org/T389957) (owner: 10Bking) [21:36:35] (03PS4) 10Ryan Kemper: relforge: bring new hosts online (again) [puppet] - 10https://gerrit.wikimedia.org/r/1131454 (https://phabricator.wikimedia.org/T389957) (owner: 10Bking) [21:37:59] (03CR) 10Ryan Kemper: [C:03+1] relforge: bring new hosts online (again) [puppet] - 10https://gerrit.wikimedia.org/r/1131454 (https://phabricator.wikimedia.org/T389957) (owner: 10Bking) [21:38:06] (03CR) 10Bking: [C:03+2] relforge: bring new hosts online (again) [puppet] - 10https://gerrit.wikimedia.org/r/1131454 (https://phabricator.wikimedia.org/T389957) (owner: 10Bking) [21:38:08] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6006.drmrs.wmnet} and A:cp [21:38:09] (03PS2) 10Hashar: tox: allow passing arguments to django/flake8 [software/bitu] - 10https://gerrit.wikimedia.org/r/1131453 [21:38:09] (03PS1) 10Hashar: tox: consolidate flake8 config to a single location [software/bitu] - 10https://gerrit.wikimedia.org/r/1131455 [21:38:26] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6011.drmrs.wmnet} and A:cp [21:38:34] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1124.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:39:42] !log 
jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1124.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:40:27] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1125 [21:40:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1125 [21:41:52] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1124 [21:43:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1124 [21:44:22] (03PS1) 10Bking: relforge: add new hosts to cluster config [puppet] - 10https://gerrit.wikimedia.org/r/1131458 (https://phabricator.wikimedia.org/T389957) [21:44:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131458 (https://phabricator.wikimedia.org/T389957) (owner: 10Bking) [21:45:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1123.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:46:04] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1124.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:46:44] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10680888 (10Jclark-ctr) [21:47:06] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10680889 (10Jclark-ctr) a:03Jclark-ctr [21:47:35] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1123.eqiad.wmnet with OS bullseye [21:47:40] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10680900 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host 
elastic1123.eqiad.wmnet with OS bullseye [21:47:44] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1124.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:49:22] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10680928 (10Jhancock.wm) [21:51:57] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1125.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [21:53:53] (03PS1) 10Hashar: Modernize the way integrations are called [software/bitu] - 10https://gerrit.wikimedia.org/r/1131460 [21:55:15] !log disabling external Internet peers in BGP on cr1-drmrs T389071 [21:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:19] T389071: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071 [21:59:56] (03CR) 10Ebernhardson: [C:03+1] relforge: add new hosts to cluster config [puppet] - 10https://gerrit.wikimedia.org/r/1131458 (https://phabricator.wikimedia.org/T389957) (owner: 10Bking) [22:00:02] (03CR) 10Ryan Kemper: [C:03+1] relforge: add new hosts to cluster config [puppet] - 10https://gerrit.wikimedia.org/r/1131458 (https://phabricator.wikimedia.org/T389957) (owner: 10Bking) [22:00:04] (03CR) 10Bking: [C:03+2] relforge: add new hosts to cluster config [puppet] - 10https://gerrit.wikimedia.org/r/1131458 (https://phabricator.wikimedia.org/T389957) (owner: 10Bking) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250326T2200) [22:01:27] we will be using the window today! 
[22:01:34] ...in a bit [22:01:36] !log resetting PIC0 on cr1-drmrs to enable et-0/0/1 T389071 [22:01:37] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1124.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:41] T389071: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071 [22:03:59] o/ [22:04:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1125.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:04:42] we're talking about whether to talk about the thing this 1:1 was originally about first or deploy first [22:05:15] (03CR) 10BCornwall: [C:03+2] upgrade cp6007 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131373 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [22:05:25] (03PS2) 10BCornwall: upgrade cp6010 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131376 (https://phabricator.wikimedia.org/T378737) [22:05:29] (03CR) 10BCornwall: [V:03+2 C:03+2] upgrade cp6010 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131376 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [22:05:33] starting deploys! 
Doing Jon's patch first [22:05:45] (03PS3) 10Jdlrobson: Set wgMinervaDonateBanner to default base true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131365 (https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [22:06:09] (03PS4) 10Jdlrobson: Set wgMinervaDonateBanner to default base true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131365 (https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [22:06:16] (03CR) 10Jdlrobson: [C:03+1] Set wgMinervaDonateBanner to default base true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131365 (https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [22:06:52] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6010.drmrs.wmnet} and A:cp [22:06:53] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6007.drmrs.wmnet} and A:cp [22:07:25] FIRING: SystemdUnitFailed: opensearch-disable-readahead.service on relforge1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:07:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130771 (https://phabricator.wikimedia.org/T388445) (owner: 10Jdlrobson) [22:08:07] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:08:35] (03Merged) 10jenkins-bot: Web features should not be ambiguously configured [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130771 (https://phabricator.wikimedia.org/T388445) (owner: 10Jdlrobson) [22:09:00] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1130771|Web features should not be ambiguously configured (T388445)]] [22:09:04] T388445: [Spike, 1 day] Analyze/analyse usage of base in mobile feature management - 
https://phabricator.wikimedia.org/T388445 [22:10:47] !log reset configuration on cr1-drmrs to enable external connections T389071 [22:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:52] T389071: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071 [22:12:02] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2310 to codfw - jhancock@cumin2002" [22:12:07] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6007.drmrs.wmnet} and A:cp [22:12:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2310 to codfw - jhancock@cumin2002" [22:12:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:12:25] FIRING: [7x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:12:45] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1125.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:13:14] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1124.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:13:46] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6010.drmrs.wmnet} and A:cp [22:14:13] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10681006 (10cmooney) p:05High→03Low Happy to say all looks good following the replacement patch 
and optics being installed this evening: ` cmooney@cr1-drmrs> show interfa... [22:15:19] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1124.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:15:34] !log toyofuku@deploy1003 toyofuku, jdlrobson: Backport for [[gerrit:1130771|Web features should not be ambiguously configured (T388445)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:15:38] T388445: [Spike, 1 day] Analyze/analyse usage of base in mobile feature management - https://phabricator.wikimedia.org/T388445 [22:17:23] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1124.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:17:31] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2310 [22:17:37] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1124.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:17:38] we're testing in a google meet call [22:17:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2310 [22:17:52] ok! 
[22:18:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1125.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:18:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1125.eqiad.wmnet with OS bullseye [22:18:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2310.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:18:43] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10681040 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1125.eqiad.wmnet with OS bullseye [22:18:46] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1124.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:20:34] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1124.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:22:05] (03PS2) 10Hashar: Simplify the list of client integrations [software/bitu] - 10https://gerrit.wikimedia.org/r/1131460 [22:22:25] FIRING: [7x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:24:02] (03CR) 10SBassett: [C:03+1] Fix badpass logging for locally nonexistent users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131444 (owner: 10Gergő Tisza) [22:25:37] !log toyofuku@deploy1003 toyofuku, jdlrobson: Continuing with sync [22:26:13] we noted one small followup we want to make, but I'll deploy Kim's first [22:26:37] So the order is now: [22:26:37] 1. finish Jon's [22:26:37] 2. deploy Kim's [22:26:37] 3. 
deploy Jon's part 2 [22:26:42] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2302.codfw.wmnet with OS bookworm [22:26:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2301.codfw.wmnet with OS bookworm [22:26:49] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10681046 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2302.codfw.wmnet with OS... [22:26:52] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10681047 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2301.codfw.wmnet with OS... [22:26:58] (03PS1) 10Jdlrobson: Restore simplified watchlist for logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131469 (https://phabricator.wikimedia.org/T388445) [22:27:06] (03PS3) 10Hashar: Simplify invocation of clients integrations [software/bitu] - 10https://gerrit.wikimedia.org/r/1131460 [22:27:07] (03CR) 10CI reject: [V:04-1] Restore simplified watchlist for logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131469 (https://phabricator.wikimedia.org/T388445) (owner: 10Jdlrobson) [22:27:17] (03PS5) 10Jdlrobson: Set wgMinervaDonateBanner to default base true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131365 (https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [22:27:20] (03PS2) 10Jdlrobson: Restore simplified watchlist for logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131469 (https://phabricator.wikimedia.org/T388445) [22:28:07] (03CR) 10Stoyofuku-wmf: [C:03+1] "nothing to see here" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1131469 (https://phabricator.wikimedia.org/T388445) (owner: 10Jdlrobson) [22:28:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:29:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2310.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:31:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1124.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:31:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2310.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:32:27] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1124.eqiad.wmnet with OS bullseye [22:32:32] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10681059 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1124.eqiad.wmnet with OS bullseye [22:32:41] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130771|Web features should not be ambiguously configured (T388445)]] (duration: 23m 41s) [22:32:45] T388445: [Spike, 1 day] Analyze/analyse usage of base in mobile feature management - https://phabricator.wikimedia.org/T388445 [22:33:34] Onward to Kim's [22:33:51] kimberly_sarabia: courtesy tag [22:34:03] o/ [22:35:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131365 
(https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [22:35:36] will tag you again when it's on test servers [22:35:41] Jon and I will also be testing [22:35:53] 3 web engineers = 1 edward I hope [22:36:22] (03Merged) 10jenkins-bot: Set wgMinervaDonateBanner to default base true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131365 (https://phabricator.wikimedia.org/T388438) (owner: 10Kimberly Sarabia) [22:36:42] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1131365|Set wgMinervaDonateBanner to default base true (T388438)]] [22:36:47] T388438: Gradual Rollout - Donate Button Deployment - https://phabricator.wikimedia.org/T388438 [22:37:25] RESOLVED: [4x] SystemdUnitFailed: opensearch-disable-readahead.service on relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:38:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2310.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [22:42:03] 1 of the 12 test servers seems a bit slower than most [22:42:21] not stuck but noticeably we keep getting hung on 11/12 for a while [22:42:27] noting for nobody in particular! [22:42:58] !log toyofuku@deploy1003 ksarabia, toyofuku: Backport for [[gerrit:1131365|Set wgMinervaDonateBanner to default base true (T388438)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:43:03] T388438: Gradual Rollout - Donate Button Deployment - https://phabricator.wikimedia.org/T388438 [22:43:14] kimberly_sarabia: we're on testservers! [22:44:07] toyofuku: LGTM [22:44:13] looks good to us too! [22:44:20] Any reason not to proceed? [22:44:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10681101 (10phaultfinder) [22:44:46] Can't think of anything. 
Feel free to proceed [22:44:52] yay thank you! [22:44:55] !log toyofuku@deploy1003 ksarabia, toyofuku: Continuing with sync [22:45:58] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10681105 (10Jclark-ctr) [22:47:44] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10681110 (10Jclark-ctr) a:05Jclark-ctr→03bking @bking we are missing these servers in site.pp if you can update puppet so they can be reimaged [22:49:03] (03CR) 10BCornwall: [C:03+2] upgrade cp6008 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131374 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [22:49:05] (03CR) 10BCornwall: [C:03+2] upgrade cp6009 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131375 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [22:50:25] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6009.drmrs.wmnet} and A:cp [22:50:27] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp6008.drmrs.wmnet} and A:cp [22:51:57] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131365|Set wgMinervaDonateBanner to default base true (T388438)]] (duration: 15m 15s) [22:52:02] T388438: Gradual Rollout - Donate Button Deployment - https://phabricator.wikimedia.org/T388438 [22:52:06] kimberly_sarabia: we're live! [22:53:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131469 (https://phabricator.wikimedia.org/T388445) (owner: 10Jdlrobson) [22:53:11] doing that final patch [22:53:43] toyofuku: LGTM tyty! 
[22:53:52] (03Merged) 10jenkins-bot: Restore simplified watchlist for logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131469 (https://phabricator.wikimedia.org/T388445) (owner: 10Jdlrobson) [22:54:16] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1131469|Restore simplified watchlist for logged in users (T388445)]] [22:54:21] T388445: [Spike, 1 day] Analyze/analyse usage of base in mobile feature management - https://phabricator.wikimedia.org/T388445 [22:56:25] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6008.drmrs.wmnet} and A:cp [22:57:00] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp6009.drmrs.wmnet} and A:cp [22:58:25] 11/12 servers finished k8s deployment in under 20 seconds, the last one took two minutes [22:58:39] Not sure how much of a big deal to make of that but at least I have shouted it into the void [22:58:41] !log toyofuku@deploy1003 jdlrobson, toyofuku: Backport for [[gerrit:1131469|Restore simplified watchlist for logged in users (T388445)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:59:15] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1123.eqiad.wmnet with OS bullseye [22:59:23] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10681125 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1123.eqiad.wmnet with OS bullseye executed with errors: - ela... 
[22:59:28] !log toyofuku@deploy1003 jdlrobson, toyofuku: Continuing with sync [23:06:01] 10ops-codfw, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390008#10681143 (10phaultfinder) [23:06:45] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131469|Restore simplified watchlist for logged in users (T388445)]] (duration: 12m 29s) [23:06:50] T388445: [Spike, 1 day] Analyze/analyse usage of base in mobile feature management - https://phabricator.wikimedia.org/T388445 [23:06:55] we're all done! [23:07:24] thanks everyone - noting one last time that it seems like we have a half-stuck host in the testserver pool, but I am far from an expert in that area [23:08:19] ty! [23:09:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131444 (owner: 10Gergő Tisza) [23:16:33] 06SRE-OnFire, 10Incident Tooling: Reconsider default incident visibility - https://phabricator.wikimedia.org/T389664#10681160 (10Eevans) Is there anyone that thinks that #acl_security is a //worse// choice of default visibility? 
[23:19:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10681166 (10phaultfinder) [23:27:04] (03PS1) 10Eevans: corto: use #acl*security for new incidents [puppet] - 10https://gerrit.wikimedia.org/r/1131479 (https://phabricator.wikimedia.org/T389664) [23:27:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:27:55] 10ops-drmrs: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389848#10681178 (10phaultfinder) [23:37:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:44:31] (03PS1) 10Gergő Tisza: Enable SUL3 login for everyone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131480 (https://phabricator.wikimedia.org/T384219) [23:46:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131480 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [23:47:29] (03PS1) 10Gergő Tisza: Enable SUL3 for temp users on group 0/1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131481 (https://phabricator.wikimedia.org/T384220) [23:47:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131481 (https://phabricator.wikimedia.org/T384220) (owner: 10Gergő Tisza) [23:49:43] (03CR) 10RLazarus: [C:03+1] corto: use #acl*security for new incidents [puppet] - 
10https://gerrit.wikimedia.org/r/1131479 (https://phabricator.wikimedia.org/T389664) (owner: 10Eevans) [23:52:08] (03PS1) 10Gergő Tisza: Disable new WebAuthn credentials creation on local domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131482 (https://phabricator.wikimedia.org/T378402) [23:54:58] (03CR) 10Krinkle: Fix wgCirrusSearchSimilarityProfiles (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 (owner: 10Hashar) [23:55:19] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10681272 (10Jhancock.wm) @Clement_Goubert hey i need a little favor. i noticed a missing range in this on site.pp. node /^wikikube-worker23([1-2... [23:56:12] (03PS1) 10Jdlrobson: Deploy dark mode and Vector 2022 to German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131483 (https://phabricator.wikimedia.org/T387155) [23:57:25] (03PS1) 10Jdlrobson: Enable Vector 2022 for Russian Wikimedia and arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131484 (https://phabricator.wikimedia.org/T390112)