[00:00:39] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[00:03:01] SRE-OnFire, Incident Tooling: Incident documents are less visible with Corto - https://phabricator.wikimedia.org/T390126 (RLazarus) NEW
[00:09:14] (CR) Ssingh: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1131052 (https://phabricator.wikimedia.org/T384227) (owner: Fabfur)
[00:20:39] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[00:34:06] SRE, WMF-General-or-Unknown, Performance Issue, Wikimedia-production-error: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" errors - https://phabricator.wikimedia.org/T389734#10681315 (Scott_French) A quick spot check this afternoon after we re-pooled codfw around 14:...
[00:38:57] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1131487
[00:38:57] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1131487 (owner: TrainBranchBot)
[00:50:02] ops-codfw, SRE, DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390008#10681346 (phaultfinder)
[00:50:44] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1131487 (owner: TrainBranchBot)
[01:08:45] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1131490
[01:08:46] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1131490 (owner: TrainBranchBot)
[01:19:57] (CR) Bartosz Dziewoński: [C:+1] Disable new WebAuthn credentials creation on local domains [mediawiki-config] - 
(https://phabricator.wikimedia.org/T378402) (owner: Gergő Tisza)
[01:26:11] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1131490 (owner: TrainBranchBot)
[02:03:01] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131018 (https://phabricator.wikimedia.org/T389952) (owner: Hubaishan)
[02:28:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:29:48] SRE, MW-on-K8s, serviceops: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056#10681399 (Krinkle)
[02:33:48] (CR) Zabe: [C:+1] Fix badpass logging for locally nonexistent users [mediawiki-config] - https://gerrit.wikimedia.org/r/1131444 (owner: Gergő Tisza)
[02:49:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10681418 (phaultfinder)
[02:57:11] (PS2) Robertsky: update wikimaniawiki perms configurations: [mediawiki-config] - https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729)
[02:58:05] (PS3) Robertsky: update wikimaniawiki perms configurations: [mediawiki-config] - https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729)
[02:59:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10681419 (phaultfinder)
[03:01:16] (PS5) Robertsky: updating wikimaniawiki namespace configurations: [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729)
[03:02:48] (CR) Robertsky: [C:+1] updating wikimaniawiki namespace configurations: [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[03:04:57] (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[03:05:45] (CR) CI reject: [V:-1] updating wikimaniawiki namespace configurations: [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[03:08:03] (CR) Anzx: updating wikimaniawiki namespace configurations: (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[03:15:36] (CR) Anzx: update wikimaniawiki perms configurations: (6 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[03:17:25] (PS6) Robertsky: updating wikimaniawiki namespace configurations: [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729)
[03:18:28] (CR) Robertsky: updating wikimaniawiki namespace configurations: (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[03:21:35] (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[03:22:17] (CR) CI reject: [V:-1] updating wikimaniawiki namespace configurations: [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[03:22:58] (PS4) Robertsky: update wikimaniawiki perms configurations: [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729)
[03:23:07] (CR) Robertsky: update wikimaniawiki perms configurations: (6 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[03:23:14] (CR) Anzx: [C:-1] "`" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[03:25:24] (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[03:25:36] (PS7) Robertsky: updating wikimaniawiki namespace configurations: [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729)
[03:27:10] (CR) Anzx: [C:+1] update wikimaniawiki perms configurations: [mediawiki-config] - https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[03:30:37] (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[03:34:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10681438 (phaultfinder)
[03:59:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10681445 (phaultfinder)
[04:04:53] (CR) Anzx: [C:+1] updating wikimaniawiki namespace configurations: [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[04:06:30] !log restart grafana-server on grafana1002 - appears hung
[04:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:06:40] (CR) Anzx: [C:+1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[04:06:43] 
(CR) Anzx: [C:+1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[04:07:01] FIRING: DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[04:12:01] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[04:24:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10681451 (phaultfinder)
[04:32:01] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[05:24:13] Ehm
[05:24:32] I just notice that a user that was recently indef. at es.wikipedia is now globally locked as a Compromised Account
[05:24:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10681484 (phaultfinder)
[05:24:39] and T389727 is listed in the summary
[05:24:52] ¿is it a security bug or something like that?
[05:25:01] It's a secret ticket so I can not check it
[05:25:31] I just ran a CU-check to the account and found nothing suspicious
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T0600)
[06:00:05] marostegui, Amir1, and federico3: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T0600).
[06:01:24] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2181.codfw.wmnet onto db2242.codfw.wmnet
[06:01:35] LuchoCR: that is a security bug yes
[06:05:37] ops-codfw, SRE, DBA, DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10681504 (Marostegui) >>! In T388684#10678685, @Jhancock.wm wrote: > @Marostegui pulled one! Thank you @Jhancock.wm - you can go ahead and place it back! 
[06:08:06] (PS1) Marostegui: installserver: Do not reimage db2242 [puppet] - https://gerrit.wikimedia.org/r/1131500 (https://phabricator.wikimedia.org/T381475)
[06:10:23] (CR) Marostegui: [C:+2] installserver: Do not reimage db2242 [puppet] - https://gerrit.wikimedia.org/r/1131500 (https://phabricator.wikimedia.org/T381475) (owner: Marostegui)
[06:24:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10681549 (phaultfinder)
[06:28:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:28:22] (PS2) Muehlenhoff: Stop including bullseye-backports on Bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/1130082 (https://phabricator.wikimedia.org/T383557)
[07:30:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet
[07:30:16] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10681573 (ops-monitoring-bot) Draining ganeti4005.ulsfo.wmnet of running VMs
[07:30:24] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1130082 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[07:30:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet
[07:32:40] (PS1) Marostegui: mariadb: Productionize db1255 [puppet] - https://gerrit.wikimedia.org/r/1131626 (https://phabricator.wikimedia.org/T381475)
[07:33:32] (CR) Marostegui: [C:+2] mariadb: Productionize db1255 [puppet] - 
https://gerrit.wikimedia.org/r/1131626 (https://phabricator.wikimedia.org/T381475) (owner: Marostegui)
[07:34:52] (CR) Krinkle: Web features should not be ambiguously configured (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1130771 (https://phabricator.wikimedia.org/T388445) (owner: Jdlrobson)
[07:37:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet
[07:37:36] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10681583 (ops-monitoring-bot) Draining ganeti4005.ulsfo.wmnet of running VMs
[07:42:33] (PS3) Slyngshede: C:raid:perccli do not error out if controller is no in use [puppet] - https://gerrit.wikimedia.org/r/1126542
[07:42:57] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/FundraiserLandingPage] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1131348 (https://phabricator.wikimedia.org/T390032) (owner: Jforrester)
[07:43:08] (CR) Muehlenhoff: [C:+2] Stop including bullseye-backports on Bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/1130082 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[07:43:21] (PS1) Zoe: Archive user talk pages even if the userpage doesn't exist [extensions/Flow] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1131627 (https://phabricator.wikimedia.org/T380911)
[07:43:36] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/Flow] (wmf/1.44.0-wmf.21) - https://gerrit.wikimedia.org/r/1131627 (https://phabricator.wikimedia.org/T380911) (owner: Zoe)
[07:43:58] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 
27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131410 (https://phabricator.wikimedia.org/T380909) (owner: Zoe)
[07:48:48] (PS1) Muehlenhoff: autoinstall: Remove some Ubuntu traces [puppet] - https://gerrit.wikimedia.org/r/1131628
[07:54:04] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1211.eqiad.wmnet onto db1255.eqiad.wmnet
[07:55:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10681616 (phaultfinder)
[07:57:37] SRE, Infrastructure-Foundations, Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#10681617 (MoritzMuehlenhoff) All uses of bullseye-backports in Puppet have been remove and I've merged a patch so that bullseye-backports is no longer added to the apt co...
[07:57:51] (PS1) Muehlenhoff: Remove golang-1.17 and golang-1.18 images [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557)
[08:00:05] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T0800).
[08:00:05] hubaishan, andre, and zip: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:56] morning
[08:01:19] hej hej! 
[08:03:55] I'm still metaphorically in my dressing gown sipping coffee, but my first change is a script (backporting to wmf.21) and my other is a config change making Flow read-only on officewiki
[08:04:10] (less metaphorically I'm dressed but pre-coffee)
[08:05:42] (PS1) Muehlenhoff: Update the 1.19 image to be based on Bookworm, not bullseye-backports [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131631 (https://phabricator.wikimedia.org/T383557)
[08:06:31] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1211.eqiad.wmnet onto db1255.eqiad.wmnet
[08:07:22] I wonder who is around to deploy? I mean, technically I could...
[08:07:34] I was starting to ponder that as well
[08:08:08] the number of backports I've done lately to this damned script, I figure I may as well sign up for training. I'm not sure I'm quite ready to do other people's patches yet
[08:08:14] ehehe
[08:08:35] may I go ahead with "my" patch (as it blocks the train to deploy in the window right after)?
[08:08:53] oh well, could also start with your two and then you're free. 
Yeah that makes more sense
[08:13:22] (CR) TrainBranchBot: [C:+2] "Approved by aklapper@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131410 (https://phabricator.wikimedia.org/T380909) (owner: Zoe)
[08:14:14] (Merged) jenkins-bot: Make officewiki readonly after moving flow pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1131410 (https://phabricator.wikimedia.org/T380909) (owner: Zoe)
[08:14:29] zip: not sure backporting 1131627 to .21 is needed as .21 will ideally be gone within the next 2h anyway and there will be .22 everywhere
[08:14:49] !log aklapper@deploy1003 Started scap sync-world: Backport for [[gerrit:1131410|Make officewiki readonly after moving flow pages (T380909)]]
[08:14:53] T380909: [Config] Set Flow to read-only at all *Phase 2b* wikis - https://phabricator.wikimedia.org/T380909
[08:15:33] andre: ah, did the blocker expire?
[08:15:38] or get resolved i mean
[08:16:28] oh, reading from bottom to top wasn't helpful here. I see you're about to resolve that
[08:16:31] zip: well I need to backport the blocker fix too in the next minutes and then proceed with the two remaining train groups :)
[08:17:01] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[08:17:15] I suppose it doesn't matter if my patch is on the train in an hour or two anyway, as long as the train really does go out
[08:17:34] I'm trying to get this task closed before I go on vacation next week, then it's the offsite after that... 
[08:19:38] I am here for 1131018
[08:21:52] !log aklapper@deploy1003 zoe, aklapper: Backport for [[gerrit:1131410|Make officewiki readonly after moving flow pages (T380909)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:21:57] T380909: [Config] Set Flow to read-only at all *Phase 2b* wikis - https://phabricator.wikimedia.org/T380909
[08:21:58] !log aklapper@deploy1003 zoe, aklapper: Continuing with sync
[08:24:56] Confirming: flow now readonly on officewiki
[08:25:03] well, on the canary
[08:25:35] zip: nice, thanks for checking
[08:25:42] zip: Noted, understandable. In case I need to roll back the train I'll backport/deploy your second change, let's say
[08:25:59] but ideally everything is on .22 soon (phamous last words)
[08:26:44] ops-eqiad, SRE, Ceph, Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134 (dcaro) NEW
[08:26:47] appreciated
[08:27:25] right then, coffee run time
[08:27:55] ops-eqiad, SRE, Ceph, Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10681659 (dcaro) p:Triage→High
[08:28:13] !log depooling cp7001 and cp7009 to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131052 (T384227)
[08:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:18] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227
[08:28:30] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet
[08:28:37] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7009.magru.wmnet
[08:28:44] (CR) Fabfur: [C:+2] haproxy: use volatile storage for 2 hosts on magru [puppet] - https://gerrit.wikimedia.org/r/1131052 (https://phabricator.wikimedia.org/T384227) (owner: Fabfur)
[08:28:46] hubaishan: I'll backport yours 
next
[08:29:03] !log aklapper@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131410|Make officewiki readonly after moving flow pages (T380909)]] (duration: 14m 14s)
[08:29:07] T380909: [Config] Set Flow to read-only at all *Phase 2b* wikis - https://phabricator.wikimedia.org/T380909
[08:29:28] (CR) TrainBranchBot: [C:+2] "Approved by aklapper@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131018 (https://phabricator.wikimedia.org/T389952) (owner: Hubaishan)
[08:30:17] (Merged) jenkins-bot: Allow arwikisource bureaucrat to manage "import" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131018 (https://phabricator.wikimedia.org/T389952) (owner: Hubaishan)
[08:30:38] !log aklapper@deploy1003 Started scap sync-world: Backport for [[gerrit:1131018|Allow arwikisource bureaucrat to manage "import" (T389952)]]
[08:30:43] T389952: Allow arwikisource bureaucrat to add and remove "import" usergroup - https://phabricator.wikimedia.org/T389952
[08:32:56] (PS1) Muehlenhoff: Assign puppetserver role to puppetserver2004 [puppet] - https://gerrit.wikimedia.org/r/1131632 (https://phabricator.wikimedia.org/T381274)
[08:34:25] (PS1) Chuckonwumelu: Add Chuck key [puppet] - https://gerrit.wikimedia.org/r/1131633
[08:35:03] (CR) CI reject: [V:-1] Add Chuck key [puppet] - https://gerrit.wikimedia.org/r/1131633 (owner: Chuckonwumelu)
[08:35:07] OK in debug server
[08:37:01] !log aklapper@deploy1003 hubaishan, aklapper: Backport for [[gerrit:1131018|Allow arwikisource bureaucrat to manage "import" (T389952)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:37:01] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[08:37:05] T389952: Allow arwikisource bureaucrat to add and remove "import" usergroup - https://phabricator.wikimedia.org/T389952
[08:37:10] !log aklapper@deploy1003 hubaishan, aklapper: Continuing with sync 
[08:38:28] OK
[08:40:25] ops-eqiad, SRE, Ceph, Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10681690 (dcaro)
[08:40:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet
[08:41:16] !log repooling cp7001 and cp7009 with new TLS certificate path (T384227)
[08:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:20] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227
[08:41:23] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7009.magru.wmnet
[08:41:29] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet
[08:41:55] ops-drmrs: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389848#10681700 (phaultfinder)
[08:42:36] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti4005.ulsfo.wmnet with reason: remove from cluster for reimage
[08:42:41] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10681701 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=652b6b7b-5164-4a67-b73d-931451743ac2) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and the...
[08:44:06] !log aklapper@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131018|Allow arwikisource bureaucrat to manage "import" (T389952)]] (duration: 13m 28s)
[08:44:11] T389952: Allow arwikisource bureaucrat to add and remove "import" usergroup - https://phabricator.wikimedia.org/T389952
[08:44:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10681703 (phaultfinder)
[08:44:57] hubaishan: done. 
And thanks for checking :)
[08:45:27] Thank you
[08:45:34] (CR) TrainBranchBot: [C:+2] "Approved by aklapper@deploy1003 using scap backport" [extensions/FundraiserLandingPage] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1131348 (https://phabricator.wikimedia.org/T390032) (owner: Jforrester)
[08:45:43] (PS1) Muehlenhoff: Switch ganeti4005 to nftables [puppet] - https://gerrit.wikimedia.org/r/1131634
[08:47:13] (CR) Muehlenhoff: [C:+2] Switch ganeti4005 to nftables [puppet] - https://gerrit.wikimedia.org/r/1131634 (owner: Muehlenhoff)
[08:47:52] (PS1) Brouberol: mediawiki-dumps-legacy: mount the dumps configuration in the toolbox pod [deployment-charts] - https://gerrit.wikimedia.org/r/1131635
[08:48:30] (Merged) jenkins-bot: Instead of calling deprecated parserOptions(), parse content ourselves [extensions/FundraiserLandingPage] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1131348 (https://phabricator.wikimedia.org/T390032) (owner: Jforrester)
[08:48:56] !log aklapper@deploy1003 Started scap sync-world: Backport for [[gerrit:1131348|Instead of calling deprecated parserOptions(), parse content ourselves (T390032)]]
[08:49:01] T390032: PHP Deprecated: Use of MediaWiki\Output\OutputPage::parserOptions was deprecated in MediaWiki 1.44. 
[Called from MediaWiki\Extension\FundraiserLandingPage\Specials\FundraiserLandingPage::execute] - https://phabricator.wikimedia.org/T390032
[08:50:09] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti4005.ulsfo.wmnet
[08:51:44] (PS1) Alexandros Kosiaris: mediawiki: Bump ingress module to 1.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1131636 (https://phabricator.wikimedia.org/T384944)
[08:53:45] !log aklapper@deploy1003 aklapper, jforrester: Backport for [[gerrit:1131348|Instead of calling deprecated parserOptions(), parse content ourselves (T390032)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:54:28] !log aklapper@deploy1003 aklapper, jforrester: Continuing with sync
[09:00:04] andre and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T0900).
[09:01:20] !log aklapper@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131348|Instead of calling deprecated parserOptions(), parse content ourselves (T390032)]] (duration: 12m 24s)
[09:01:25] T390032: PHP Deprecated: Use of MediaWiki\Output\OutputPage::parserOptions was deprecated in MediaWiki 1.44. 
[Called from MediaWiki\Extension\FundraiserLandingPage\Specials\FundraiserLandingPage::execute] - https://phabricator.wikimedia.org/T390032
[09:02:26] (PS1) TrainBranchBot: group1 to 1.44.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1131638 (https://phabricator.wikimedia.org/T386217)
[09:02:27] (CR) TrainBranchBot: [C:+2] group1 to 1.44.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1131638 (https://phabricator.wikimedia.org/T386217) (owner: TrainBranchBot)
[09:03:15] (Merged) jenkins-bot: group1 to 1.44.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/1131638 (https://phabricator.wikimedia.org/T386217) (owner: TrainBranchBot)
[09:03:52] (CR) Elukey: "In production-images I see the following:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[09:04:40] (CR) Elukey: [C:+1] Update the 1.19 image to be based on Bookworm, not bullseye-backports [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131631 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[09:05:02] (CR) Elukey: [C:+1] autoinstall: Remove some Ubuntu traces [puppet] - https://gerrit.wikimedia.org/r/1131628 (owner: Muehlenhoff)
[09:05:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10681752 (phaultfinder)
[09:07:40] (PS1) Slyngshede: Permission request: Remove ticket field from permission requests [software/bitu] - https://gerrit.wikimedia.org/r/1131639
[09:09:39] jouncebot: now
[09:09:40] For the next 1 hour(s) and 50 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T0900)
[09:09:47] jouncebot: next
[09:09:47] In 0 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1000)
[09:10:12] 
(CR) Elukey: "It is not currently doable since libboost-all-dev is at version 1.73 on bookworm and mapnik 4.0.6 wants 1.83 (that is present in trixie)." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131388 (https://phabricator.wikimedia.org/T389776) (owner: Elukey)
[09:13:48] (PS3) DCausse: cirrus: use only deployment-cirrussearch*.deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/1131335 (https://phabricator.wikimedia.org/T389971)
[09:13:48] (PS4) Jelto: deployment_server: add puppetdb rsync to external_services [puppet] - https://gerrit.wikimedia.org/r/1125098 (https://phabricator.wikimedia.org/T350794)
[09:14:36] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.22 refs T386217
[09:14:41] T386217: 1.44.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T386217
[09:15:16] (Abandoned) DCausse: cirrus: allow writing to eqiad-opensearch in deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/1131334 (https://phabricator.wikimedia.org/T389971) (owner: DCausse)
[09:15:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10681782 (phaultfinder)
[09:17:06] (CR) Jelto: deployment_server: add puppetdb rsync to external_services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1125098 (https://phabricator.wikimedia.org/T350794) (owner: Jelto)
[09:20:00] SRE, Data-Engineering, Data-Platform-SRE: Rebuild Spark images with Bookworm / bullseye-backports deprecation - https://phabricator.wikimedia.org/T390139 (MoritzMuehlenhoff) NEW
[09:20:37] (PS1) Brouberol: define signal handlers allowing to print the current stack or drop into a breakpoint [dumps] - https://gerrit.wikimedia.org/r/1131640 (https://phabricator.wikimedia.org/T390059)
[09:20:39] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 27 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131286 (https://phabricator.wikimedia.org/T388372) (owner: DCausse)
[09:20:43] (CR) Btullis: [C:+1] mediawiki-dumps-legacy: mount the dumps configuration in the toolbox pod [deployment-charts] - https://gerrit.wikimedia.org/r/1131635 (owner: Brouberol)
[09:20:50] (CR) Muehlenhoff: [C:-1] "Ah good catch! I hadn't thought about the build deps, I've filed https://phabricator.wikimedia.org/T390139 to get these updated or removed" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: Muehlenhoff)
[09:20:56] (CR) CI reject: [V:-1] define signal handlers allowing to print the current stack or drop into a breakpoint [dumps] - https://gerrit.wikimedia.org/r/1131640 (https://phabricator.wikimedia.org/T390059) (owner: Brouberol)
[09:21:38] (CR) Muehlenhoff: [C:+2] Assign puppetserver role to puppetserver2004 [puppet] - https://gerrit.wikimedia.org/r/1131632 (https://phabricator.wikimedia.org/T381274) (owner: Muehlenhoff)
[09:22:52] (PS2) Brouberol: define signal handlers allowing to print the current stack or drop into a breakpoint [dumps] - https://gerrit.wikimedia.org/r/1131640 (https://phabricator.wikimedia.org/T390059)
[09:22:52] (PS1) Brouberol: Fix CI linting issue [dumps] - https://gerrit.wikimedia.org/r/1131641 (https://phabricator.wikimedia.org/T390059)
[09:23:14] (CR) CI reject: [V:-1] define signal handlers allowing to print the current stack or drop into a breakpoint [dumps] - https://gerrit.wikimedia.org/r/1131640 (https://phabricator.wikimedia.org/T390059) (owner: Brouberol)
[09:23:51] (PS3) Brouberol: define signal handlers allowing to print the current stack or drop into a breakpoint [dumps] - https://gerrit.wikimedia.org/r/1131640 (https://phabricator.wikimedia.org/T390059)
[09:24:21] (CR) 
10Brouberol: [C:03+2] mediawiki-dumps-legacy: mount the dumps configuration in the toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131635 (owner: 10Brouberol) [09:27:33] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [09:27:42] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:27:46] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131642 (https://phabricator.wikimedia.org/T386217) [09:27:47] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131642 (https://phabricator.wikimedia.org/T386217) (owner: 10TrainBranchBot) [09:28:10] (03CR) 10Btullis: [C:03+1] "Nice." [dumps] - 10https://gerrit.wikimedia.org/r/1131640 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [09:28:40] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131642 (https://phabricator.wikimedia.org/T386217) (owner: 10TrainBranchBot) [09:28:44] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [09:28:55] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:29:02] (03CR) 10Brouberol: [C:03+2] Fix CI linting issue [dumps] - 10https://gerrit.wikimedia.org/r/1131641 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [09:29:10] (03CR) 10Brouberol: [C:03+2] define signal handlers allowing to print the current stack or drop into a breakpoint [dumps] - 10https://gerrit.wikimedia.org/r/1131640 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [09:29:17] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [09:29:23] (03Merged) 10jenkins-bot: Fix CI linting issue [dumps] - 10https://gerrit.wikimedia.org/r/1131641 
(https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [09:29:27] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:29:34] !log silence LogstashKafkaConsumerLag and LogstashIndexingFailures for today for 1d - T390140 [09:29:35] (03Merged) 10jenkins-bot: define signal handlers allowing to print the current stack or drop into a breakpoint [dumps] - 10https://gerrit.wikimedia.org/r/1131640 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [09:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:38] T390140: Eventstreams 'assignments' field type - https://phabricator.wikimedia.org/T390140 [09:29:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [09:30:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [09:31:16] (03PS1) 10Elukey: role::ml_k8s::master: move ml-serve-ctrl2002 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131649 (https://phabricator.wikimedia.org/T387854) [09:32:00] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [09:32:03] (03PS2) 10Muehlenhoff: Update the 1.19 image to be based on Bookworm, not bullseye-backports [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131631 (https://phabricator.wikimedia.org/T383557) [09:32:14] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:32:54] (03PS2) 10Elukey: role::ml_k8s::master: move ml-serve-ctrl2002 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131649 (https://phabricator.wikimedia.org/T387854) [09:33:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet [09:33:38] (03CR) 10Muehlenhoff: [C:03+2] autoinstall: Remove some Ubuntu traces [puppet] - 10https://gerrit.wikimedia.org/r/1131628 (owner: 
10Muehlenhoff) [09:35:08] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5165/" [puppet] - 10https://gerrit.wikimedia.org/r/1131649 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [09:36:30] (03CR) 10Elukey: [V:03+1 C:03+2] role::ml_k8s::master: move ml-serve-ctrl2002 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1131649 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [09:41:23] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve-ctrl2002.codfw.wmnet with OS bookworm [09:41:25] FIRING: [2x] SystemdUnitFailed: sync-puppet-ca.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:41:32] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.22 refs T386217 [09:41:36] T386217: 1.44.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T386217 [09:44:25] (03CR) 10Jelto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126964 (https://phabricator.wikimedia.org/T288629) (owner: 10JMeybohm) [09:45:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet [09:45:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti4005.ulsfo.wmnet [09:45:50] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [09:45:55] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [09:48:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4005.ulsfo.wmnet with OS bookworm [09:48:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti 
servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10681880 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bookworm [09:49:57] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1126964 (https://phabricator.wikimedia.org/T288629) (owner: 10JMeybohm) [09:50:35] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131323 (https://phabricator.wikimedia.org/T388269) (owner: 10Ilias Sarantopoulos) [09:55:58] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1125098 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [09:56:25] FIRING: [2x] SystemdUnitFailed: sync-puppet-ca.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:56:49] (03PS1) 10Jon Harald Søby: Change category collation for ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131651 (https://phabricator.wikimedia.org/T310051) [09:57:49] FIRING: HelmReleaseBadStatus: Helm release mediawiki-dumps-legacy/production on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=mediawiki-dumps-legacy - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:58:15] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve-ctrl2002.codfw.wmnet with reason: host reimage [09:58:48] andre, should I add this ^ to the next deployment window, or should it maybe be deployed as part of the train deployment? [10:00:05] andre and jnuche: How many deployers does it take to do MediaWiki train - Utc-0 Version deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T0900). [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1000) [10:00:36] damn, jouncebot is sassy [10:01:13] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve-ctrl2002.codfw.wmnet with reason: host reimage [10:01:33] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:01:35] Jhs: Train already left all stations (plus it's the last week of Daylight Confusion Time with now the colliding MW Infra window) so next window would be great [10:01:44] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:01:48] thanks so much [10:02:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131651 (https://phabricator.wikimedia.org/T310051) (owner: 10Jon Harald Søby) [10:03:25] (03PS1) 10Brouberol: mediawiki-dumps-legacy: simplify network policy management [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131653 (https://phabricator.wikimedia.org/T390059) [10:04:41] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: simplify network policy management [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131653 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [10:05:27] andre, 👍 added: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=2287213&oldid=2287187 [10:05:31] thanks [10:06:17] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: simplify network policy management [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131653 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [10:06:31] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime 
for 2:00:00 on ganeti4005.ulsfo.wmnet with reason: host reimage [10:08:31] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [10:08:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [10:09:09] (03CR) 10Btullis: [C:03+2] Update hadoop-test webrequest gobblin/purge jobs [puppet] - 10https://gerrit.wikimedia.org/r/1131405 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [10:10:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4005.ulsfo.wmnet with reason: host reimage [10:10:48] andre: my deployment window is at the same time as yours, would you please give me a heads-up when you are done so I can deploy? [10:10:53] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [10:10:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [10:11:08] effie: oh sorry! Done with the train, go ahead [10:11:14] excellent! thank you! 
[10:11:40] !log joal@deploy1003 Started deploy [analytics/refinery@bc1b576] (hadoop-test): Analytics webrequest_frontend update TEST [analytics/refinery@bc1b5761] [10:12:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [10:12:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [10:12:11] (03CR) 10Effie Mouzeli: [C:03+2] mw-mcrouter: use monitoring.named_ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129816 (owner: 10Effie Mouzeli) [10:13:08] !log joal@deploy1003 Finished deploy [analytics/refinery@bc1b576] (hadoop-test): Analytics webrequest_frontend update TEST [analytics/refinery@bc1b5761] (duration: 01m 27s) [10:13:37] (03Merged) 10jenkins-bot: mw-mcrouter: use monitoring.named_ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129816 (owner: 10Effie Mouzeli) [10:14:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [10:14:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [10:14:34] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:14:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [10:14:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [10:15:25] (03PS1) 10Clément Goubert: site.pp: Add missing wikikube-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1131655 (https://phabricator.wikimedia.org/T384970) [10:15:38] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:15:56] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. 
[10:16:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [10:16:55] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [10:17:18] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [10:17:49] RESOLVED: HelmReleaseBadStatus: Helm release mediawiki-dumps-legacy/production on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=mediawiki-dumps-legacy - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:18:16] (03PS1) 10Brouberol: mediawiki-dumps-legacy: fix toolbox matchlabels [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131656 (https://phabricator.wikimedia.org/T390059) [10:18:17] (03PS1) 10Brouberol: mediawiki-dumps-legacy: Reflect recent ceph fs volume increase in the values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131657 (https://phabricator.wikimedia.org/T390059) [10:18:26] (03CR) 10Effie Mouzeli: "Thank you very much for this Jesse! One question, after applying this patch, will there be leftover apt sources under /etc/apt/sources.lis" [puppet] - 10https://gerrit.wikimedia.org/r/1130195 (https://phabricator.wikimedia.org/T388388) (owner: 10JHathaway) [10:19:02] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: fix toolbox matchlabels [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131656 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [10:19:05] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[10:19:20] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: Reflect recent ceph fs volume increase in the values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131657 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [10:19:35] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve-ctrl2002.codfw.wmnet with OS bookworm [10:20:01] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: fix toolbox matchlabels [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131656 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [10:20:06] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: Reflect recent ceph fs volume increase in the values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131657 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [10:20:24] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:21:20] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: fix toolbox matchlabels [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131656 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [10:21:27] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: Reflect recent ceph fs volume increase in the values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131657 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [10:21:50] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:22:32] right, more dry-runs to do. If there's no objections... [10:22:45] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[10:23:37] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [10:23:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [10:24:58] 10ops-eqsin: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390147 (10phaultfinder) 03NEW [10:28:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4005.ulsfo.wmnet with OS bookworm [10:28:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10682035 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti4005.ulsfo.wmnet with OS bookworm completed: - ganeti4005 (**PASS*... [10:28:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:34:27] right, I'm going to archive Flow pages on cawikiquote, kabwiki and sewikimedia [10:35:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10682088 (10phaultfinder) [10:36:54] say, is there a command to post log messages in here from the command line? there must be... [10:37:05] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[10:37:26] (03PS1) 10Jelto: gitlab: rename thanos object storage parameters [puppet] - 10https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) [10:39:50] (03PS1) 10Zoe: Set Flow boards readonly on cawikiquote, kabwiki and sewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131662 (https://phabricator.wikimedia.org/T380909) [10:41:25] FIRING: [2x] SystemdUnitFailed: puppetserver.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:41:55] ^^ is this worrying? [10:42:56] (03CR) 10Jelto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [10:44:30] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [10:45:15] (03PS1) 10Filippo Giunchedi: pontoon: fix enroll --force [puppet] - 10https://gerrit.wikimedia.org/r/1131663 [10:45:21] zip: there's a quick and dirty way to do it that we use for helmfile logging, idt we have a ready made utility to do it, but you can take a look at /usr/local/bin/helmfile_log_sal on the deployment server [10:45:47] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [10:46:31] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5166/console" [puppet] - 10https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [10:46:48] (03PS1) 10Elukey: Remove support for Python 3.7 and 3.8 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131665 [10:46:48] (03PS1) 10Elukey: Release version 4.0.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 [10:47:05] (03CR) 10CI reject: [V:04-1] 
Remove support for Python 3.7 and 3.8 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131665 (owner: 10Elukey) [10:47:07] (03CR) 10CI reject: [V:04-1] Release version 4.0.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 (owner: 10Elukey) [10:47:27] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10682142 (10cmooney) This has been stable ever since the replacement fwiw, no drops or errors etc. So I think we are good to close it {F58929621 width=600} [10:47:28] fabfur: it is being setup [10:47:48] claime: thanks! [10:48:01] effie: tnx! [10:48:02] most of what I'm after is just something to go "your script is done" tbh, I suppose i can always use my pushover account [10:48:47] (03CR) 10Jelto: [V:03+1] gitlab: rename thanos object storage parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [10:49:45] Hmm so not SAL logging, well if you have terminal bell you can always chain your script invoc with `&& echo -e '\a'` [10:49:50] zip ^ [10:50:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [10:50:20] that will ring your terminal bell and hopefully give you a notification, although not persistent [10:50:35] memcached errors expected, mcrouter is being roll-restarted [10:50:35] (03CR) 10Hnowlan: [C:03+1] site.pp: Add missing wikikube-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1131655 (https://phabricator.wikimedia.org/T384970) (owner: 10Clément Goubert) [10:50:45] (03CR) 10Clément Goubert: [C:03+2] site.pp: Add missing wikikube-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1131655 (https://phabricator.wikimedia.org/T384970) (owner: 10Clément 
Goubert) [10:51:47] good point [10:51:50] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10682176 (10Clement_Goubert) >>! In T384970#10681272, @Jhancock.wm wrote: > @Clement_Goubert hey i need a little favor. i noti... [10:53:49] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [10:54:06] !log zoe@deploy1003 manually-logged testing manual log helper script [10:55:39] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [10:55:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10682194 (10phaultfinder) [10:55:51] (03CR) 10Clément Goubert: [C:03+2] alertmanager: Add mediawiki-platform-task [puppet] - 10https://gerrit.wikimedia.org/r/1131025 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [10:56:43] (03CR) 10Clément Goubert: [C:03+2] team-sre: Add mw-cron alerting [alerts] - 10https://gerrit.wikimedia.org/r/1131356 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [10:56:49] (03CR) 10Tiziano Fogli: [C:03+1] pontoon: fix enroll --force [puppet] - 10https://gerrit.wikimedia.org/r/1131663 (owner: 10Filippo Giunchedi) [10:57:15] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: fix enroll --force [puppet] - 10https://gerrit.wikimedia.org/r/1131663 (owner: 10Filippo Giunchedi) [10:58:18] (03Merged) 10jenkins-bot: team-sre: Add mw-cron alerting [alerts] - 10https://gerrit.wikimedia.org/r/1131356 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [10:58:47] FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:59:28] (03CR) 10Joal: [C:03+1] "LGTM! Thanks folks" [puppet] - 10https://gerrit.wikimedia.org/r/1131300 (https://phabricator.wikimedia.org/T390029) (owner: 10Elukey) [11:01:01] this is also expected, the HelmReleaseBadStatus, it will recover soon [11:01:40] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [11:03:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:05:06] <_joe_> !log manually installing python3-opensearch on mwlog1002, temporarily [11:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:05:32] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [11:05:39] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [11:06:04] memcached errors are expected again [11:06:25] FIRING: [2x] SystemdUnitFailed: puppetserver.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:07:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet [11:07:48] FIRING: PuppetFailure: Puppet has failed on 
puppetserver2004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:08:32] (03CR) 10Hnowlan: [C:03+1] mw::periodic_job: Migrate blameStartupRegistry.php [puppet] - 10https://gerrit.wikimedia.org/r/1131037 (https://phabricator.wikimedia.org/T388540) (owner: 10Clément Goubert) [11:09:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:10:09] 06SRE, 10WMF-General-or-Unknown, 07Performance Issue, 07Wikimedia-production-error: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" errors - https://phabricator.wikimedia.org/T389734#10682227 (10tstarling) >>! In T389734#10676548, @Scott_French wrote: > @Krinkle or @tstarling -... 
[11:11:25] FIRING: [2x] SystemdUnitFailed: puppetserver.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet [11:17:59] (03PS1) 10Filippo Giunchedi: logstash: stringify 'assignments' from eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/1131672 (https://phabricator.wikimedia.org/T390140) [11:18:29] (03CR) 10Filippo Giunchedi: "Based on I3d8985c0d" [puppet] - 10https://gerrit.wikimedia.org/r/1131672 (https://phabricator.wikimedia.org/T390140) (owner: 10Filippo Giunchedi) [11:18:41] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti4005 [11:19:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti4005 [11:19:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10682262 (10phaultfinder) [11:20:02] (03PS1) 10Federico Ceratto: Fix depooling source vs target [cookbooks] - 10https://gerrit.wikimedia.org/r/1131673 (https://phabricator.wikimedia.org/T388383) [11:22:00] (03PS2) 10Federico Ceratto: Fix depooling source vs target [cookbooks] - 10https://gerrit.wikimedia.org/r/1131673 (https://phabricator.wikimedia.org/T388383) [11:22:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4005.ulsfo.wmnet to cluster ulsfo and group 1 [11:22:11] (03PS1) 10Joal: Remove webrequest_sampled_128 from turnilo config [puppet] - 10https://gerrit.wikimedia.org/r/1131674 (https://phabricator.wikimedia.org/T385198) [11:23:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti4005.ulsfo.wmnet to cluster ulsfo and group 1 [11:23:42] (03PS3) 10Federico Ceratto: Fix depooling source vs 
target [cookbooks] - 10https://gerrit.wikimedia.org/r/1131673 (https://phabricator.wikimedia.org/T388383) [11:23:57] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [11:26:00] (03CR) 10Effie Mouzeli: "LGTM 2 nits" [puppet] - 10https://gerrit.wikimedia.org/r/1131037 (https://phabricator.wikimedia.org/T388540) (owner: 10Clément Goubert) [11:26:01] (03PS1) 10Brouberol: mediawiki-dumps-legacy: run an envoy pod sidecar in the toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131676 (https://phabricator.wikimedia.org/T390059) [11:26:02] (03PS1) 10Brouberol: Add missing configuratiom file in the toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131677 (https://phabricator.wikimedia.org/T390059) [11:26:03] (03PS1) 10Brouberol: mediawiki-dumps-legacy: grant the toolbox more resources, for large dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131678 (https://phabricator.wikimedia.org/T390059) [11:26:06] (03PS1) 10Brouberol: mediawiki-dumps-legacy: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131679 (https://phabricator.wikimedia.org/T390059) [11:26:23] (03PS1) 10Clément Goubert: mw-cron: Add warning for serviceops dashboard [alerts] - 10https://gerrit.wikimedia.org/r/1131680 (https://phabricator.wikimedia.org/T385709) [11:27:11] (03CR) 10CI reject: [V:04-1] mediawiki-dumps-legacy: run an envoy pod sidecar in the toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131676 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:27:14] (03CR) 10CI reject: [V:04-1] Add missing configuratiom file in the toolbox pod [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1131677 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:27:15] (03Abandoned) 10Aklapper: Archive user talk pages even if the userpage doesn't exist [extensions/Flow] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1131627 (https://phabricator.wikimedia.org/T380911) (owner: 10Zoe) [11:27:19] (03CR) 10CI reject: [V:04-1] mediawiki-dumps-legacy: grant the toolbox more resources, for large dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131678 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:27:26] (03CR) 10CI reject: [V:04-1] mediawiki-dumps-legacy: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131679 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:29:06] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on puppetserver2004.codfw.wmnet with reason: being setup [11:31:55] (03PS1) 10Ladsgroup: Bump thumb steps ratio to 50% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131682 (https://phabricator.wikimedia.org/T360589) [11:31:57] !log brouberol@deploy1003 Started scap build-images: T390059 - add signal handlers in dumps code to display a stacktrace [11:32:04] T390059: Large wiki dump is getting stuck when running in airflow - https://phabricator.wikimedia.org/T390059 [11:32:36] !log brouberol@deploy1003 Finished scap build-images: T390059 - add signal handlers in dumps code to display a stacktrace (duration: 00m 39s) [11:32:58] (03CR) 10Brouberol: [C:03+1] Remove webrequest_sampled_128 from turnilo config [puppet] - 10https://gerrit.wikimedia.org/r/1131674 (https://phabricator.wikimedia.org/T385198) (owner: 10Joal) [11:33:01] jouncebot: nowandnext [11:33:01] No deployments scheduled for the next 0 hour(s) and 26 minute(s) [11:33:01] In 0 hour(s) and 26 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1200) 
[11:33:02] (03CR) 10Clément Goubert: mw::periodic_job: Migrate blameStartupRegistry.php (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1131037 (https://phabricator.wikimedia.org/T388540) (owner: 10Clément Goubert) [11:33:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131682 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [11:34:27] (03Merged) 10jenkins-bot: Bump thumb steps ratio to 50% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131682 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [11:34:40] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1131682|Bump thumb steps ratio to 50% (T360589)]] [11:34:45] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:34:49] (03CR) 10Btullis: [C:03+2] Remove webrequest_sampled_128 from turnilo config [puppet] - 10https://gerrit.wikimedia.org/r/1131674 (https://phabricator.wikimedia.org/T385198) (owner: 10Joal) [11:35:23] (03CR) 10Brouberol: [C:03+2] Remove webrequest_sampled_128 from turnilo config [puppet] - 10https://gerrit.wikimedia.org/r/1131674 (https://phabricator.wikimedia.org/T385198) (owner: 10Joal) [11:36:00] (03PS3) 10Brouberol: mediawiki-dumps-legacy: run an envoy pod sidecar in the toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131676 (https://phabricator.wikimedia.org/T390059) [11:36:02] (03PS3) 10Brouberol: Add missing configuratiom file in the toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131677 (https://phabricator.wikimedia.org/T390059) [11:36:05] (03PS3) 10Brouberol: mediawiki-dumps-legacy: grant the toolbox more resources, for large dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131678 (https://phabricator.wikimedia.org/T390059) [11:36:08] (03PS3) 10Brouberol: mediawiki-dumps-legacy: bump chart version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1131679 (https://phabricator.wikimedia.org/T390059) [11:36:47] (03PS19) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) [11:36:47] (03PS39) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [11:36:47] (03PS9) 10Tiziano Fogli: netbox-hiera: adding pdu type [puppet] - 10https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) [11:38:20] (03PS1) 10Brouberol: mediawiki-dumps-legacy: update to image to be able to display stack traces of runing processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131683 (https://phabricator.wikimedia.org/T390059) [11:38:30] (03CR) 10Tiziano Fogli: netbox-hiera: adding pdu type (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [11:40:46] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1131682|Bump thumb steps ratio to 50% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:40:50] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: run an envoy pod sidecar in the toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131676 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:40:51] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:41:00] (03CR) 10Btullis: [C:03+1] Add missing configuratiom file in the toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131677 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:41:10] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: grant the toolbox more resources, for large dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131678 
(https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:41:13] 06SRE, 06Infrastructure-Foundations: Create cookbook to update host network config based on Netbox - https://phabricator.wikimedia.org/T390163 (10cmooney) 03NEW p:05Triage→03Low [11:41:20] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131679 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:41:32] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: update to image to be able to display stack traces of runing processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131683 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:43:12] (03CR) 10Tiziano Fogli: [C:03+1] hieradata: move k8s prometheus1005 -> 1007 [puppet] - 10https://gerrit.wikimedia.org/r/1131301 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [11:43:12] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: run an envoy pod sidecar in the toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131676 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:43:15] (03CR) 10Brouberol: [C:03+2] Add missing configuratiom file in the toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131677 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:43:17] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: grant the toolbox more resources, for large dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131678 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:43:20] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131679 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:43:23] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: update to image to be able to display stack traces of runing processes [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1131683 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:44:38] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: run an envoy pod sidecar in the toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131676 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:44:39] (03Merged) 10jenkins-bot: Add missing configuratiom file in the toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131677 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:44:54] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: grant the toolbox more resources, for large dumps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131678 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:44:55] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131679 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:45:02] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: update to image to be able to display stack traces of runing processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131683 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [11:46:07] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [11:49:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [11:51:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [11:52:45] (03PS20) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) [11:52:45] (03PS40) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [11:52:45] (03PS10) 10Tiziano Fogli: netbox-hiera: adding pdu type 
[puppet] - 10https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) [11:53:00] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131682|Bump thumb steps ratio to 50% (T360589)]] (duration: 18m 20s) [11:53:05] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:53:30] (03CR) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [11:58:30] (03PS1) 10Effie Mouzeli: thumbor: use prometheus.io/scrape_by_name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131686 (https://phabricator.wikimedia.org/T389480) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1200) [12:02:01] (03Abandoned) 10Alexandros Kosiaris: mathoid: Upgrade all vendored modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006950 (owner: 10Clément Goubert) [12:02:11] jouncebot: nowandnext [12:02:11] For the next 0 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1200) [12:02:11] In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1300) [12:04:09] !log Updated security patches for T389235 [12:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:35] (03CR) 10Hnowlan: [C:03+1] "lgtm, but a suggestion" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131686 (https://phabricator.wikimedia.org/T389480) (owner: 10Effie Mouzeli) [12:04:55] (03PS2) 10Alexandros Kosiaris: mediawiki: Bump ingress module to 1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131636 (https://phabricator.wikimedia.org/T384944) [12:06:37] (03CR) 10Cathal Mooney: "LGTM overall. 
One nit/comment but it may have already been addressed ignore me if so." [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [12:07:24] (03PS3) 10Btullis: Temporarily exclude an-worker1202 from HDFS and YARN [puppet] - 10https://gerrit.wikimedia.org/r/1131401 (https://phabricator.wikimedia.org/T390048) [12:08:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (https://phabricator.wikimedia.org/T384455) (owner: 10Seanleong-wmde) [12:09:14] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5167/co" [puppet] - 10https://gerrit.wikimedia.org/r/1131401 (https://phabricator.wikimedia.org/T390048) (owner: 10Btullis) [12:13:17] (03CR) 10Effie Mouzeli: thumbor: use prometheus.io/scrape_by_name (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131686 (https://phabricator.wikimedia.org/T389480) (owner: 10Effie Mouzeli) [12:15:58] (03CR) 10Hnowlan: [C:03+1] thumbor: use prometheus.io/scrape_by_name (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131686 (https://phabricator.wikimedia.org/T389480) (owner: 10Effie Mouzeli) [12:16:52] (03CR) 10Effie Mouzeli: [C:03+2] thumbor: use prometheus.io/scrape_by_name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131686 (https://phabricator.wikimedia.org/T389480) (owner: 10Effie Mouzeli) [12:18:21] (03Merged) 10jenkins-bot: thumbor: use prometheus.io/scrape_by_name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131686 (https://phabricator.wikimedia.org/T389480) (owner: 10Effie Mouzeli) [12:20:24] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [12:20:31] !log jiji@deploy1003 helmfile 
[eqiad] DONE helmfile.d/services/thumbor: apply [12:21:17] (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki: Bump ingress module to 1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131636 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [12:22:01] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [12:23:05] (03PS3) 10Alexandros Kosiaris: mediawiki: Bump ingress module to 1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131636 (https://phabricator.wikimedia.org/T384944) [12:23:05] (03PS1) 10Alexandros Kosiaris: ingress: Add the ability to have >1 destination rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131692 (https://phabricator.wikimedia.org/T384944) [12:23:06] (03PS1) 10Alexandros Kosiaris: ingress: Add the ability to have >1 destination rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131693 (https://phabricator.wikimedia.org/T384944) [12:23:08] (03PS1) 10Alexandros Kosiaris: mediawiki: Bump ingress to module 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131694 (https://phabricator.wikimedia.org/T384944) [12:25:17] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [12:27:53] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [12:29:29] (03PS2) 10Alexandros Kosiaris: ingress: Add the ability to have >1 destination rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131693 (https://phabricator.wikimedia.org/T384944) [12:29:29] (03PS2) 10Alexandros Kosiaris: mediawiki: Bump ingress to module 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131694 (https://phabricator.wikimedia.org/T384944) [12:29:47] (03CR) 10Alexandros Kosiaris: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131636 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [12:30:21] !log jiji@deploy1003 helmfile [codfw] START 
helmfile.d/services/thumbor: apply [12:32:43] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [12:33:01] (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki: Bump ingress module to 1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131636 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [12:33:29] (03PS1) 10Cyndywikime: Remove GELevelingUpFeaturesEnabled and GEMentorDashboardEnabled feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131696 (https://phabricator.wikimedia.org/T379566) [12:34:05] (03PS2) 10Cyndywikime: Growth: Remove GELevelingUpFeaturesEnabled and GEMentorDashboardEnabled feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131696 (https://phabricator.wikimedia.org/T379566) [12:34:17] (03PS2) 10Arturo Borrero Gonzalez: openstack: networktests: support IPv6 and IPv4-only networks [puppet] - 10https://gerrit.wikimedia.org/r/1097440 (https://phabricator.wikimedia.org/T380728) [12:34:26] (03PS3) 10Arturo Borrero Gonzalez: openstack: networktests: support IPv6 and IPv4-only networks [puppet] - 10https://gerrit.wikimedia.org/r/1097440 (https://phabricator.wikimedia.org/T380728) [12:34:49] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@bbac659]: Keep airflow analytics_test up-to-date [12:35:03] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@bbac659]: Keep airflow analytics_test up-to-date (duration: 00m 14s) [12:35:30] (03Merged) 10jenkins-bot: mediawiki: Bump ingress module to 1.1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131636 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [12:36:27] !log enabling IPv6 on cloudsw devices in eqiad T389958 [12:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:32] T389958: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958 [12:37:08] (03PS4) 10Arturo Borrero Gonzalez: 
openstack: networktests: support IPv6 and IPv4-only networks [puppet] - 10https://gerrit.wikimedia.org/r/1097440 (https://phabricator.wikimedia.org/T380728) [12:41:04] (03PS1) 10Ilias Sarantopoulos: ml-services: reduce num of cpu cores in reference-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131697 (https://phabricator.wikimedia.org/T387019) [12:41:07] (03CR) 10Stevemunene: [C:03+1] "lgtm! thanks for this" [puppet] - 10https://gerrit.wikimedia.org/r/1131401 (https://phabricator.wikimedia.org/T390048) (owner: 10Btullis) [12:42:01] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [12:43:29] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1097440 (https://phabricator.wikimedia.org/T380728) (owner: 10Arturo Borrero Gonzalez) [12:43:52] (03CR) 10Klausman: [C:03+1] ml-services: reduce num of cpu cores in reference-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131697 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [12:44:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130101 (https://phabricator.wikimedia.org/T389609) (owner: 10Albertoleoncio) [12:46:23] (03PS5) 10Arturo Borrero Gonzalez: openstack: networktests: support IPv6 and IPv4-only networks [puppet] - 10https://gerrit.wikimedia.org/r/1097440 (https://phabricator.wikimedia.org/T380728) [12:46:47] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1097440 (https://phabricator.wikimedia.org/T380728) (owner: 10Arturo Borrero Gonzalez) [12:53:10] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: reduce num of cpu cores in reference-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131697 
(https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [12:54:28] (03Merged) 10jenkins-bot: ml-services: reduce num of cpu cores in reference-need [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131697 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [12:55:51] jouncebot: refresh [12:55:52] I refreshed my knowledge about deployments. [12:55:55] jouncebot: now [12:55:55] For the next 0 hour(s) and 4 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1200) [12:56:01] jouncebot: nowandnext [12:56:01] For the next 0 hour(s) and 3 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1200) [12:56:01] In 0 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1300) [12:56:23] !log isaranto@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [12:57:55] (03CR) 10Arnaudb: [C:03+1] gitlab: rename thanos object storage parameters [puppet] - 10https://gerrit.wikimedia.org/r/1131661 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [12:58:50] !log isaranto@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [12:59:04] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [12:59:28] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/mw-misc: apply [12:59:42] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1300). 
[13:00:05] ottomata, tgr, dcausse, Jhs, seanleong-wmde, and hashar: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] !log bump mw-misc to pick up https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1131636. It removes various hostnames from the SANs of mediawiki, but should be a noop [13:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:20] o/ [13:00:40] o/ [13:00:53] o/ [13:01:25] o/ [13:01:39] mine will be a no-op, i'm happy to quickly self deploy [13:01:41] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [13:01:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131482 (https://phabricator.wikimedia.org/T378402) (owner: 10Gergő Tisza) [13:01:43] I can deploy [13:01:47] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [13:01:52] okay tgr_ you go ahead [13:01:55] who else doesn't need to test (much)? [13:02:00] me! [13:02:18] tgr_: me [13:02:49] my change affects other systems unrelated to mw [13:03:16] hashar: you linked like twenty patches [13:03:17] same, actually ours are to the same config file (EventStreamConfig) and shouldn't affect each other. you could deploy them together [13:03:29] which of those do you want deployed? [13:04:06] tgr_: yes, all of them; I will deploy them in a single batch at the end of the window if there is time remaining [13:04:55] mine is testable by not seeing a bunch of errors from ckbwiki 😅 [13:04:57] seanleong-wmde: you don't need to test the patch, right? 
[13:06:18] nope, it's deploying to beta cluster [13:06:41] !log adding IBGP peerings between loopbacks in cloud-vrf on cloudsw devices in eqiad T389958 [13:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:45] T389958: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958 [13:06:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131413 (https://phabricator.wikimedia.org/T387908) (owner: 10Ottomata) [13:06:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131286 (https://phabricator.wikimedia.org/T388372) (owner: 10DCausse) [13:06:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131480 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [13:06:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (https://phabricator.wikimedia.org/T384455) (owner: 10Seanleong-wmde) [13:17:04] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1130346 has a merge conflict [13:17:04] seanleong-wmde: ^ [13:17:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:17:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:17:04] zuul is taking a vacation [13:17:04] (03PS1) 10Brouberol: airflow-analytics-test: update extra_dag_folder to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131702 [13:18:42] yeah, looks that way ;( [13:19:13] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1211.eqiad.wmnet onto db1255.eqiad.wmnet [13:19:13] !log marostegui@cumin1002 START - 
Cookbook sre.mysql.depool db1211 - Depool db1211.eqiad.wmnet to then clone it to db1255.eqiad.wmnet - marostegui@cumin1002 [13:21:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1211 - Depool db1211.eqiad.wmnet to then clone it to db1255.eqiad.wmnet - marostegui@cumin1002 [13:21:08] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1211.eqiad.wmnet onto db1255.eqiad.wmnet [13:21:21] tgr_ sorry, gimme a minute [13:25:01] (03Merged) 10jenkins-bot: ingress: Add the ability to have >1 destination rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131692 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [13:25:03] (03Merged) 10jenkins-bot: ingress: Add the ability to have >1 destination rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131693 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [13:25:38] (03CR) 10Hashar: "The `doc` environment fails under Python 3.12 because the doc uses autosummary which tries to import eg `docker_pkg.builder`." 
[docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131665 (owner: 10Elukey) [13:26:24] (03PS5) 10Bking: elasticsearch rolling-operation: add arguments for reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) [13:27:26] (03Merged) 10jenkins-bot: EventStreamConfig - keep geoip-* headers in eventgate-logging-external streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131413 (https://phabricator.wikimedia.org/T387908) (owner: 10Ottomata) [13:27:27] (03CR) 10Brouberol: [C:03+2] airflow-analytics-test: update extra_dag_folder to main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131702 (owner: 10Brouberol) [13:27:29] (03Merged) 10jenkins-bot: wdqs: enable hive/hdfs ingestion for rdf update streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131286 (https://phabricator.wikimedia.org/T388372) (owner: 10DCausse) [13:27:32] (03Merged) 10jenkins-bot: Enable SUL3 login for everyone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131480 (https://phabricator.wikimedia.org/T384219) (owner: 10Gergő Tisza) [13:27:47] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1131413|EventStreamConfig - keep geoip-* headers in eventgate-logging-external streams (T387908 T387850)]], [[gerrit:1131286|wdqs: enable hive/hdfs ingestion for rdf update streams (T388372)]], [[gerrit:1131480|Enable SUL3 login for everyone (T384219)]] [13:27:47] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10682959 (10Jhancock.wm) all good. thank you for your help! [13:28:36] (03CR) 10Hashar: "To clarify: docker-pkg does not support python3.12, the docker dependency needs to be bumped for that." 
[docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131665 (owner: 10Elukey) [13:29:52] (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki: Bump ingress to module 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131694 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [13:29:56] 06SRE, 06Infrastructure-Foundations, 10netops, 07IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10682999 (10aborrero) there was a major network outage as a result of the operations that affected all WMCS systems, including Ceph and Toolforge kubernetes. [13:30:18] tgr_, Hi, do you mind trying it again? I rebased and submitted a new patch. Thanks. ChangeId: 1130346 [13:32:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2300.codfw.wmnet with OS bookworm [13:32:21] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10683027 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2300.codfw.wmnet with... [13:32:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2301.codfw.wmnet with OS bookworm [13:32:30] (03Merged) 10jenkins-bot: mediawiki: Bump ingress to module 1.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131694 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [13:32:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10683030 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2301.codfw.wmnet with... 
[13:32:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2302.codfw.wmnet with OS bookworm [13:32:43] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2(055|056|061|062|069|073|074|075|076|087|088|089||090|091|111) for begin OpenSearch migration - bking@cumin2002 - T388610 [13:32:44] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2(055|056|061|062|069|073|074|075|076|087|088|089||090|091|111) for begin OpenSearch migration - bking@cumin2002 - T388610 [13:32:45] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10683032 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2302.codfw.wmnet with... [13:32:55] !log tgr@deploy1003 dcausse, otto, tgr: Backport for [[gerrit:1131413|EventStreamConfig - keep geoip-* headers in eventgate-logging-external streams (T387908 T387850)]], [[gerrit:1131286|wdqs: enable hive/hdfs ingestion for rdf update streams (T388372)]], [[gerrit:1131480|Enable SUL3 login for everyone (T384219)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:34:01] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2(055*|056*|061*|062*|069*|073*|074*|075*|076*|087*|088*|089*|*|090*|091*|111) for begin OpenSearch migration - bking@cumin2002 - T388610 [13:34:02] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2(055*|056*|061*|062*|069*|073*|074*|075*|076*|087*|088*|089*|*|090*|091*|111) for begin OpenSearch migration - bking@cumin2002 - T388610 [13:34:21] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2(055*|056*|061*|062*|069*|073*|074*|075*|076*|087*|088*|089*|*|090*|091*|111*) for begin OpenSearch migration - 
bking@cumin2002 - T388610 [13:34:21] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2(055*|056*|061*|062*|069*|073*|074*|075*|076*|087*|088*|089*|*|090*|091*|111*) for begin OpenSearch migration - bking@cumin2002 - T388610 [13:34:47] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2(055*|056*|061*|062*|069*|073*|074*|075*|076*|087*|088*|089*|090*|091*|111*) for begin OpenSearch migration - bking@cumin2002 - T388610 [13:34:47] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2(055*|056*|061*|062*|069*|073*|074*|075*|076*|087*|088*|089*|090*|091*|111*) for begin OpenSearch migration - bking@cumin2002 - T388610 [13:35:25] inflatador: beware that we're serving search traffic from codfw (we haven't deployed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1131359 yet) [13:36:14] tgr_, do you have a rough idea of when you'll get to my patch? i lost track of what order things are done in this window [13:36:27] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:36:32] I wouldn't mind throwing in a config change, but I imagine I should probably just schedule it for 20:00UTC and stay up late [13:36:37] !log tgr@deploy1003 dcausse, otto, tgr: Continuing with sync [13:37:02] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [13:37:21] Jhs: if zuul takes 20 min per merge then not soon :( [13:37:22] (03CR) 10Btullis: [V:03+1 C:03+2] Temporarily exclude an-worker1202 from HDFS and YARN [puppet] - 10https://gerrit.wikimedia.org/r/1131401 (https://phabricator.wikimedia.org/T390048) (owner: 10Btullis) [13:37:49] zip: if it doesn't need much testing, can just stack it on some other patch [13:39:03] tgr_, ack. 
fwiw, mine's attached to a UBN task, T390142 (nothing's actually broken per se, but apparently the current state throws a lot of errors from ckbwiki, and i imagine it would be nice to get rid of them) [13:41:47] tgr_: it does not [13:42:00] yeah I'd really appreciate getting that UBN one deployed because log noise ^ [13:42:12] here it be: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1131662 [13:42:21] zip: ok, just add it to the wiki page [13:42:25] ta [13:42:56] tgr_: should I add it back into the wiki page as well or wait until the next window? [13:43:41] seanleong-wmde: no harm in having stuff on the page. Worst case it won't get deployed. [13:43:48] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131413|EventStreamConfig - keep geoip-* headers in eventgate-logging-external streams (T387908 T387850)]], [[gerrit:1131286|wdqs: enable hive/hdfs ingestion for rdf update streams (T388372)]], [[gerrit:1131480|Enable SUL3 login for everyone (T384219)]] (duration: 16m 01s) [13:43:52] it's just a beta patch though so it will [13:44:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2302.codfw.wmnet with reason: host reimage [13:44:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2301.codfw.wmnet with reason: host reimage [13:44:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2300.codfw.wmnet with reason: host reimage [13:44:33] (03PS4) 10Federico Ceratto: clone.py: Fix depooling source vs target [cookbooks] - 10https://gerrit.wikimedia.org/r/1131673 (https://phabricator.wikimedia.org/T388383) [13:44:33] (03PS1) 10Federico Ceratto: clone.py: Add logic to handle hosts unknown to dbctl [cookbooks] - 10https://gerrit.wikimedia.org/r/1131711 (https://phabricator.wikimedia.org/T388383) [13:44:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10683094 
(10phaultfinder) [13:45:37] tgr_: it's on the wiki [13:45:41] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1211.eqiad.wmnet onto db1255.eqiad.wmnet [13:45:42] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1211.eqiad.wmnet onto db1255.eqiad.wmnet [13:45:57] tgr_: Got it, thanks! Mine's current on the wiki alrdy as well [13:46:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131662 (https://phabricator.wikimedia.org/T380909) (owner: 10Zoe) [13:46:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (owner: 10Seanleong-wmde) [13:46:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131651 (https://phabricator.wikimedia.org/T310051) (owner: 10Jon Harald Søby) [13:46:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2302.codfw.wmnet with reason: host reimage [13:48:05] (03Merged) 10jenkins-bot: Set Flow boards readonly on cawikiquote, kabwiki and sewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131662 (https://phabricator.wikimedia.org/T380909) (owner: 10Zoe) [13:48:19] (03Merged) 10jenkins-bot: Increase entityAccessLimit from 400 to 500 for all wikis except commons. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130346 (owner: 10Seanleong-wmde) [13:48:21] (03Merged) 10jenkins-bot: Change category collation for ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131651 (https://phabricator.wikimedia.org/T310051) (owner: 10Jon Harald Søby) [13:48:26] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1211.eqiad.wmnet onto db1255.eqiad.wmnet [13:48:35] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1131662|Set Flow boards readonly on cawikiquote, kabwiki and sewikimedia (T380909)]], [[gerrit:1130346|Increase entityAccessLimit from 400 to 500 for all wikis except commons.]], [[gerrit:1131651|Change category collation for ckbwiki (T310051 T390142)]] [13:48:47] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1211.eqiad.wmnet onto db1255.eqiad.wmnet [13:49:48] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1211.eqiad.wmnet onto db1255.eqiad.wmnet [13:49:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2301.codfw.wmnet with reason: host reimage [13:50:36] naturally I've just spotted that I missed one, RIP [13:50:43] oh well, another day, another backport... [13:52:40] ty! tgr_, i can verify that my patch works as expected. [13:52:57] !log tgr@deploy1003 jhsoby, seanleong-wmde, zoe, tgr: Backport for [[gerrit:1131662|Set Flow boards readonly on cawikiquote, kabwiki and sewikimedia (T380909)]], [[gerrit:1130346|Increase entityAccessLimit from 400 to 500 for all wikis except commons.]], [[gerrit:1131651|Change category collation for ckbwiki (T310051 T390142)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:53:02] tgr_: thanks! [13:53:13] tgr_: likewise [13:53:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2300.codfw.wmnet with reason: host reimage [13:53:31] tgr_: same, thanks! 
[13:53:34] Jhs: please check that there aren't lots of errors on ckbwiki :) [13:53:48] mine also seems to work on mwdebug, but i'm not sure where to check for errors [13:53:55] it also needs a script run though, as you hopefully saw [13:54:04] right [13:54:08] !log tgr@deploy1003 jhsoby, seanleong-wmde, zoe, tgr: Continuing with sync [13:54:34] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: P{P:netbox::host%location ~ "A.*codfw"} and P{O:elasticsearch::cirrus} for begin OpenSearch migration - bking@cumin2002 - T388610 [13:54:39] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: P{P:netbox::host%location ~ "A.*codfw"} and P{O:elasticsearch::cirrus} for begin OpenSearch migration - bking@cumin2002 - T388610 [13:58:46] (03PS1) 10Bking: elastic: Change first batch of prod elastic hosts to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1131715 (https://phabricator.wikimedia.org/T388610) [13:59:11] (03CR) 10CI reject: [V:04-1] elastic: Change first batch of prod elastic hosts to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1131715 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:00:24] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts an-worker1202.eqiad.wmnet [14:01:08] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131662|Set Flow boards readonly on cawikiquote, kabwiki and sewikimedia (T380909)]], [[gerrit:1130346|Increase entityAccessLimit from 400 to 500 for all wikis except commons.]], [[gerrit:1131651|Change category collation for ckbwiki (T310051 T390142)]] (duration: 12m 32s) [14:01:43] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:01:46] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Update the 1.19 image to be based on Bookworm, not bullseye-backports [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/1131631 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [14:01:47] !log running mwscript updateCollation.php --wiki=ckbwiki --previous-collation=xx-uca-ckb [14:02:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:02:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2302.codfw.wmnet with OS bookworm [14:02:04] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131339 (owner: 10Muehlenhoff) [14:02:07] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10683179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2302.codfw.wmnet with OS... 
[14:03:41] zip: if you want to add more stuff, the next hour is free and I have three more scaps to do [14:03:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131444 (owner: 10Gergő Tisza) [14:03:56] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [14:04:06] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [14:04:41] (03Merged) 10jenkins-bot: Fix badpass logging for locally nonexistent users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131444 (owner: 10Gergő Tisza) [14:04:53] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1131444|Fix badpass logging for locally nonexistent users]] [14:04:54] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:06:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:06:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2301.codfw.wmnet with OS bookworm [14:06:33] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10683183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2301.codfw.wmnet with OS... 
[14:07:28] !log tgr@deploy1003 tgr: Backport for [[gerrit:1131444|Fix badpass logging for locally nonexistent users]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:08:09] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:08:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:08:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2300.codfw.wmnet with OS bookworm [14:08:41] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10683188 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2300.codfw.wmnet with OS... 
[14:08:43] tgr_: lemme check in with Peter [14:08:50] (03CR) 10Cyndywikime: "This patch is now ready for review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131696 (https://phabricator.wikimedia.org/T379566) (owner: 10Cyndywikime) [14:09:05] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10683191 (10Jhancock.wm) [14:16:16] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:16:20] !log tgr@deploy1003 tgr: Continuing with sync [14:18:18] (03CR) 10Klausman: [C:03+1] Create insetup role for ML servers with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1131385 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [14:19:13] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:21:14] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10683254 (10bking) @Jclark-ctr , sorry for the confusion . `elastic112[3-5]` have been repurposed as relforge hosts, so they don't actually exist. Ref T384966 for more details. [14:21:47] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:24:35] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131444|Fix badpass logging for locally nonexistent users]] (duration: 19m 42s) [14:25:14] (03CR) 10Daniel Kinzler: [C:04-1] "One of the APIs has the wrong name. Other than that, looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131384 (https://phabricator.wikimedia.org/T389407) (owner: 10BPirkle) [14:26:11] tgr_: have you finished the deploys? 
[14:26:47] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [14:27:55] hashar: I have a few more [14:27:55] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1131639 (owner: 10Slyngshede) [14:28:00] Jhs: script is done [14:28:10] (03CR) 10Muehlenhoff: [C:03+2] Create insetup role for ML servers with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1131385 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [14:28:31] ah it is still ongoing :b [14:28:57] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:29:14] (03CR) 10Hnowlan: [C:03+1] "@abreault@wikimedia.org - would you like me to deploy this change?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130728 (https://phabricator.wikimedia.org/T389628) (owner: 10Arlolra) [14:29:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131481 (https://phabricator.wikimedia.org/T384220) (owner: 10Gergő Tisza) [14:30:13] tgr_, excellent. Thanks! 
[14:30:17] (03Merged) 10jenkins-bot: Enable SUL3 for temp users on group 0/1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131481 (https://phabricator.wikimedia.org/T384220) (owner: 10Gergő Tisza) [14:30:30] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1131481|Enable SUL3 for temp users on group 0/1 (T384220)]] [14:30:34] T384220: SUL3 Phase 5: Staged rollout for all temporary accounts - https://phabricator.wikimedia.org/T384220 [14:32:01] (03CR) 10Arlolra: "Yes, please" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130728 (https://phabricator.wikimedia.org/T389628) (owner: 10Arlolra) [14:32:45] (03PS3) 10Ssingh: sre.network.cf: log if no changes were made [cookbooks] - 10https://gerrit.wikimedia.org/r/1130135 [14:34:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10683309 (10phaultfinder) [14:35:13] !log tgr@deploy1003 tgr: Backport for [[gerrit:1131481|Enable SUL3 for temp users on group 0/1 (T384220)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:35:38] (03CR) 10Bking: [C:03+2] Add opensearch-knn [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1131068 (https://phabricator.wikimedia.org/T389812) (owner: 10DCausse) [14:36:42] (03CR) 10Silvan Heintze: [C:03+1] "LGTM, thank you" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131353 (https://phabricator.wikimedia.org/T389190) (owner: 10Jakob) [14:36:51] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10683332 (10RobH) 05Open→03Resolved Awesome, I've updated the ticket to Interxion so they can close it. 
[14:38:26] (03CR) 10Slyngshede: [C:03+2] Permission request: Remove ticket field from permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1131639 (owner: 10Slyngshede) [14:39:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:39:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:40:13] !log uploaded Boost 1.83.0-4.1~wmf12u1 (backport of Boost 1.83 to Bookworm, needed by Mapnik 4.0.6) T389776 [14:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:18] T389776: Kartotherian slowly leaks memory until it reaches OOM - https://phabricator.wikimedia.org/T389776 [14:40:18] (03CR) 10Ssingh: sre.network.cf: log if no changes were made (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1130135 (owner: 10Ssingh) [14:40:24] (03CR) 10Elukey: "To keep archives happy - I had a chat with Antoine about the specific CI failure, that seems more related to the tox job. 
The distutil iss" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131665 (owner: 10Elukey) [14:45:29] !log tgr@deploy1003 tgr: Continuing with sync [14:48:30] (03Abandoned) 10Muehlenhoff: Assign puppetserver role to puppetserver2004 [puppet] - 10https://gerrit.wikimedia.org/r/1130069 (https://phabricator.wikimedia.org/T381274) (owner: 10Muehlenhoff) [14:48:39] (03PS3) 10Muehlenhoff: Configure puppetserver2004 [puppet] - 10https://gerrit.wikimedia.org/r/1130070 (https://phabricator.wikimedia.org/T381274) [14:49:23] (03PS2) 10Chuckonwumelu: Add Chuck key [puppet] - 10https://gerrit.wikimedia.org/r/1131633 [14:50:02] (03CR) 10CI reject: [V:04-1] Add Chuck key [puppet] - 10https://gerrit.wikimedia.org/r/1131633 (owner: 10Chuckonwumelu) [14:51:18] (03Merged) 10jenkins-bot: Permission request: Remove ticket field from permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1131639 (owner: 10Slyngshede) [14:51:31] jouncebot: now [14:51:31] No deployments scheduled for the next 0 hour(s) and 8 minute(s) [14:52:30] Hey all - would like to get a sec mitigation deployed in PS.php related to the ongoing incident. Let me know if I shouldn't... 
[14:52:50] (03CR) 10Muehlenhoff: [C:03+2] Configure puppetserver2004 [puppet] - 10https://gerrit.wikimedia.org/r/1130070 (https://phabricator.wikimedia.org/T381274) (owner: 10Muehlenhoff) [14:52:58] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131481|Enable SUL3 for temp users on group 0/1 (T384220)]] (duration: 22m 27s) [14:53:02] T384220: SUL3 Phase 5: Staged rollout for all temporary accounts - https://phabricator.wikimedia.org/T384220 [14:53:54] (03CR) 10Clément Goubert: [C:03+2] mw-cron: Add warning for serviceops dashboard [alerts] - 10https://gerrit.wikimedia.org/r/1131680 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [14:53:59] hashar: sorry, ran out of time [14:54:10] that window was a bit ambitious :b [14:54:17] !log UTC afternoon deploys done [14:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:33] tgr_: have you managed to deploy your SUL3 rollout patches? [14:54:45] all but one [14:54:55] can wait until the evening [14:54:58] ok [14:55:06] (03Merged) 10jenkins-bot: mw-cron: Add warning for serviceops dashboard [alerts] - 10https://gerrit.wikimedia.org/r/1131680 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [14:55:18] I am pushing those config cleanup patches [14:56:00] (03CR) 10Hashar: [C:03+2] Remove unnecessary boolean statement for $wmgIncreaseDefaultVectorFontSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127929 (https://phabricator.wikimedia.org/T388905) (owner: 10Jdlrobson) [14:56:10] 06SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817#10683419 (10Scott_French) a:05joanna_borun→03Scott_French [14:57:21] (03Merged) 10jenkins-bot: Remove unnecessary boolean statement for $wmgIncreaseDefaultVectorFontSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127929 (https://phabricator.wikimedia.org/T388905) (owner: 10Jdlrobson) [14:57:44] (03CR) 
10Hashar: [C:03+2] Remove A/B test enrollment flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127930 (https://phabricator.wikimedia.org/T388905) (owner: 10Jdlrobson) [14:58:21] hashar: can you wait for sbassett? [14:58:27] (03CR) 10Hashar: [C:03+2] Drop CodeEditorEnableCore flag: always true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125095 (owner: 10Hashar) [14:58:30] that one is relatively urgent [14:58:34] (03Merged) 10jenkins-bot: Remove A/B test enrollment flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127930 (https://phabricator.wikimedia.org/T388905) (owner: 10Jdlrobson) [14:58:43] 06SRE, 06Infrastructure-Foundations, 10netops, 07IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10683425 (10cmooney) Just to confirm the timeline of events: (IMPORTANT) Mar 27 13:06:57: IBGP configuration commited on all 4 cloudsw, enabling IBGP in the... [14:58:59] which patch is that? [14:59:08] private [14:59:22] I can roll it together with the series of unused patch [14:59:28] s/unused patch/unused configs/ [14:59:40] (03CR) 10Muehlenhoff: [C:03+1] "/etc/apt/sources.list.d is managed via recurse/purge, so these are pruned by Puppet." [puppet] - 10https://gerrit.wikimedia.org/r/1130195 (https://phabricator.wikimedia.org/T388388) (owner: 10JHathaway) [15:00:01] !log installing setuptools security updates [15:00:02] (03Merged) 10jenkins-bot: Drop CodeEditorEnableCore flag: always true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125095 (owner: 10Hashar) [15:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:04] andre and jnuche: gettimeofday() says it's time for Train log triage. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1500) [15:00:31] hashar: that works too, just make sure not to scap while he is in the middle of editing PrivateSettings [15:00:36] hashar: I just need to scap out PS.php real quick and make sure it’s stable, that ok? [15:00:48] a sync is like 20 minutes :) [15:01:04] let me +2 those pending patches and we can roll them all in one go [15:01:50] (03CR) 10Hashar: [C:03+2] Remove obsolete $wgMinervaApplyKnownTemplateHacks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127887 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [15:01:50] (03CR) 10Hashar: [C:03+2] Remove obsolete $wgPopupsEventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127897 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [15:01:50] (03CR) 10Hashar: [C:03+2] Remove obsolete $wgPopupsOptInStateForNewAccounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127898 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [15:01:51] (03CR) 10Hashar: [C:03+2] Remove obsolete $wgRelatedArticlesLoggingBucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127900 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [15:02:33] sbassett: you can prepare the privatesettings.php fix and I will scap it with the other changes [15:02:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:02:46] hashar: Ok, I’ll wait if you’ll run a sync world. PS.php is fine syntax-wise, I’ll monitor mediawiki-errors. 
[15:02:53] (03Merged) 10jenkins-bot: Remove obsolete $wgMinervaApplyKnownTemplateHacks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127887 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [15:02:56] (03Merged) 10jenkins-bot: Remove obsolete $wgPopupsEventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127897 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [15:02:59] (03Merged) 10jenkins-bot: Remove obsolete $wgPopupsOptInStateForNewAccounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127898 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [15:03:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:03:01] (03Merged) 10jenkins-bot: Remove obsolete $wgRelatedArticlesLoggingBucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127900 (https://phabricator.wikimedia.org/T388905) (owner: 10Hashar) [15:03:04] PS.php is good to go right now, change is there, commited to private repo. 
[15:03:13] I am merging them one by one cause some might conflict with each others :) [15:03:31] (03CR) 10Hashar: [C:03+2] Remove obsolete $wgMediaInfoMediaSearchHasLtrPlugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127886 (https://phabricator.wikimedia.org/T297863) (owner: 10Hashar) [15:03:40] (03CR) 10Hashar: [C:03+2] Remove obsolete $wgNoticeFundraisingUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127889 (owner: 10Hashar) [15:03:42] (03CR) 10Hashar: [C:03+2] Remove obsolete $wgNoticeReporterDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127890 (https://phabricator.wikimedia.org/T232912) (owner: 10Hashar) [15:03:54] sometime I feel I should have done one single big patch [15:04:03] (03PS4) 10Filippo Giunchedi: benthos: update the webrequest_live instance [puppet] - 10https://gerrit.wikimedia.org/r/1131300 (https://phabricator.wikimedia.org/T390029) (owner: 10Elukey) [15:04:25] (03Merged) 10jenkins-bot: Remove obsolete $wgMediaInfoMediaSearchHasLtrPlugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127886 (https://phabricator.wikimedia.org/T297863) (owner: 10Hashar) [15:04:31] (03Merged) 10jenkins-bot: Remove obsolete $wgNoticeFundraisingUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127889 (owner: 10Hashar) [15:04:33] (03Merged) 10jenkins-bot: Remove obsolete $wgNoticeReporterDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127890 (https://phabricator.wikimedia.org/T232912) (owner: 10Hashar) [15:04:34] (03CR) 10Hashar: [C:03+2] beta: remove obsolete $wgMwEmbedModuleConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127888 (https://phabricator.wikimedia.org/T100106) (owner: 10Hashar) [15:04:58] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10683453 (10BTullis) a:05BTullis→03Papaul [15:05:24] (03Merged) 10jenkins-bot: beta: remove obsolete 
$wgMwEmbedModuleConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127888 (https://phabricator.wikimedia.org/T100106) (owner: 10Hashar) [15:06:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74472 and previous config saved to /var/cache/conftool/dbconfig/20250327-150601-root.json [15:06:21] sbassett: ready to sync? [15:06:27] I am ready to press enter :) [15:06:36] hashar: /private/PS.php should be good to go, yes. [15:06:39] (03CR) 10Elukey: [C:03+2] benthos: update the webrequest_live instance [puppet] - 10https://gerrit.wikimedia.org/r/1131300 (https://phabricator.wikimedia.org/T390029) (owner: 10Elukey) [15:06:40] great!! [15:06:41] !log hashar@deploy1003 Started scap sync-world: Sync patch to PrivateSettings.php and removal of unused configs (Gerrit: 1127930 1127889 1127890 1127886 1125095 1127900 1127898 1127887 1127897 1127888 1127929) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:42] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [15:06:54] thanks for the wait, I have been trying to push that series of patch for some days :) [15:06:58] !log hashar@deploy1003 sync-world aborted: Sync patch to PrivateSettings.php and removal of unused configs (Gerrit: 1127930 1127889 1127890 1127886 1125095 1127900 1127898 1127887 1127897 1127888 1127929) (duration: 00m 16s) [15:07:42] (03PS1) 10Aqu: airflow-analytics-test: fix dependency on extra_dag_folders [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131723 [15:07:44] !log hashar@deploy1003 Started scap sync-world: Sync patch to PrivateSettings.php and removal of unused configs (Gerrit: 1127930 1127889 1127890 1127886 1125095 1127900 1127898 1127887 1127897 
1127888 1127929) [15:07:52] it is flying [15:10:44] 06SRE, 10SRE-Access-Requests: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817#10683471 (10Scott_French) [15:11:59] (03PS1) 10Brouberol: airflow: reduce memory allotted to each task pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131726 [15:11:59] (03PS1) 10Brouberol: airflow-main.dse-k8e-eqiad: increase memory resource quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131727 [15:12:11] (03CR) 10Brouberol: [C:03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131723 (owner: 10Aqu) [15:14:38] 15:14:34 K8s deployment progress: 20% (ok: 498; fail: 0; left: 1916) - [15:14:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2074-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:15:40] !log update benthos@webrequest-live's config on centrallog nodes to new Kafka topics (haproxy vs varnishkafka) - T390029 [15:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:45] T390029: Migrate Benthos `webrequest_sampled_live` to feed from HAProxy data - https://phabricator.wikimedia.org/T390029 [15:15:53] hashar: looks good, if the PS.php mitigation was bad, I think we’d see it by now. 
[15:17:07] my guess is k8s only switch once all nodes have been moved [15:17:13] well I don't know really [15:18:22] (03CR) 10Volans: [C:03+1] "LGTM, I'll leave to Arzhel the final word" [cookbooks] - 10https://gerrit.wikimedia.org/r/1130135 (owner: 10Ssingh) [15:18:43] (03CR) 10Snwachukwu: [C:03+1] airflow: reduce memory allotted to each task pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131726 (owner: 10Brouberol) [15:19:00] (03CR) 10Jelto: [C:03+2] gerrit: raise heap limit from 32g to 64g [puppet] - 10https://gerrit.wikimedia.org/r/1130597 (https://phabricator.wikimedia.org/T387223) (owner: 10Hashar) [15:19:03] (03CR) 10Jelto: [C:03+2] gerrit: enable pushing notifications to browsers [puppet] - 10https://gerrit.wikimedia.org/r/1130656 (https://phabricator.wikimedia.org/T389327) (owner: 10Hashar) [15:19:04] No it doesn't [15:19:13] 10ops-codfw, 06SRE, 06DC-Ops: OutboundInterfaceErrors - https://phabricator.wikimedia.org/T390062#10683529 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:19:36] !log hashar@deploy1003 Finished scap sync-world: Sync patch to PrivateSettings.php and removal of unused configs (Gerrit: 1127930 1127889 1127890 1127886 1125095 1127900 1127898 1127887 1127897 1127888 1127929) (duration: 11m 52s) [15:19:39] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2055-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:19:43] sbassett: done! [15:19:59] It scales up with new pods up to maxSurge, and deletes pods down to maxUnavailable. When a pod is ready to receive requests, it receives requests.
[15:20:18] Which means during a deployment you'll have a mix of pods running the old and new code responding to requests. [15:20:33] ahh so we serve traffic with mixed versions while the pods are upgraded? [15:20:36] Yes. [15:20:38] ah yeah [15:20:42] cool :-] [15:20:43] (03CR) 10Snwachukwu: airflow-main.dse-k8e-eqiad: increase memory resource quota (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131727 (owner: 10Brouberol) [15:21:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74473 and previous config saved to /var/cache/conftool/dbconfig/20250327-152106-root.json [15:21:23] Which is why there are three steps, first one is to stop at testservers so you can test directly using XWD, second to canaries which get actual traffic, with logstash watch for a sudden rise in errors, then full deploy to the rest of prod [15:21:37] back in the day we tried to reduce the span of time during which we had mixed versions [15:21:44] So do we. 
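The maxSurge / maxUnavailable behaviour described above can be sketched as a small simulation. This is a simplified model, not the Kubernetes controller itself: the function name `rolling_update` is hypothetical, limits are plain integers (real Deployments also accept percentages), and new pods are assumed to become ready immediately.

```python
def rolling_update(replicas, max_surge, max_unavailable):
    """Yield (old_pods, new_pods) after each step of a simplified rolling update.

    Assumption: every new pod is ready as soon as it is created.
    """
    if max_surge == 0 and max_unavailable == 0:
        # With both limits at zero no pod could ever be added or removed.
        raise ValueError("max_surge and max_unavailable cannot both be 0")

    old, new = replicas, 0
    yield old, new
    while old > 0 or new < replicas:
        # Scale up: the total pod count may not exceed replicas + max_surge.
        can_add = min(replicas - new, replicas + max_surge - (old + new))
        new += max(can_add, 0)
        # Scale down: ready pods may not drop below replicas - max_unavailable.
        can_remove = min(old, (old + new) - (replicas - max_unavailable))
        old -= max(can_remove, 0)
        yield old, new


if __name__ == "__main__":
    for old, new in rolling_update(replicas=4, max_surge=1, max_unavailable=1):
        print(f"old={old} new={new}")
```

Any step where both counts are non-zero is a moment when requests can be answered by either version, which is the mixed-version window discussed in the channel.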
[15:21:53] This is as fast as we can do it without service interruption [15:21:53] since requests from a single client could then be served by different versions which could have side effects [15:21:58] (03CR) 10Brouberol: airflow-main.dse-k8e-eqiad: increase memory resource quota (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131727 (owner: 10Brouberol) [15:22:47] but I don't think we ever had the issue, or that was not a source of errors [15:22:51] (03CR) 10Ottomata: [C:03+2] eventgate-logging-external - upgrade to node20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131415 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [15:22:54] anyway, thank you for the update :) [15:24:26] (03Merged) 10jenkins-bot: eventgate-logging-external - upgrade to node20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131415 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [15:24:34] Is the beta cluster being down (503s) related to what's currently being done here? [15:25:03] (03PS3) 10Chuckonwumelu: Add Chuck key [puppet] - 10https://gerrit.wikimedia.org/r/1131633 [15:26:27] (03PS1) 10Slyngshede: D:apereo_cas::service do not excluded unfiltered attributes [puppet] - 10https://gerrit.wikimedia.org/r/1131730 [15:26:42] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:26:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10683573 (10Jhancock.wm) [15:26:55] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:28:10] !log upgrading eventgate-logging-external to node20 (using new per stream header enrich setting), first testing in staging. 
- T383814, T387908 [15:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:17] T383814: Upgrade eventgate-wikimedia to node20 - https://phabricator.wikimedia.org/T383814 [15:28:18] T387908: [BUG] eventgate-logging-external drops previously collected http request headers - https://phabricator.wikimedia.org/T387908 [15:28:20] Kemayo: this place is mostly for production/infra. I have merged a bunch of cleanup patches for mediawiki-config, which might affect the beta cluster. That really depends on the issue you are seeing [15:28:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [15:28:26] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [15:28:52] hashar: Yeah, I just saw various beta config stuff getting merged, so I figured it might be something active happening. [15:28:59] And nobody really owns beta, so there's no good place to ask about it.
[15:28:59] !log Restarting Gerrit to raise heap from 32G to 64G (T387223) and to enable pushing notifications to browsers (T389327) [15:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:05] T387223: Remove explicit enablement of G1 garbage collector for Gerrit - https://phabricator.wikimedia.org/T387223 [15:29:05] T389327: Enable browser notifications system in Gerrit - https://phabricator.wikimedia.org/T389327 [15:29:07] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5168/co" [puppet] - 10https://gerrit.wikimedia.org/r/1131730 (owner: 10Slyngshede) [15:29:18] jouncebot: nowandnext [15:29:18] For the next 0 hour(s) and 30 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1500) [15:29:18] In 0 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1600) [15:29:26] hashar: As for what's happening, beta 503s on all pages. E.g. https://en.wikipedia.beta.wmflabs.org/ [15:29:27] Kemayo: here is ok, or you can also ask in #wikimedia-releng [15:29:34] ah [15:29:47] well it is worth filing this as an unbreak now against #beta-cluster-infrastructure [15:29:58] "No server is available to handle this request" [15:30:04] but maybe people are acting on it and they should be syncing in #wikimedia-releng if they are doing something [15:30:33] (restarting gerrit) [15:31:27] Fair, I'll ask over there. 
[15:32:38] o/ we'd like to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1131359 (we depooled a full row of elastic machines in codfw and search latencies are pretty bad) [15:33:08] 06SRE, 06Infrastructure-Foundations, 10netops, 07IPv6: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10683594 (10cmooney) @aborrero @taavi one thing we could maybe try, if we wanted to make progress sooner (i.e. without replicating the setup elsewhere): * Add... [15:33:40] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817#10683595 (10Scott_French) Once I am able to confirm @Chuckonwumelu's SSH public key out of band, I believe this should be ready to go. [15:36:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74474 and previous config saved to /var/cache/conftool/dbconfig/20250327-153612-root.json [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:42] (03CR) 10Cathal Mooney: sre.puppet.sync-netbox-hiera: add data::pdus to exports (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [15:43:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [15:43:27] (03CR) 10TrainBranchBot: [C:03+2] 
"Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131359 (https://phabricator.wikimedia.org/T388610) (owner: 10Ebernhardson) [15:43:31] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [15:44:15] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [15:44:18] codfw elasticsearch is complaining because we banned a row to start a platform upgrade, the cluster isn't happy with the reduced node count so patch above moves all our traffic to eqiad [15:44:23] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [15:44:24] (03Merged) 10jenkins-bot: Move cirrus traffic to eqiad for platform upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131359 (https://phabricator.wikimedia.org/T388610) (owner: 10Ebernhardson) [15:44:37] jouncebot: nowandnext [15:44:37] For the next 0 hour(s) and 15 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1500) [15:44:37] In 0 hour(s) and 15 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1600) [15:44:39] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1131359|Move cirrus traffic to eqiad for platform upgrade (T388610)]] [15:44:43] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw [15:44:44] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [15:44:47] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban 
(exit_code=0) Unbanning all hosts in search_codfw [15:44:54] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10683637 (10MoritzMuehlenhoff) [15:48:01] (03PS1) 10Elukey: benthos: bump webrequest_live instance's thread to 48 [puppet] - 10https://gerrit.wikimedia.org/r/1131747 (https://phabricator.wikimedia.org/T390029) [15:49:37] !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:1131359|Move cirrus traffic to eqiad for platform upgrade (T388610)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:49:46] (03PS1) 10Hnowlan: trafficserver: gateway-check ignore list, roll pcs/mobileapps to more wikis [puppet] - 10https://gerrit.wikimedia.org/r/1131748 (https://phabricator.wikimedia.org/T388140) [15:49:59] (03CR) 10Brouberol: [C:03+2] airflow-main.dse-k8e-eqiad: increase memory resource quota (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131727 (owner: 10Brouberol) [15:50:07] (03CR) 10Brouberol: [C:03+2] airflow: reduce memory allotted to each task pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131726 (owner: 10Brouberol) [15:50:11] !log ebernhardson@deploy1003 ebernhardson: Continuing with sync [15:50:15] (03CR) 10Andrew Bogott: [C:03+1] Add Chuck key [puppet] - 10https://gerrit.wikimedia.org/r/1131633 (owner: 10Chuckonwumelu) [15:51:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74475 and previous config saved to /var/cache/conftool/dbconfig/20250327-155117-root.json [15:52:25] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:52:54] (03CR) 10David Caro: [C:03+1] Add Chuck key [puppet] 
- 10https://gerrit.wikimedia.org/r/1131633 (owner: 10Chuckonwumelu) [15:52:58] 10ops-eqiad, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390064#10683658 (10phaultfinder) [15:54:03] 10ops-ulsfo, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389884#10683664 (10phaultfinder) [15:54:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1131730 (owner: 10Slyngshede) [15:54:32] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131705 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [15:56:59] (03PS1) 10Federico Ceratto: clone.py: Add basic functional test [cookbooks] - 10https://gerrit.wikimedia.org/r/1131750 (https://phabricator.wikimedia.org/T388383) [15:57:28] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131359|Move cirrus traffic to eqiad for platform upgrade (T388610)]] (duration: 12m 49s) [15:57:37] (03CR) 10Andrew Bogott: [C:03+2] Add Chuck key [puppet] - 10https://gerrit.wikimedia.org/r/1131633 (owner: 10Chuckonwumelu) [15:58:29] (03PS1) 10BCornwall: upgrade cp2027 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131752 (https://phabricator.wikimedia.org/T378737) [15:58:30] (03PS1) 10BCornwall: upgrade cp2028 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131753 (https://phabricator.wikimedia.org/T378737) [15:58:31] (03PS1) 10BCornwall: upgrade cp2029 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131754 (https://phabricator.wikimedia.org/T378737) [15:58:33] (03PS1) 10BCornwall: upgrade cp2030 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131755 (https://phabricator.wikimedia.org/T378737) [15:58:34] (03PS1) 10BCornwall: upgrade cp2031 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131756 (https://phabricator.wikimedia.org/T378737) [15:58:36] (03PS1) 10BCornwall: upgrade cp2032 to Varnish 7.1 [puppet] - 
10https://gerrit.wikimedia.org/r/1131757 (https://phabricator.wikimedia.org/T378737) [15:58:37] (03PS1) 10BCornwall: upgrade cp2033 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131758 (https://phabricator.wikimedia.org/T378737) [15:58:41] (03PS1) 10BCornwall: upgrade cp2034 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131759 (https://phabricator.wikimedia.org/T378737) [15:58:45] (03PS1) 10BCornwall: upgrade cp2035 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131760 (https://phabricator.wikimedia.org/T378737) [15:58:49] (03PS1) 10BCornwall: upgrade cp2036 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131761 (https://phabricator.wikimedia.org/T378737) [15:58:57] (03PS1) 10BCornwall: upgrade cp2037 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131762 (https://phabricator.wikimedia.org/T378737) [15:59:01] (03PS1) 10BCornwall: upgrade cp2038 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131763 (https://phabricator.wikimedia.org/T378737) [15:59:05] (03PS1) 10BCornwall: upgrade cp2039 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131764 (https://phabricator.wikimedia.org/T378737) [15:59:09] (03PS1) 10BCornwall: upgrade cp2040 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131765 (https://phabricator.wikimedia.org/T378737) [15:59:13] (03PS1) 10BCornwall: upgrade cp2041 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131766 (https://phabricator.wikimedia.org/T378737) [15:59:17] (03PS1) 10BCornwall: upgrade cp2042 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131767 (https://phabricator.wikimedia.org/T378737) [15:59:39] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2055-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:59:40] !log `sudo systemctl restart burrow-jumbo-eqiad.service prometheus-burrow-exporter@jumbo-eqiad.service` on kafkamon1003 - attempt to check if the new kafka lag for benthos-webrequest_live is due to burrow - T390029 [15:59:57] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [16:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:05] T390029: Migrate Benthos `webrequest_sampled_live` to feed from HAProxy data - https://phabricator.wikimedia.org/T390029 [16:00:18] jhathaway and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1600). [16:00:18] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:25] o/ [16:00:26] o/ [16:01:08] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817#10683710 (10MoritzMuehlenhoff) Access got enabled via https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131633 (but that lacked a Bug: T header) [16:01:16] dancy: I was going to ask real quick, did you want to use ls -a to avoid blowing away a php-* directory with only hidden files? or do we figure that's just not a situation we'll ever be in [16:01:59] hmm.. I didn't consider hidden files. Lemme look into that. 
[16:02:48] sure [16:04:39] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2055-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:05:08] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:05:49] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:06:00] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:06:13] (03PS2) 10BPirkle: REST: Enable REST Sandbox on an initial set of production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131384 (https://phabricator.wikimedia.org/T389407) [16:06:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2181 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74476 and previous config saved to /var/cache/conftool/dbconfig/20250327-160623-root.json [16:06:37] !log jgiannelos@deploy1003 Started deploy [restbase/deploy@3349f02]: Deprecate unused RB codebase [16:07:27] rzl: OK. I thought about the scenario where a php-* directory consists only of a "cache" subdirectory and some number of hidden files.... that would still be an unusable mediawiki tree.. worthy of (effective) exclusion from the rsync. [16:07:46] rzl: Thanks for making me think about it more carefully. [16:08:00] makes sense! happy merging this as-is, then? 
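The reasoning about hidden files can be illustrated with a short sketch. It is not the code from the scap-master-sync patch: `looks_orphaned` and the `ignorable` parameter are hypothetical names, and the rule shown (a directory holding nothing visible besides a "cache" subdirectory counts as unusable, whatever hidden files it contains) is only one reading of the discussion.

```python
import os
import tempfile


def visible_entries(path):
    """Return directory entries that a plain `ls` (without -a) would show."""
    return [name for name in os.listdir(path) if not name.startswith(".")]


def looks_orphaned(path, ignorable=("cache",)):
    """True if the directory has no visible entries other than the ignorable ones.

    Hidden files are deliberately not counted: a tree with only a cache
    directory and dotfiles is still treated as unusable.
    """
    return all(name in ignorable for name in visible_entries(path))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tree:
        os.mkdir(os.path.join(tree, "cache"))
        open(os.path.join(tree, ".hidden"), "w").close()
        print(looks_orphaned(tree))
        open(os.path.join(tree, "index.php"), "w").close()
        print(looks_orphaned(tree))
```

Counting hidden entries as well (the `ls -a` variant raised in the question) would only require dropping the `startswith(".")` filter.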
[16:08:05] Yes please [16:08:19] (03CR) 10RLazarus: [C:03+2] scap-master-sync: Clean up orphaned php-* directories after rsync [puppet] - 10https://gerrit.wikimedia.org/r/1130723 (https://phabricator.wikimedia.org/T389830) (owner: 10Ahmon Dancy) [16:08:20] jouncebot: nowandnext [16:08:20] For the next 0 hour(s) and 51 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1600) [16:08:20] In 0 hour(s) and 51 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1700) [16:08:20] In 0 hour(s) and 51 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1700) [16:08:39] (03CR) 10BPirkle: REST: Enable REST Sandbox on an initial set of production wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131384 (https://phabricator.wikimedia.org/T389407) (owner: 10BPirkle) [16:08:53] Amir1: we'll be done with the puppet window shortly if you want it [16:09:06] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2331 to codfw - jhancock@cumin2002" [16:09:08] thanks! 
merging the patches will take a while [16:09:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2331 to codfw - jhancock@cumin2002" [16:09:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:09:29] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2331 [16:09:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2331 [16:10:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:10:58] (03PS1) 10Ladsgroup: LoginAttemptCounter: Add extra hardening for long period too [extensions/ConfirmEdit] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1131768 [16:11:07] (03PS1) 10Ladsgroup: LoginAttemptCounter: Add extra hardening for long period too [extensions/ConfirmEdit] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131769 [16:11:12] (03CR) 10Ladsgroup: [C:03+2] LoginAttemptCounter: Add extra hardening for long period too [extensions/ConfirmEdit] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131769 (owner: 10Ladsgroup) [16:11:15] (03CR) 10Ladsgroup: [C:03+2] LoginAttemptCounter: Add extra hardening for long period too [extensions/ConfirmEdit] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1131768 (owner: 10Ladsgroup) [16:14:21] !log Run `foreachwikiindblist growthexperiments CommunityConfiguration:setVersionData CommunityUpdates 2.0.2` # T387737 [16:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:26] T387737: Community updates module: allow to set a white background for images in 
dark mode - https://phabricator.wikimedia.org/T387737 [16:14:33] dancy: ready to test at deploy1003 [16:15:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:16:00] !log dancy@deploy1003 Started scap sync-world: Testing T389830 [16:16:04] T389830: Scap seemingly doesn't fully/properly clean backup deployment server - https://phabricator.wikimedia.org/T389830 [16:16:28] (03PS1) 10Bking: WIP: Begin CODFW transition to OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1131773 (https://phabricator.wikimedia.org/T388610) [16:16:47] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10683754 (10MoritzMuehlenhoff) [16:17:32] rzl: Can you run puppet on deploy2002.codfw.wmnet please? [16:17:41] (03CR) 10Filippo Giunchedi: sre.puppet.sync-netbox-hiera: add data::pdus to exports (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [16:17:48] !log dancy@deploy1003 sync-world aborted: Testing T389830 (duration: 01m 48s) [16:18:03] oh sure, sorry, I didn't think about that [16:18:05] (03CR) 10Fabfur: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1131752 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:18:18] (03CR) 10Alexandros Kosiaris: [C:03+1] trafficserver: gateway-check ignore list, roll pcs/mobileapps to more wikis [puppet] - 10https://gerrit.wikimedia.org/r/1131748 (https://phabricator.wikimedia.org/T388140) (owner: 10Hnowlan) [16:18:31] (03CR) 10Filippo Giunchedi: [C:03+1] sre.puppet.sync-netbox-hiera: add data::pdus to exports [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [16:19:39] FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2055-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:22:09] !log Run `foreachwikiindblist growthexperiments CommunityConfiguration:migrateConfig CommunityUpdates 2.0.3`# T387737 [16:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:14] T387737: Community updates module: allow to set a white background for images in dark mode - https://phabricator.wikimedia.org/T387737 [16:22:23] dancy: done [16:22:33] rzl: thx. 
retesting [16:23:00] !log dancy@deploy1003 Started scap sync-world: Testing T389830 [16:23:04] T389830: Scap seemingly doesn't fully/properly clean backup deployment server - https://phabricator.wikimedia.org/T389830 [16:23:15] (03PS1) 10Dreamy Jazz: CaptchaPreAuthenticationProvider: Run triggerCaptcha for login attempts [extensions/ConfirmEdit] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131774 (https://phabricator.wikimedia.org/T379178) [16:23:21] (03Merged) 10jenkins-bot: LoginAttemptCounter: Add extra hardening for long period too [extensions/ConfirmEdit] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131769 (owner: 10Ladsgroup) [16:23:22] (03Merged) 10jenkins-bot: LoginAttemptCounter: Add extra hardening for long period too [extensions/ConfirmEdit] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1131768 (owner: 10Ladsgroup) [16:24:38] !log dancy@deploy1003 dancy: Testing T389830 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:24:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2055-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:24:45] !log dancy@deploy1003 Sync cancelled. [16:24:49] rzl: Tested good. Thanks for deploying! [16:25:25] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T390077#10683768 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:25:34] dancy: cool, thanks! 
[16:25:42] Amir1: puppet window's finished, all yours [16:25:43] (03PS2) 10Hnowlan: trafficserver: gateway-check ignore list, roll pcs/mobileapps to more wikis [puppet] - 10https://gerrit.wikimedia.org/r/1131748 (https://phabricator.wikimedia.org/T388140) [16:25:46] !log jgiannelos@deploy1003 Finished deploy [restbase/deploy@3349f02]: Deprecate unused RB codebase (duration: 19m 23s) [16:25:51] thanks! [16:26:25] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps vms: remove definition for Buster and older distros [puppet] - 10https://gerrit.wikimedia.org/r/1128948 (owner: 10Andrew Bogott) [16:27:01] (03CR) 10Andrew Bogott: [C:03+2] toolforge_redirector: increase monitoring timeout [puppet] - 10https://gerrit.wikimedia.org/r/1123797 (https://phabricator.wikimedia.org/T385908) (owner: 10Andrew Bogott) [16:27:01] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [16:27:48] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1131768|LoginAttemptCounter: Add extra hardening for long period too]], [[gerrit:1131769|LoginAttemptCounter: Add extra hardening for long period too]] [16:28:31] (03PS1) 10Bking: elastic: enable performance governor for selected hosts [puppet] - 10https://gerrit.wikimedia.org/r/1131775 (https://phabricator.wikimedia.org/T388610) [16:28:55] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131775 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [16:30:25] (03CR) 10Fabfur: [C:03+1] upgrade cp2028 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131753 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:30:38] (03CR) 10Scott French: [C:03+1] trafficserver: gateway-check ignore list, roll pcs/mobileapps to more wikis [puppet] - 10https://gerrit.wikimedia.org/r/1131748 (https://phabricator.wikimedia.org/T388140) (owner: 10Hnowlan) [16:30:46] (03CR) 10Fabfur: [C:03+1] upgrade cp2029 to Varnish 7.1 [puppet] - 
10https://gerrit.wikimedia.org/r/1131754 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:30:54] (03CR) 10Fabfur: [C:03+1] upgrade cp2030 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131755 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:31:02] (03CR) 10Fabfur: [C:03+1] upgrade cp2031 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131756 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:31:08] (03CR) 10Fabfur: [C:03+1] upgrade cp2032 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131757 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:31:15] (03CR) 10Fabfur: [C:03+1] upgrade cp2033 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131758 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:31:22] (03CR) 10Fabfur: [C:03+1] upgrade cp2034 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131759 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:31:27] (03CR) 10Fabfur: [C:03+1] upgrade cp2035 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131760 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:31:34] (03CR) 10Fabfur: [C:03+1] upgrade cp2036 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131761 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:31:47] (03CR) 10Fabfur: [C:03+1] upgrade cp2041 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131766 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:31:53] (03CR) 10Fabfur: [C:03+1] upgrade cp2042 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131767 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:32:04] (03CR) 10Fabfur: [C:03+1] upgrade cp2040 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131765 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:32:11] (03CR) 10Fabfur: [C:03+1] upgrade cp2039 
to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131764 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:32:18] (03CR) 10Fabfur: [C:03+1] upgrade cp2038 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131763 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:32:24] (03CR) 10Fabfur: [C:03+1] upgrade cp2037 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131762 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:34:06] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1131768|LoginAttemptCounter: Add extra hardening for long period too]], [[gerrit:1131769|LoginAttemptCounter: Add extra hardening for long period too]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:34:39] RESOLVED: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2055-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:36:23] The current set works, that I can check [16:36:31] I can't make 100 attempts [16:37:04] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [16:37:15] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1123.eqiad.wmnet with OS bullseye [16:37:22] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10683850 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1123.eqiad.wmnet with OS bullseye [16:39:13] slacker [16:39:22] !log marostegui@cumin1002 END (ERROR) - Cookbook sre.mysql.clone (exit_code=97) of db1211.eqiad.wmnet onto db1255.eqiad.wmnet [16:39:44] 💔 [16:39:53] 10ops-eqiad, 
06SRE, 06DC-Ops: hw troubleshooting: disk in slot 10 for an-worker1194 - https://phabricator.wikimedia.org/T389065#10683862 (10Jclark-ctr) 05Open→03Resolved Replaced failed drive [16:40:42] (03CR) 10Marostegui: [C:03+1] "This works fine, I've tested it, there are some things we need to tackle, I will create a task (as they have nothing to do with this)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131711 (https://phabricator.wikimedia.org/T388383) (owner: 10Federico Ceratto) [16:41:01] 10ops-ulsfo, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389884#10683867 (10phaultfinder) [16:42:54] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1124.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:42:57] (03CR) 10BCornwall: [C:03+2] upgrade cp2028 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131753 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:43:08] (03CR) 10BCornwall: [C:03+2] upgrade cp2027 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131752 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:43:53] (03PS1) 10Joal: Update webrequest_sampled_live druid deep-storage [puppet] - 10https://gerrit.wikimedia.org/r/1131778 (https://phabricator.wikimedia.org/T385198) [16:43:58] !log dcaro@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20 days, 0:00:00 on cloudcephosd1029.eqiad.wmnet with reason: Installing a disk for testing [16:44:09] 10ops-eqiad, 06SRE, 10Ceph, 10Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10683889 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2db5921e-9fd3-4768-9222-3e33bdad8325) set by dcaro@cumin1002 for 20 days, 0:00... 
[16:44:21] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131768|LoginAttemptCounter: Add extra hardening for long period too]], [[gerrit:1131769|LoginAttemptCounter: Add extra hardening for long period too]] (duration: 16m 33s) [16:44:48] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2027.codfw.wmnet} and A:cp [16:44:50] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2028.codfw.wmnet} and A:cp [16:44:53] (03PS2) 10Clément Goubert: api-gateway: Lower ratelimit log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131777 (https://phabricator.wikimedia.org/T390215) [16:45:12] 10ops-eqiad, 06SRE, 10Ceph, 10Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10683907 (10dcaro) [16:45:13] (03PS1) 10Dreamy Jazz: ConfirmEditTriggersCaptcha: Support showing a CAPTCHA on Special:UserLogin [extensions/IPReputation] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131779 (https://phabricator.wikimedia.org/T390197) [16:45:56] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10683910 (10Jhancock.wm) done. [16:46:12] 10ops-eqiad, 06SRE, 10Ceph, 10Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10683911 (10dcaro) @VRiley-WMF hi! cloudcephosd1029 is ready to get one disk replaced by the dell new one :) It's turned off and all, so just turn it on w... 
[16:47:01] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [16:47:10] (03CR) 10Hnowlan: [C:03+1] "I reluctantly say lgtm - ratelimit outputs essentially nothing (including errors) with non-debug logging unfortunately, but now that it's " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131777 (https://phabricator.wikimedia.org/T390215) (owner: 10Clément Goubert) [16:47:22] (03CR) 10Ladsgroup: [C:03+2] ConfirmEditTriggersCaptcha: Support showing a CAPTCHA on Special:UserLogin [extensions/IPReputation] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131779 (https://phabricator.wikimedia.org/T390197) (owner: 10Dreamy Jazz) [16:47:26] (03CR) 10Ladsgroup: [C:03+2] CaptchaPreAuthenticationProvider: Run triggerCaptcha for login attempts [extensions/ConfirmEdit] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131774 (https://phabricator.wikimedia.org/T379178) (owner: 10Dreamy Jazz) [16:47:56] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Lower ratelimit log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131777 (https://phabricator.wikimedia.org/T390215) (owner: 10Clément Goubert) [16:48:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1124.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:49:01] (03Merged) 10jenkins-bot: CaptchaPreAuthenticationProvider: Run triggerCaptcha for login attempts [extensions/ConfirmEdit] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131774 (https://phabricator.wikimedia.org/T379178) (owner: 10Dreamy Jazz) [16:49:29] (03Merged) 10jenkins-bot: api-gateway: Lower ratelimit log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131777 (https://phabricator.wikimedia.org/T390215) (owner: 10Clément Goubert) [16:49:35] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2028.codfw.wmnet} and A:cp [16:49:46] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74479 and previous config saved to /var/cache/conftool/dbconfig/20250327-164945-root.json [16:50:05] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2027.codfw.wmnet} and A:cp [16:51:12] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [16:51:28] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [16:51:34] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [16:51:57] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [16:52:33] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817#10683929 (10Scott_French) Ah, thanks @MoritzMuehlenhoff! @Chuckonwumelu and @Andrew - Could you please confirm this is no longer needed? If so, I will abandon https... [16:55:17] (03PS1) 10Brouberol: Fix: prevent the stubprovider from locking indefinitely [dumps] - 10https://gerrit.wikimedia.org/r/1131781 (https://phabricator.wikimedia.org/T390059) [16:55:41] (03CR) 10CI reject: [V:04-1] Fix: prevent the stubprovider from locking indefinitely [dumps] - 10https://gerrit.wikimedia.org/r/1131781 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [16:56:09] (03PS1) 10Kosta Harlan: GlobalContributions: Add API query module [extensions/CheckUser] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131782 (https://phabricator.wikimedia.org/T390156) [16:56:10] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10683937 (10RobH) Part and engineer dispatch set for April 2nd. 
They'll send over the tech info 48 hours in advance so I can file the ticket 24 hours in advance for security/escort as needed. [16:56:22] (03PS2) 10Brouberol: Fix: prevent the stubprovider from locking indefinitely [dumps] - 10https://gerrit.wikimedia.org/r/1131781 (https://phabricator.wikimedia.org/T390059) [16:56:23] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10683940 (10MoritzMuehlenhoff) [16:56:24] (03Merged) 10jenkins-bot: ConfirmEditTriggersCaptcha: Support showing a CAPTCHA on Special:UserLogin [extensions/IPReputation] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131779 (https://phabricator.wikimedia.org/T390197) (owner: 10Dreamy Jazz) [16:56:36] Amir1: can we backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/1131782 as well please? [16:56:49] (03CR) 10CI reject: [V:04-1] Fix: prevent the stubprovider from locking indefinitely [dumps] - 10https://gerrit.wikimedia.org/r/1131781 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [16:57:08] Noting that it has new i18n messages, so will take a lot longer [16:57:29] (03PS3) 10Brouberol: Fix: prevent the stubprovider from locking indefinitely [dumps] - 10https://gerrit.wikimedia.org/r/1131781 (https://phabricator.wikimedia.org/T390059) [16:58:56] (but certainly doable) [16:59:17] I prefer to deploy one thing at a time [16:59:28] in case we need to revert, etc. [17:00:05] bd808: Time to do the Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1700). 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1700) [17:02:52] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1131774|CaptchaPreAuthenticationProvider: Run triggerCaptcha for login attempts (T379178)]], [[gerrit:1131779|ConfirmEditTriggersCaptcha: Support showing a CAPTCHA on Special:UserLogin (T390197)]] [17:02:58] T379178: Support captcha as part of login flow (not just on "badlogin") - https://phabricator.wikimedia.org/T379178 [17:02:58] T390197: IPReputation: Support showing a CAPTCHA on Special:UserLogin - https://phabricator.wikimedia.org/T390197 [17:04:13] Amir1: sure [17:04:28] it's less urgent for sure. Could wait til the next window [17:04:36] FIRING: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [17:04:43] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817#10683990 (10Scott_French) Alright, I guess this is no longer needed, then. I've confirmed that the public key used in https://gerrit.wikimedia.org/r/c/operations/pu... 
[17:04:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74480 and previous config saved to /var/cache/conftool/dbconfig/20250327-170451-root.json [17:05:47] (03PS1) 10BPirkle: REST: fix extra routes module localization strings [core] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131784 (https://phabricator.wikimedia.org/T385855) [17:05:53] Just after I redeployed it to log less, srsly api-gateway [17:06:25] hnowlan: I think we should not alert for 5xx on the rate_limit_cluster [17:06:37] (03PS1) 10BryanDavis: developer-portal: Bump container to 2025-03-26-234702-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131785 (https://phabricator.wikimedia.org/T388051) [17:06:46] (03CR) 10DCausse: "should be ready" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131335 (https://phabricator.wikimedia.org/T389971) (owner: 10DCausse) [17:07:57] !log ladsgroup@deploy1003 dreamyjazz, ladsgroup: Backport for [[gerrit:1131774|CaptchaPreAuthenticationProvider: Run triggerCaptcha for login attempts (T379178)]], [[gerrit:1131779|ConfirmEditTriggersCaptcha: Support showing a CAPTCHA on Special:UserLogin (T390197)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:08:03] T379178: Support captcha as part of login flow (not just on "badlogin") - https://phabricator.wikimedia.org/T379178 [17:08:03] T390197: IPReputation: Support showing a CAPTCHA on Special:UserLogin - https://phabricator.wikimedia.org/T390197 [17:08:27] claime: hmm, yep (although Elevated is distinct from High now at least) [17:08:47] Dreamy_Jazz: it's live in mwdebug [17:08:55] kostajh: ^ wanna test it [17:09:11] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2025-03-26-234702-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131785 (https://phabricator.wikimedia.org/T388051) (owner: 10BryanDavis) [17:09:36] RESOLVED: 
GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [17:09:37] (03PS1) 10Alexandros Kosiaris: cxserver: Bump all sextant modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131786 [17:09:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [core] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131784 (https://phabricator.wikimedia.org/T385855) (owner: 10BPirkle) [17:09:56] If kostajh isn't around to test it, I can always see if I can test it. [17:10:18] I can try to test it [17:10:24] Please try as well Dreamy_Jazz [17:10:34] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2025-03-26-234702-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131785 (https://phabricator.wikimedia.org/T388051) (owner: 10BryanDavis) [17:11:12] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131778 (https://phabricator.wikimedia.org/T385198) (owner: 10Joal) [17:11:13] Sure. I'll try using a VPN which should appear in IPoid data. [17:11:17] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [17:11:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10684038 (10phaultfinder) [17:11:52] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [17:12:51] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [17:13:22] please ping me when you're done! 
[17:13:27] I tested with a VPN that is known to IPoid. With MWDebug + VPN, the captcha shows [17:13:29] Amir1: it works fine [17:13:32] let's do it [17:13:36] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [17:13:40] okie dokie [17:13:43] !log ladsgroup@deploy1003 dreamyjazz, ladsgroup: Continuing with sync [17:13:45] Yeah, getting a captcha too [17:13:46] with MWDebug and no VPN, no captcha [17:13:56] and confirmed that I can log in if I solve the captcha [17:14:09] our very hard captcha [17:14:12] Same for me (except I didn't check the last step) [17:14:32] Though I can do that when I re-log into my volunteer account :D [17:15:23] hey, better than last year's captcha [17:16:25] Yeah, logging in worked for me too (with 2FA steps included in that check) [17:17:08] (03CR) 10BCornwall: [C:03+2] upgrade cp2030 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131755 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:17:09] (03CR) 10BCornwall: [C:03+2] upgrade cp2029 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131754 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:17:16] (03PS2) 10Ottomata: eventgate-analytics-external - upgrade to node20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124144 (https://phabricator.wikimedia.org/T383814) [17:17:55] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:18:51] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2029.codfw.wmnet} and A:cp [17:18:53] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2030.codfw.wmnet} and A:cp [17:18:54] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:19:05] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:19:48] (03CR) 10Ssingh: [C:03+1] "Looks 
good!" [puppet] - 10https://gerrit.wikimedia.org/r/1131705 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [17:19:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74481 and previous config saved to /var/cache/conftool/dbconfig/20250327-171956-root.json [17:20:03] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:20:10] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:20:14] (03CR) 10Alexandros Kosiaris: [C:03+1] Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 (owner: 10Giuseppe Lavagetto) [17:20:46] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:20:56] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131774|CaptchaPreAuthenticationProvider: Run triggerCaptcha for login attempts (T379178)]], [[gerrit:1131779|ConfirmEditTriggersCaptcha: Support showing a CAPTCHA on Special:UserLogin (T390197)]] (duration: 18m 03s) [17:21:01] T379178: Support captcha as part of login flow (not just on "badlogin") - https://phabricator.wikimedia.org/T379178 [17:21:01] T390197: IPReputation: Support showing a CAPTCHA on Special:UserLogin - https://phabricator.wikimedia.org/T390197 [17:21:19] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10684063 (10Marostegui) >>! In T388684#10683910, @Jhancock.wm wrote: > done. Thank you! I don't see the disk rebuilding, it seems to be marked as bad and the RAID is still degra... 
[17:22:26] (03PS2) 10Alexandros Kosiaris: cxserver: Bump all sextant modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131786 [17:22:30] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817#10684064 (10Scott_French) [17:22:41] (03CR) 10Alexandros Kosiaris: [C:03+1] cxserver: Bump all sextant modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131786 (owner: 10Alexandros Kosiaris) [17:22:53] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to wmcs-roots for chuckonwumelu - https://phabricator.wikimedia.org/T389817#10684065 (10Scott_French) 05In progress→03Resolved a:05Scott_French→03Andrew Since this was taken care of out of band by WMCS SRE, I'm marking this as Re... [17:23:21] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2030.codfw.wmnet} and A:cp [17:23:40] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131775 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:24:27] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2029.codfw.wmnet} and A:cp [17:24:49] !log btullis@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:24:50] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts an-worker1202.eqiad.wmnet [17:24:57] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10684076 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by btullis@cumin1002 for hosts: `an-worker1202.eqiad.w... 
[17:25:33] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [17:25:36] FIRING: GatewayBackendErrorsElevated: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsElevated [17:26:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-ulsfo and Hurricane Electric (2001:504:30::ba00:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [17:28:15] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:29:04] (03CR) 10Ottomata: [C:03+2] eventgate-analytics-external - upgrade to node20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124144 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [17:30:27] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1131306 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [17:30:34] (03Merged) 10jenkins-bot: eventgate-analytics-external - upgrade to node20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124144 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [17:32:25] (03CR) 10Hnowlan: [C:03+2] Allow dot in revision title [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130728 (https://phabricator.wikimedia.org/T389628) (owner: 10Arlolra) [17:33:50] (03Merged) 10jenkins-bot: Allow dot in revision title [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130728 (https://phabricator.wikimedia.org/T389628) (owner: 10Arlolra) [17:34:37] !log upgrading eventgate-analytics-external to node20 - T383814 [17:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:42] T383814: Upgrade 
eventgate-wikimedia to node20 - https://phabricator.wikimedia.org/T383814 [17:34:50] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [17:35:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74482 and previous config saved to /var/cache/conftool/dbconfig/20250327-173502-root.json [17:35:27] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [17:36:00] jouncebot: nowandnext [17:36:01] For the next 0 hour(s) and 23 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1700) [17:36:01] For the next 0 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T1700) [17:36:01] In 2 hour(s) and 23 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T2000) [17:36:26] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [17:37:18] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [17:38:36] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [17:39:14] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [17:40:01] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10684119 (10Marostegui) Even with megacli no luck: ` root@db2243:/home/marostegui# megacli -PDOnline -PhysDrv [252:4] -aALL Adapter: 0: Failed to change PD state at EnclId-252... 
[17:40:41] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:40:48] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:41:26] (03PS4) 10CDobbins: geo-maps: update South America DCs [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) [17:42:23] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [17:42:37] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [17:42:38] (03CR) 10Papaul: [C:03+1] elastic: enable performance governor for selected hosts [puppet] - 10https://gerrit.wikimedia.org/r/1131775 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:43:13] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:43:20] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:44:44] (03CR) 10Ssingh: geo-maps: update South America DCs [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins) [17:45:11] (03CR) 10CDobbins: [C:03+2] "Because" [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins) [17:46:00] !log cdobbins@dns1004 START - running authdns-update [17:46:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10684131 (10BTullis) The above cookbook failure is because I looked away and failed to run the DNS cookbook in time. Ev... 
[17:46:58] (03CR) 10Bking: [C:03+2] elastic: enable performance governor for selected hosts [puppet] - 10https://gerrit.wikimedia.org/r/1131775 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:48:37] !log cdobbins@dns1004 END - running authdns-update [17:48:53] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10684156 (10BTullis) [17:49:03] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1123.eqiad.wmnet with OS bullseye [17:49:09] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10684157 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1123.eqiad.wmnet with OS bullseye executed with errors: - ela... [17:50:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74485 and previous config saved to /var/cache/conftool/dbconfig/20250327-175008-root.json [17:50:41] 10ops-ulsfo, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389884#10684165 (10RobH) a:03ayounsi #netops, Was this expected and/or resolved? I don't see any down interfaces now. (shows up xe-0/1/4 UP UP) but error count is something else. If we can resolve, please do so ot... 
[17:50:48] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389884#10684167 (10RobH) [17:52:25] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:53:52] (03CR) 10Arlolra: "@hnowlan@wikimedia.org Thanks" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1130728 (https://phabricator.wikimedia.org/T389628) (owner: 10Arlolra) [17:53:55] (03CR) 10Btullis: [C:03+1] Fix: prevent the stubprovider from locking indefinitely [dumps] - 10https://gerrit.wikimedia.org/r/1131781 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [17:54:08] (03CR) 10BCornwall: [C:03+2] upgrade cp2031 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131756 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:54:11] (03CR) 10BCornwall: [C:03+2] upgrade cp2032 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131757 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:55:29] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2031.codfw.wmnet} and A:cp [17:55:30] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2032.codfw.wmnet} and A:cp [17:58:06] (03PS1) 10Ottomata: eventgate-main - upgrade to NodeJS 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131792 (https://phabricator.wikimedia.org/T383814) [18:00:17] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2032.codfw.wmnet} and A:cp [18:00:32] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on 
P{cp2031.codfw.wmnet} and A:cp [18:05:21] (03PS1) 10Ebernhardson: Accept data path as a cli arg [software/elasticsearch/madvise] - 10https://gerrit.wikimedia.org/r/1131796 (https://phabricator.wikimedia.org/T390118) [18:11:06] (03PS2) 10Ebernhardson: Accept data path as a cli arg [software/elasticsearch/madvise] - 10https://gerrit.wikimedia.org/r/1131796 (https://phabricator.wikimedia.org/T390118) [18:11:49] (03PS1) 10HMonroy: Enable Codex and Multiblocks in Polish wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131797 (https://phabricator.wikimedia.org/T377121) [18:19:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10684298 (10phaultfinder) [18:19:54] (03CR) 10Ebernhardson: [C:03+1] cirrus: use only deployment-cirrussearch*.deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131335 (https://phabricator.wikimedia.org/T389971) (owner: 10DCausse) [18:21:56] (03CR) 10MusikAnimal: Enable Codex and Multiblocks in Polish wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131797 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [18:22:10] !log dancy@deploy1003 Installing scap version "4.146.0" for 2 host(s) [18:27:35] (03PS3) 10Dwisehaupt: community_crm: Add trusted_host_patterns to settings template [puppet] - 10https://gerrit.wikimedia.org/r/1123711 (https://phabricator.wikimedia.org/T386267) [18:27:51] (03PS5) 10Dwisehaupt: community_civicrm: dovecot module for serving up local mail [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) [18:28:05] (03PS4) 10Dwisehaupt: community_civicrm: Add profile::community_civicrm::mail [puppet] - 10https://gerrit.wikimedia.org/r/1128565 (https://phabricator.wikimedia.org/T383715) [18:28:12] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131799 [18:29:44] !log dancy@deploy1003 Installation of scap version 
"4.146.0" completed for 2 hosts [18:30:34] (03CR) 10BCornwall: [C:03+2] upgrade cp2033 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131758 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:30:36] (03CR) 10BCornwall: [C:03+2] upgrade cp2034 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131759 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:33:02] jouncebot: now [18:33:02] No deployments scheduled for the next 1 hour(s) and 26 minute(s) [18:34:34] (03PS2) 10HMonroy: Enable Codex and Multiblocks in Polish wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131797 (https://phabricator.wikimedia.org/T377121) [18:35:15] (03CR) 10HMonroy: Enable Codex and Multiblocks in Polish wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131797 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [18:37:44] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2033.codfw.wmnet} and A:cp [18:37:48] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2034.codfw.wmnet} and A:cp [18:41:20] Hey all - would like to do a quick incident-related deployment via scap backport, two gerrits. Just getting them merged to .22 rn. [18:42:15] OK w/ me. 
[18:42:24] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2034.codfw.wmnet} and A:cp [18:43:45] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2033.codfw.wmnet} and A:cp [18:45:59] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1125.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:47:14] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1125.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:51:21] (CR) Cathal Mooney: "Nice work. Few comments under the relevant sections'." [puppet] - https://gerrit.wikimedia.org/r/1131320 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi) [18:53:27] (PS1) Jclark-ctr: Bug: T387356 [puppet] - https://gerrit.wikimedia.org/r/1131801 [18:53:50] (CR) CI reject: [V:-1] Bug: T387356 [puppet] - https://gerrit.wikimedia.org/r/1131801 (owner: Jclark-ctr) [18:54:15] (CR) MusikAnimal: Enable Codex and Multiblocks in Polish wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1131797 (https://phabricator.wikimedia.org/T377121) (owner: HMonroy) [18:55:39] (CR) Jclark-ctr: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1131801 (owner: Jclark-ctr) [18:56:10] (PS3) HMonroy: Enable Codex and Multiblocks in Polish wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1131797 (https://phabricator.wikimedia.org/T377121) [18:56:50] (CR) HMonroy: Enable Codex and Multiblocks in Polish wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1131797 (https://phabricator.wikimedia.org/T377121) (owner: HMonroy) [18:59:09] (PS2) Jclark-ctr: Bug: T387356 [puppet] - https://gerrit.wikimedia.org/r/1131801 (https://phabricator.wikimedia.org/T387356) [18:59:32] (CR) CI reject: [V:-1] Bug: T387356 [puppet] - 
https://gerrit.wikimedia.org/r/1131801 (https://phabricator.wikimedia.org/T387356) (owner: Jclark-ctr) [18:59:55] (CR) Jclark-ctr: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1131801 (https://phabricator.wikimedia.org/T387356) (owner: Jclark-ctr) [19:00:00] (CR) Jclark-ctr: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1131801 (https://phabricator.wikimedia.org/T387356) (owner: Jclark-ctr) [19:00:38] (PS1) SBassett: LoginNotify#sendNotice: Add IP and UA to log message [extensions/LoginNotify] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1131803 (https://phabricator.wikimedia.org/T390141) [19:03:54] (PS3) Bking: Add elastic112[345] to site.pp [puppet] - https://gerrit.wikimedia.org/r/1131801 (https://phabricator.wikimedia.org/T387356) (owner: Jclark-ctr) [19:05:22] (CR) Aaron Schulz: [C:+1] REST: fix extra routes module localization strings [core] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1131784 (https://phabricator.wikimedia.org/T385855) (owner: BPirkle) [19:05:26] (CR) Bking: [C:+2] Add elastic112[345] to site.pp [puppet] - https://gerrit.wikimedia.org/r/1131801 (https://phabricator.wikimedia.org/T387356) (owner: Jclark-ctr) [19:06:38] !log cmooney@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 202053 [19:07:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 202053 [19:08:33] (PS1) SBassett: CaptchaPreAuthenticationProvider: Check if a login attempt would trigger a captcha in testForAuthentication [extensions/ConfirmEdit] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1131804 (https://phabricator.wikimedia.org/T379178) [19:09:02] sbassett: Lemme know when you're done please. I have another scap release to deploy. [19:10:52] dancy: I’d say go if you’re ready. I’m still waiting for these to test on .22, taking a while... [19:10:59] ok. 
[19:11:06] !log dancy@deploy1003 Installing scap version "4.147.0" for 2 host(s) [19:12:20] (CR) BCornwall: [C:+2] upgrade cp2035 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1131760 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [19:12:22] (CR) BCornwall: [C:+2] upgrade cp2036 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1131761 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [19:13:37] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2035.codfw.wmnet} and A:cp [19:13:38] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2036.codfw.wmnet} and A:cp [19:17:25] (CR) Brouberol: [C:+1] Update webrequest_sampled_live druid deep-storage [puppet] - https://gerrit.wikimedia.org/r/1131778 (https://phabricator.wikimedia.org/T385198) (owner: Joal) [19:18:54] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2035.codfw.wmnet} and A:cp [19:18:57] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:19:33] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2036.codfw.wmnet} and A:cp [19:19:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqord and cr1-eqiad (208.80.154.196) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqord:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr1-eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:19:57] ops-eqiad, SRE, DC-Ops: OutboundInterfaceErrors - https://phabricator.wikimedia.org/T389992#10684514 (phaultfinder) [19:20:17] !log dancy@deploy1003 Installation of scap version "4.147.0" completed for 2 hosts [19:21:40] sbassett: Mind if deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ConfirmEdit/+/1131804 ? [19:21:45] (when ready) [19:22:05] CI just finished. [19:22:08] dancy: fine with me, but there’s 2 others that also need to go... [19:22:21] As a batch, or individually? [19:22:34] https://gerrit.wikimedia.org/r/1131803 and https://gerrit.wikimedia.org/r/1131782 [19:23:02] I don’t think there’s any depends-on worries with these [19:23:26] so if you want to backport all 3, that’d be great. Or I can when you’re done. [19:23:43] I'll just do one so I can test spiderpig. [19:23:48] ok [19:24:18] you want me to +2? [19:24:20] (CR) TrainBranchBot: [C:+2] "Approved by dancy@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1131804 (https://phabricator.wikimedia.org/T379178) (owner: SBassett) [19:24:23] I got it. 
[19:24:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:25:22] (Merged) jenkins-bot: CaptchaPreAuthenticationProvider: Check if a login attempt would trigger a captcha in testForAuthentication [extensions/ConfirmEdit] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1131804 (https://phabricator.wikimedia.org/T379178) (owner: SBassett) [19:25:38] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1131804|CaptchaPreAuthenticationProvider: Check if a login attempt would trigger a captcha in testForAuthentication (T379178)]] [19:25:42] T379178: Support captcha as part of login flow (not just on "badlogin") - https://phabricator.wikimedia.org/T379178 [19:27:24] (PS1) Andrew Bogott: openstack neutron: allow all users to see agents [puppet] - https://gerrit.wikimedia.org/r/1131806 [19:28:22] (CR) Andrew Bogott: [C:+2] openstack neutron: allow all users to see agents [puppet] - https://gerrit.wikimedia.org/r/1131806 (owner: Andrew Bogott) [19:30:40] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:32:02] !log dancy@deploy1003 sbassett, dancy: Backport for [[gerrit:1131804|CaptchaPreAuthenticationProvider: Check if a login attempt would trigger a captcha in testForAuthentication (T379178)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:32:07] T379178: Support captcha as part of login flow (not just on "badlogin") - https://phabricator.wikimedia.org/T379178 [19:34:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10684575 (phaultfinder) [19:35:09] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2303 to codfw - jhancock@cumin2002" 
[19:35:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2303 to codfw - jhancock@cumin2002" [19:35:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:36:19] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2303 [19:36:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2303 [19:36:47] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2304 [19:36:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2304 [19:36:59] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2305 [19:37:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2305 [19:37:21] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2306 [19:37:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2306 [19:37:44] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2307 [19:37:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2307 [19:38:00] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2308 [19:38:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2308 [19:38:51] SRE, WMF-General-or-Unknown, Performance Issue, Wikimedia-production-error: Fatal exception of type 
"Wikimedia\RequestTimeout\EmergencyTimeoutException" errors - https://phabricator.wikimedia.org/T389734#10684598 (brennen) It seems like the right folks are already aware of this, but noting that l... [19:39:45] sbassett: Ready for testing. [19:40:55] thanks [19:41:36] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:42:18] looking [19:43:17] it works as intended now [19:43:20] dancy: lgtm [19:43:24] ok proceeding [19:43:28] !log dancy@deploy1003 sbassett, dancy: Continuing with sync [19:44:07] (CR) Andrew Bogott: [C:+2] "> and might fix pint complaining about a missing neutron metric" [puppet] - https://gerrit.wikimedia.org/r/1131806 (owner: Andrew Bogott) [19:45:18] (CR) Jforrester: "<3" [deployment-charts] - https://gerrit.wikimedia.org/r/1131792 (https://phabricator.wikimedia.org/T383814) (owner: Ottomata) [19:45:22] (CR) BCornwall: [C:+2] upgrade cp2037 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1131762 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [19:45:24] (CR) BCornwall: [C:+2] upgrade cp2038 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1131763 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [19:45:48] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2309 to codfw - jhancock@cumin2002" [19:45:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2309 to codfw - jhancock@cumin2002" [19:45:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:46:42] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2037.codfw.wmnet} and A:cp [19:46:43] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on 
P{cp2038.codfw.wmnet} and A:cp [19:46:56] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2309 [19:47:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2309 [19:47:11] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2311 [19:47:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2311 [19:47:26] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2312 [19:47:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2312 [19:47:40] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2313 [19:47:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2313 [19:48:32] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10684631 (wiki_willy) [19:50:36] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131804|CaptchaPreAuthenticationProvider: Check if a login attempt would trigger a captcha in testForAuthentication (T379178)]] (duration: 24m 58s) [19:50:41] T379178: Support captcha as part of login flow (not just on "badlogin") - https://phabricator.wikimedia.org/T379178 [19:51:03] dancy: am I good to backport the other two now? [19:51:06] sbassett: Back to you. Thanks for letting me test. [19:51:13] thx! 
[19:51:20] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2038.codfw.wmnet} and A:cp [19:51:52] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2037.codfw.wmnet} and A:cp [19:52:26] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:52:55] (CR) TrainBranchBot: [C:+2] "Approved by sbassett@deploy1003 using scap backport" [extensions/CheckUser] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1131782 (https://phabricator.wikimedia.org/T390156) (owner: Kosta Harlan) [19:52:55] (CR) TrainBranchBot: [C:+2] "Approved by sbassett@deploy1003 using scap backport" [extensions/LoginNotify] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1131803 (https://phabricator.wikimedia.org/T390141) (owner: SBassett) [19:53:19] (CR) Cwhite: [C:+2] prometheus: add recording rules for use by histogram_quantile [puppet] - https://gerrit.wikimedia.org/r/1130689 (https://phabricator.wikimedia.org/T383963) (owner: Cwhite) [19:56:25] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2322 to codfw - jhancock@cumin2002" [19:56:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2322 to codfw - jhancock@cumin2002" [19:56:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:56:38] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2314 [19:56:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2314 [19:56:50] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host 
wikikube-worker2315 [19:57:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2315 [19:57:07] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2316 [19:57:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2316 [19:58:07] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2317 [19:58:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2317 [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and thcipriani: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T2000). [20:00:05] albertoleoncio and bpirkle: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] I'm here [20:00:14] Hi! [20:00:16] dibs! 
I can deploy :) [20:00:33] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2318 [20:00:47] o/ (last minute addition) [20:00:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2318 [20:00:58] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2319 [20:01:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2319 [20:01:10] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2320 [20:01:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2320 [20:01:40] the security deploys are probably still ongoing [20:03:42] ah, ok [20:04:29] tgr_: to save me from backscroll: who should I ping to let me know when we're clear? 
[20:05:35] sbassett: Tyler is willing to handle your remaining backports too [20:06:07] thcipriani: ^ [20:06:14] thanks both <3 [20:06:24] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:06:58] (Merged) jenkins-bot: GlobalContributions: Add API query module [extensions/CheckUser] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1131782 (https://phabricator.wikimedia.org/T390156) (owner: Kosta Harlan) [20:07:01] (Merged) jenkins-bot: LoginNotify#sendNotice: Add IP and UA to log message [extensions/LoginNotify] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1131803 (https://phabricator.wikimedia.org/T390141) (owner: SBassett) [20:07:18] !log sbassett@deploy1003 Started scap sync-world: Backport for [[gerrit:1131782|GlobalContributions: Add API query module (T390156)]], [[gerrit:1131803|LoginNotify#sendNotice: Add IP and UA to log message (T390141)]] [20:07:23] T390156: GlobalContributions: Expose data via the API - https://phabricator.wikimedia.org/T390156 [20:07:23] T390141: Login from new device: Ensure login-fail-new has IP and user agent data - https://phabricator.wikimedia.org/T390141 [20:08:55] hey all, yeah, just waiting on sync-world to finish for the last two security backports [20:09:01] then we should be done for a while [20:09:29] sbassett: thanks for update, ping me when you're done and I'll do UTC late backport <3 [20:09:42] albertoleoncio: bpirkle tgr_ ^ FYI [20:09:51] ack [20:10:00] yep [20:10:24] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2323 to codfw - jhancock@cumin2002" [20:10:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2323 to codfw - jhancock@cumin2002" [20:10:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:10:50] !log 
jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2321 [20:10:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2321 [20:11:01] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2322 [20:11:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2322 [20:11:13] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2323 [20:11:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2323 [20:11:25] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2324 [20:11:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2324 [20:11:37] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2325 [20:11:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2325 [20:11:49] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2326 [20:11:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2326 [20:12:02] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2328 [20:12:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2328 [20:12:13] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2329 [20:12:22] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2329 [20:12:25] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:12:27] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2330 [20:12:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2330 [20:13:17] (CR) LorenMora: [C:+1] Enable Vector 2022 for Russian Wikimedia and arbcom_ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1131484 (https://phabricator.wikimedia.org/T390112) (owner: Jdlrobson) [20:13:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2291.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:13:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2303.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:14:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2291.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:14:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2304.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:14:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2305.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:14:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2306.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:14:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2307.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:15:17] 
!log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2308.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:15:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2309.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:16:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-ulsfo and Hurricane Electric (2001:504:30::ba00:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [20:16:52] (CR) LorenMora: [C:+1] Deploy dark mode and Vector 2022 to German Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1131483 (https://phabricator.wikimedia.org/T387155) (owner: Jdlrobson) [20:17:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2310.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:18:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2310.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:19:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2311.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:20:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2312.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:21:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2313.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:21:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2314.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:22:19] (CR) BCornwall: [C:+2] upgrade cp2039 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1131764 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [20:22:20] (CR) BCornwall: [C:+2] upgrade cp2040 to Varnish 7.1 
[puppet] - https://gerrit.wikimedia.org/r/1131765 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall) [20:23:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2315.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:24:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2316.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:24:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2303.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:24:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2304.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:24:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2317.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:25:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2306.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:25:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2305.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:25:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2307.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:25:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2318.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:25:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2308.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:26:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2309.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:26:05] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2319.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:26:39] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2040.codfw.wmnet} and A:cp [20:26:40] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2039.codfw.wmnet} and A:cp [20:26:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2320.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:27:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2321.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:28:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2322.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:28:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2323.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:29:28] thcipriani: huh, just got a big error dump from scap during sync-testservers-k8s. looks like it failed on one of the hosts, but scap didn’t exit yet [20:29:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2311.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:30:26] sbassett: can I see the big error dump? 
[20:30:31] a bunch of helmfile failures when scap got to the 4th test server: "Deployment of mw-misc-main failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1" [20:30:49] ok, scap just failed for me [20:30:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2312.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:30:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2324.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:31:05] sbassett: anything else in the log? [20:31:09] thcipriani: do you want the full error dump? I can slack or paste it somewhere. [20:31:14] yes please [20:31:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2325.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:31:40] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2040.codfw.wmnet} and A:cp [20:31:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2313.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:31:56] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2039.codfw.wmnet} and A:cp [20:32:01] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [20:32:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2326.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:32:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2314.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:32:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2327.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART 
[20:33:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2327.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:33:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2315.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:33:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2328.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:34:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2329.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:34:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2316.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:34:59] !log scap backport failed, investigating [20:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2330.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:35:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2317.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:36:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2318.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:36:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2291.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:36:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2291.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:36:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2303.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:36:55] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2319.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:37:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2320.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:38:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2321.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:38:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2322.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:39:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2323.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:39:53] !log sbassett@deploy1003 Started scap sync-world: Backport for [[gerrit:1131803|LoginNotify#sendNotice: Add IP and UA to log message (T390141)]], [[gerrit:1131782|GlobalContributions: Add API query module (T390156)]] [20:39:59] T390141: Login from new device: Ensure login-fail-new has IP and user agent data - https://phabricator.wikimedia.org/T390141 [20:39:59] T390156: GlobalContributions: Expose data via the API - https://phabricator.wikimedia.org/T390156 [20:40:12] Hey all - re-running the failed scap backport again… [20:41:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2324.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:42:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2325.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:42:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2303.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:42:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision 
(exit_code=0) for host wikikube-worker2326.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:44:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2328.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:45:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2329.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:45:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2330.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:48:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2304.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:49:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2305.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:49:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2306.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:49:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2307.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:49:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2308.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:50:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2309.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:52:01] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [20:54:25] Ok, scap backport just failed again for me, so I’m holding off for now. 
[20:55:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2307.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:55:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2308.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:55:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2304.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:55:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2309.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:55:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2306.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:55:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2305.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:57:51] So... thcipriani? [20:57:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2310.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:58:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2311.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:58:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2312.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:58:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2313.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:58:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2314.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:58:45] albertoleoncio: bpirkle tgr_ we are experiencing some strangeness with MediaWiki images. Unclear how long that will take: a bit. 
I'm still here if you're still here, but if you need to drop, let me know. [20:58:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2315.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:59:09] I'm still here [20:59:28] I'll be here too [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T2100) [21:00:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2310.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:00:18] same [21:04:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2314.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:04:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2315.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:04:45] !log dancy@deploy1003 Started scap sync-world: Testing deployments [21:05:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2311.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:05:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2312.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:05:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2313.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:06:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2316.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:06:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2317.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:07:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2318.mgmt.codfw.wmnet with 
chassis set policy FORCE_RESTART [21:07:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2319.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:07:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2320.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:07:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2321.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:12:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2316.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:12:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2317.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:13:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2321.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:13:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2318.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:13:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2319.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:13:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2320.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:15:12] (03CR) 10BCornwall: [C:03+2] upgrade cp2041 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131766 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [21:15:12] (03CR) 10BCornwall: [C:03+2] upgrade cp2042 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131767 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [21:16:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision 
for host wikikube-worker2322.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:16:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2323.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:16:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2324.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:16:56] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2041.codfw.wmnet} and A:cp [21:16:57] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp2042.codfw.wmnet} and A:cp [21:16:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2325.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:17:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2326.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:17:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2327.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:17:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2327.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:19:11] !log dancy@deploy1003 Started scap sync-world: Testing deployments [21:21:18] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2042.codfw.wmnet} and A:cp [21:21:29] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp2041.codfw.wmnet} and A:cp [21:21:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2322.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:22:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
wikikube-worker2323.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:22:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2324.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:22:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2325.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:22:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2326.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:24:29] (03PS1) 10BCornwall: upgrade cp1100 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131824 (https://phabricator.wikimedia.org/T378737) [21:24:29] (03PS1) 10BCornwall: upgrade cp1101 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131825 (https://phabricator.wikimedia.org/T378737) [21:24:30] (03PS1) 10BCornwall: upgrade cp1102 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131826 (https://phabricator.wikimedia.org/T378737) [21:24:32] (03PS1) 10BCornwall: upgrade cp1103 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131827 (https://phabricator.wikimedia.org/T378737) [21:24:33] (03PS1) 10BCornwall: upgrade cp1104 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131828 (https://phabricator.wikimedia.org/T378737) [21:24:35] (03PS1) 10BCornwall: upgrade cp1105 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131829 (https://phabricator.wikimedia.org/T378737) [21:24:37] (03PS1) 10BCornwall: upgrade cp1106 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131830 (https://phabricator.wikimedia.org/T378737) [21:24:41] (03PS1) 10BCornwall: upgrade cp1107 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131831 (https://phabricator.wikimedia.org/T378737) [21:24:45] (03PS1) 10BCornwall: upgrade cp1108 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131832 
(https://phabricator.wikimedia.org/T378737) [21:24:49] (03PS1) 10BCornwall: upgrade cp1109 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131833 (https://phabricator.wikimedia.org/T378737) [21:24:53] (03PS1) 10BCornwall: upgrade cp1110 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131834 (https://phabricator.wikimedia.org/T378737) [21:24:57] (03PS1) 10BCornwall: upgrade cp1111 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131835 (https://phabricator.wikimedia.org/T378737) [21:25:01] (03PS1) 10BCornwall: upgrade cp1112 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131836 (https://phabricator.wikimedia.org/T378737) [21:25:05] (03PS1) 10BCornwall: upgrade cp1113 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131837 (https://phabricator.wikimedia.org/T378737) [21:25:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2330.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:25:09] (03PS1) 10BCornwall: upgrade cp1114 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131838 (https://phabricator.wikimedia.org/T378737) [21:25:13] (03PS1) 10BCornwall: upgrade cp1115 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131839 (https://phabricator.wikimedia.org/T378737) [21:25:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2329.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:25:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2328.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:25:48] albertoleoncio: bpirkle tgr_ sbassett trying a full image rebuild for weirdness, FYI. This takes..a bit (30 mins? that's a guess. Two versions take 50 mins and we're down to 1. Usually we do this once a week :)) [21:26:10] Ok [21:26:19] ok [21:26:38] ok, thanks for the update. worst case for me is to do the backport tomorrow morning. 
which shouldn’t be too controversial for the two backports in question. [21:27:22] sbassett: your stuff is going out with this since it was already staged [21:28:09] oh, ok, great! [21:30:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2330.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:30:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2329.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:30:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2328.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:38:31] sbassett: FYI, your thing is on test servers (and a tad beyond, too, since this is a full sync) [21:43:35] !log dancy@deploy1003 Finished scap sync-world: Testing deployments (duration: 24m 24s) [21:44:36] ^ sbassett all sync'd! [21:44:48] hooray, thanks! [21:45:15] albertoleoncio: bpirkle tgr_ quick break and then back at it! [21:45:21] ok [21:49:35] albertoleoncio: still around? I can get your out [21:50:23] *your change out [21:50:28] hi [21:50:47] albertoleoncio: hi, let's get your namespace change deployed [21:51:06] ok =D [21:51:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by thcipriani@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130101 (https://phabricator.wikimedia.org/T389609) (owner: 10Albertoleoncio) [21:51:45] 10ops-eqiad, 06SRE, 06DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240 (10RobH) 03NEW p:05Triage→03High [21:52:11] 10ops-eqiad, 06SRE, 06DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10684939 (10RobH) [21:52:15] (03Merged) 10jenkins-bot: Add "PRE" (for NS_TEMPLATE) and "CAT" (for NS_CATEGORY) as namespace aliases in ptwiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130101 (https://phabricator.wikimedia.org/T389609) (owner: 10Albertoleoncio) [21:52:29] !log thcipriani@deploy1003 Started scap sync-world: Backport for [[gerrit:1130101|Add "PRE" (for NS_TEMPLATE) and "CAT" (for NS_CATEGORY) as namespace aliases in ptwiki. (T389609)]] [21:52:34] T389609: Add "PRE" and "CAT" as namespace aliases in ptwiki - https://phabricator.wikimedia.org/T389609 [21:52:54] thcipriani: can I add one more, if there's time? [21:54:01] kostajh: technically we're an hour over the window, but we're just getting started :) [21:55:37] thcipriani: ok. Maybe I'll ask for a deploy tomorrow then. [21:56:13] kostajh: if it's important enough to warrant Friday deploy, let's get it done now, I think [21:56:45] as long as I don't hit strange problems with this (we had a weird image error that delayed) [21:57:04] Its working already on debug mode [21:57:08] !log thcipriani@deploy1003 thcipriani, albertoleoncio: Backport for [[gerrit:1130101|Add "PRE" (for NS_TEMPLATE) and "CAT" (for NS_CATEGORY) as namespace aliases in ptwiki. (T389609)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:57:34] albertoleoncio: perfect, yeah, our deploy tool was running checks before telling us to test, sounds like it's good to go live, continuing [21:57:43] !log thcipriani@deploy1003 thcipriani, albertoleoncio: Continuing with sync [21:57:44] yep, good to go [21:58:31] thcipriani: let me see if I can find someone who will shepherd this patch along with you, as I want to sign off soon. 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ConfirmEdit/+/1131841 [21:58:57] jouncebot: nownanext [21:59:01] jouncebot: nownandext [21:59:06] jouncebot: nownandnext [21:59:11] jouncebot: nowandnext [21:59:11] For the next 0 hour(s) and 0 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250327T2100) [21:59:11] In 8 hour(s) and 0 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250328T0600) [21:59:15] gj Reedy [21:59:19] :D [22:00:46] Its on live now [22:02:15] 10ops-eqiad, 06SRE, 06DC-Ops: example sub-task for relocation out of D6 - https://phabricator.wikimedia.org/T390243 (10RobH) 03NEW [22:02:37] 10ops-eqiad, 06SRE, 06DC-Ops: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10685007 (10RobH) [22:02:42] thcipriani: looks like one of Reedy or tgr_ will be around for the patch. [22:05:30] 10ops-eqiad, 06SRE, 06DC-Ops: example sub-task for relocation out of D6 - https://phabricator.wikimedia.org/T390243#10685027 (10RobH) @Jclark-ctr, If you think this example task outlines everything we need, I can create them for each sre team listed in the parent task and assign to the sre team managers for... [22:05:45] kostajh: okie doke [22:05:53] !log thcipriani@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130101|Add "PRE" (for NS_TEMPLATE) and "CAT" (for NS_CATEGORY) as namespace aliases in ptwiki. 
(T389609)]] (duration: 13m 23s) [22:05:58] T389609: Add "PRE" and "CAT" as namespace aliases in ptwiki - https://phabricator.wikimedia.org/T389609 [22:06:06] ^ albertoleoncio should be live, I'll check namespacedupes [22:07:18] For example, this works already: https://pt.wikipedia.org/wiki/PRE:ER [22:09:12] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1125.eqiad.wmnet with OS bullseye [22:09:23] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10685037 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1125.eqiad.wmnet with OS bullseye [22:09:27] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1125.eqiad.wmnet with OS bullseye [22:09:36] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10685038 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1125.eqiad.wmnet with OS bullseye execu... 
[22:09:49] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1123.eqiad.wmnet with OS bullseye [22:10:06] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10685039 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1123.eqiad.wmnet with OS bullseye [22:10:49] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1124.eqiad.wmnet with OS bullseye [22:11:00] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10685040 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1124.eqiad.wmnet with OS bullseye [22:12:02] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1125.eqiad.wmnet with OS bullseye [22:12:16] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10685046 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1125.eqiad.wmnet with OS bullseye [22:13:37] albertoleoncio: looks like 33 things to fix, running fix now [22:13:51] Where? [22:14:54] https://phabricator.wikimedia.org/P74490 [22:15:24] Oh, I see... [22:15:58] I'll run namespacedups with --fix [22:17:30] (03CR) 10Thcipriani: [C:03+2] "BACKPORT" [core] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131784 (https://phabricator.wikimedia.org/T385855) (owner: 10BPirkle) [22:17:43] bpirkle: ^ getting yours merging [22:17:56] thanks [22:18:00] tgr_: want me to sling yours out in the meantime? [22:19:45] albertoleoncio: output of running with --fix: https://phabricator.wikimedia.org/P74490#299127 should be all good, thanks for making the change! [22:20:47] thcipriani: Ok! 
Thanks for the deploy :-) [22:21:04] (03PS1) 10Reedy: CaptchaPreAuthenticationProvider: Improve log messages [extensions/ConfirmEdit] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131844 (https://phabricator.wikimedia.org/T379178) [22:21:04] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1124.eqiad.wmnet with reason: host reimage [22:21:16] thcipriani: sure, thanks! [22:22:02] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1125.eqiad.wmnet with reason: host reimage [22:22:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by thcipriani@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131482 (https://phabricator.wikimedia.org/T378402) (owner: 10Gergő Tisza) [22:23:20] (03Merged) 10jenkins-bot: Disable new WebAuthn credentials creation on local domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131482 (https://phabricator.wikimedia.org/T378402) (owner: 10Gergő Tisza) [22:23:39] !log thcipriani@deploy1003 Started scap sync-world: Backport for [[gerrit:1131482|Disable new WebAuthn credentials creation on local domains (T378402 T354701)]] [22:23:44] T378402: Disallow setting up new WebAuthn passkeys on Wikimedia wikis - https://phabricator.wikimedia.org/T378402 [22:23:45] T354701: Enable migration of WebAuthn credentials to loginwiki - https://phabricator.wikimedia.org/T354701 [22:24:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1124.eqiad.wmnet with reason: host reimage [22:25:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10685068 (10phaultfinder) [22:27:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1125.eqiad.wmnet with reason: host reimage [22:28:03] !log thcipriani@deploy1003 tgr, thcipriani: Backport for [[gerrit:1131482|Disable new WebAuthn credentials creation on local domains 
(T378402 T354701)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:28:35] ^ tgr_ up on mwdebug check please (if possible) [22:30:18] (03Merged) 10jenkins-bot: REST: fix extra routes module localization strings [core] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131784 (https://phabricator.wikimedia.org/T385855) (owner: 10BPirkle) [22:32:44] thcipriani: works [22:32:59] thanks for the marathon deploy session! [22:33:32] tgr_: thanks for checking, I live to serve :) [22:33:36] going live [22:33:44] !log thcipriani@deploy1003 tgr, thcipriani: Continuing with sync [22:38:55] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:40:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:40:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1124.eqiad.wmnet with OS bullseye [22:40:30] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10685114 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1124.eqiad.wmnet with OS bullseye compl... [22:40:42] !log thcipriani@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131482|Disable new WebAuthn credentials creation on local domains (T378402 T354701)]] (duration: 17m 03s) [22:40:47] T378402: Disallow setting up new WebAuthn passkeys on Wikimedia wikis - https://phabricator.wikimedia.org/T378402 [22:40:48] T354701: Enable migration of WebAuthn credentials to loginwiki - https://phabricator.wikimedia.org/T354701 [22:40:50] ^ tgr_ all done! [22:41:08] ok bpirkle [22:41:13] ready at long last? 
[22:41:57] yep [22:42:00] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:42:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [22:42:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1125.eqiad.wmnet with OS bullseye [22:43:01] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10685136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1125.eqiad.wmnet with OS bullseye compl... [22:43:41] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10685138 (10Jclark-ctr) a:05bking→03Jclark-ctr [22:43:46] !log thcipriani@deploy1003 Started scap sync-world: Backport for [[gerrit:1131784|REST: fix extra routes module localization strings (T385855)]] [22:43:50] T385855: REST: Make OpenAPI spec info strings translatable - https://phabricator.wikimedia.org/T385855 [22:43:52] bpirkle: alright, goin' ^ [22:43:53] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10685141 (10Jclark-ctr) [22:47:41] thcipriani: I need to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1131797 when you're done with that [22:47:54] (03CR) 10Jdlrobson: Web features should not be ambiguously configured (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130771 (https://phabricator.wikimedia.org/T388445) (owner: 10Jdlrobson) [22:48:00] TimStarling: k, this is the last thing I've got, I'll ping you 
when I'm clear [22:48:08] thanks [22:48:23] !log thcipriani@deploy1003 bpirkle, thcipriani: Backport for [[gerrit:1131784|REST: fix extra routes module localization strings (T385855)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:48:36] ^ bpirkle live on test servers, check please [22:49:39] Looks as expected [22:50:36] thanks for checking, going forward [22:50:41] !log thcipriani@deploy1003 bpirkle, thcipriani: Continuing with sync [22:57:52] !log thcipriani@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131784|REST: fix extra routes module localization strings (T385855)]] (duration: 14m 06s) [22:57:57] T385855: REST: Make OpenAPI spec info strings translatable - https://phabricator.wikimedia.org/T385855 [22:58:16] thcipriani: thanks for sticking with the extra-long deploy window! [22:58:28] bpirkle: no problem, you're live [22:58:34] TimStarling: you're clear! [22:59:26] thcipriani: did you do the ConfirmEdit patch as well? [22:59:40] thanks -- we are piloting multiblocks after working on it for over a year so we will be cracking out the virtual champagne [22:59:53] kostajh: oh! I thought your message meant that you'd handle that later. [22:59:56] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ConfirmEdit/+/1131844 [23:00:06] I can include that one with my deployment if you like [23:00:19] Ah. I thought Reedy or tgr_ could help verify it [23:00:29] TimStarling: that'd be great, if that works for you, kostajh? 
[23:00:33] TimStarling: it would be nice yeah [23:00:35] Thanks [23:01:25] * thcipriani has been deploying for 3 hours...somehow :) [23:01:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131797 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [23:01:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131844 (https://phabricator.wikimedia.org/T379178) (owner: 10Reedy) [23:02:33] (03Merged) 10jenkins-bot: Enable Codex and Multiblocks in Polish wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131797 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [23:02:45] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10685181 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1123.eqiad.wmnet with OS bullseye execu... 
[23:03:15] thanks TimStarling ; thanks and sorry kostajh [23:03:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.399s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:08:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.039s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:10:30] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1123.eqiad.wmnet with OS bullseye [23:10:42] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10685189 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1123.eqiad.wmnet with OS bullseye [23:13:54] (03Merged) 10jenkins-bot: CaptchaPreAuthenticationProvider: Improve log messages [extensions/ConfirmEdit] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131844 (https://phabricator.wikimedia.org/T379178) (owner: 10Reedy) [23:14:08] !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1131797|Enable Codex and Multiblocks in Polish wiki (T377121)]], [[gerrit:1131844|CaptchaPreAuthenticationProvider: Improve log messages (T379178)]] [23:14:14] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [23:14:15] T379178: Support 
captcha as part of login flow (not just on "badlogin") - https://phabricator.wikimedia.org/T379178 [23:19:06] !log tstarling@deploy1003 tstarling, hmonroy, reedy: Backport for [[gerrit:1131797|Enable Codex and Multiblocks in Polish wiki (T377121)]], [[gerrit:1131844|CaptchaPreAuthenticationProvider: Improve log messages (T379178)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:19:27] FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:20:07] kostajh: live on the test servers, but I don't need your confirmation given how harmless your patch looks [23:20:50] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1123.eqiad.wmnet with reason: host reimage [23:23:54] !log tstarling@deploy1003 tstarling, hmonroy, reedy: Continuing with sync [23:24:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1123.eqiad.wmnet with reason: host reimage [23:24:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:30:51] !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131797|Enable Codex and Multiblocks in Polish wiki (T377121)]], [[gerrit:1131844|CaptchaPreAuthenticationProvider: Improve log messages (T379178)]] (duration: 16m 42s) [23:30:57] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [23:30:57] T379178: Support captcha as part of login flow (not just on "badlogin") - https://phabricator.wikimedia.org/T379178 [23:40:11] !log 
jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:40:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:40:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1123.eqiad.wmnet with OS bullseye [23:40:40] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10685267 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host elastic1123.eqiad.wmnet with OS bullseye compl... [23:40:56] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10685268 (10Jclark-ctr) [23:41:42] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install elastic112[345] - https://phabricator.wikimedia.org/T387356#10685269 (10Jclark-ctr) 05Open→03Resolved [23:43:44] !log Doing some load testing on mwdebug1001 [23:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:56] !log zabe@mwmaint1002:~$ cat group1.dblist | xargs -I{} bash -c "echo {}; mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php {} --delete /home/zabe/afl_text_table_deletedump/{} --sleep 0.3" # T381599 [23:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:00] T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599 [23:48:48] 06SRE-OnFire, 10Incident Tooling: Incident documents are less visible with Corto - https://phabricator.wikimedia.org/T390126#10685275 (10Eevans) Making use of the existing 
folder seems most reasonable to me, but I'll give others the chance to weigh in in case there were good reasons for doing it this way (I su... [23:48:52] (03PS1) 10Ladsgroup: maintenance: Add support for unlocking accounts in LockUser.php [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131861 [23:48:55] (03CR) 10Ladsgroup: [C:03+2] maintenance: Add support for unlocking accounts in LockUser.php [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131861 (owner: 10Ladsgroup) [23:51:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131861 (owner: 10Ladsgroup) [23:57:06] (03PS2) 10Krinkle: CentralAuth: lower timeout for token validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128913 (owner: 10Giuseppe Lavagetto) [23:57:11] (03CR) 10Krinkle: [C:03+1] CentralAuth: lower timeout for token validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128913 (owner: 10Giuseppe Lavagetto) [23:57:52] (03PS3) 10Krinkle: CentralAuth: lower timeout for token validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128913 (owner: 10Giuseppe Lavagetto) [23:58:01] (03Merged) 10jenkins-bot: maintenance: Add support for unlocking accounts in LockUser.php [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1131861 (owner: 10Ladsgroup) [23:58:15] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1131861|maintenance: Add support for unlocking accounts in LockUser.php]]