[08:31:48] hnowlan: o/ I should have checked the datacenter.py config, didn't really think about the new LVS VIP, but very happy that Kartotherian can now work active/passive :D
[08:39:16] filed https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1129184 to remove the exception :)
[09:26:26] <_joe_> elukey: that's great!
[10:31:13] elukey: <3
[11:43:04] sre folks, I have a maintenance script in an extension to run, but I'm coming up blank from wikitech and not having a lot of luck digging through the mwscript-k8s codebase. How do I specify an extension script?
[11:43:28] would it be `Extension:script`, `Extension:script.php`, `extensions/Extension/maintenance/script.php`...?
[11:47:16] zip: all of these should work iirc
[11:47:28] let me check something
[11:49:18] <_joe_> zip: how would you do it using mwscript (non-k8s)?
[11:49:57] Full path should definitely work
[11:54:26] iirc the script_name argument is passed directly as first argument to MwScript.php, so it should support the same invocations as the old mwscript
[12:03:45] 👋 I'd like to backport a fix for a train blocker, wanted to check in with you folks in case some infra work is still happening
[12:12:27] _joe_: full path, I looked this up earlier
[12:12:50] And then I just pass --wiki as though it was an argument to my actual script?
[12:13:05] I'm doing the dry run first so I guess I could see if that runs
[12:15:51] jnuche: the MW datacentre switchover will be happening at 1400
[12:17:52] hnowlan: that would be 2 hours from now right? https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1400
[12:18:03] I should have enough time, but I'll keep an eye out for that, thx
[12:19:34] yep, just giving a heads-up
[12:21:05] <_joe_> zip: same call then
[13:19:23] zip: did you get it sorted?
[13:19:41] mwscript extensions/ExtensionName/maintenance/script.php --wiki=foowiki
[13:19:43] should work fine in the current
[13:20:01] it'll whinge about not using run.php, but ignore that
[13:27:43] Reedy: just freshly back from lunch, about to give it a go
[13:30:04] well, first I need to figure out what machine to do it on
[13:32:40] reminder about the MediaWiki switchover at 1400 if you're running a script that will be running long
[13:32:45] It *will* be stopped
[13:33:48] zip: deployment host, but yeah be mindful of what hnowlan just said
[13:34:29] it should (a) be quick and (b) not mutate anything
[13:34:45] is this the twice-yearly datacentre switch?
[13:34:50] yes
[13:35:25] neat
[13:35:32] possibly I should wait for the backport window to be over
[13:35:55] the switchover is happening immediately after the backport window unfortunately
[13:36:06] hm. well... it's a dry-run
[13:36:11] `mwscript-k8s --comment="T380911 dry run" -f -- extensions/Flow/maintenance/FlowMoveBoardsToSubpages.php --dry-run --wiki=office`
[13:36:11] T380911: Run Flow migration script at *Phase 2b* wikis - https://phabricator.wikimedia.org/T380911
[13:36:32] reckon we'll be okay, or should I wait?
[13:36:47] for now I just wanted to paste some output into the ticket, and run for real later today
[13:39:12] I think you should wait.
[13:39:30] sure
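As an aside, the invocation pattern zip and Reedy settle on above is simply the extension-relative path passed straight through. A minimal sketch, with hypothetical extension, script, wiki, and task names as placeholders, and assuming the maintenance script itself supports --dry-run; only `--comment`, `-f`, and the `--` separator are taken from the real command quoted in the log:

```bash
# Dry run first via the k8s wrapper (placeholders: Txxxxxx, SomeExtension,
# someScript.php, somewiki).
mwscript-k8s --comment="Txxxxxx dry run" -f -- \
    extensions/SomeExtension/maintenance/someScript.php --dry-run --wiki=somewiki

# Old-style mwscript on a deployment host accepts the same path form:
mwscript extensions/SomeExtension/maintenance/someScript.php --wiki=somewiki
```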
[13:49:17] it looks like puppet runs on the deployment servers are failing after the changes from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1094531
[13:49:45] hnowlan: ^
[13:49:49] that is not ideal timing
[13:50:05] We can revert
[13:50:35] I think that's the only option cc akosiaris
[13:51:07] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129269
[13:52:00] * akosiaris fixing
[13:52:26] I'll revert and remerge later on, no time to fix actually
[13:52:26] it's just missing a 'group' resource
[13:52:37] but yeah please revert for now
[13:52:40] yeah and the patch is ready
[13:52:55] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129239
[13:53:00] but pointless to discuss that one right now
[13:53:28] I've got some reimages to do; I'm guessing at this point I should wait until after the switchover?
[13:53:33] yes please
[13:53:48] I'll bypass CI I think for the revert
[13:53:54] okay, thanks
[13:54:14] hnowlan: Coordination will be here or on -operations?
[13:57:07] switchover starting in a few minutes
[13:57:14] revert merged, double checking by issuing a puppet agent -t on deploy hosts
[13:57:18] good luck team!
[13:57:29] akosiaris: thanks
[13:58:18] sorry for failing to fix it in time
[14:00:23] hnowlan: what's your tmux
[14:01:10] claime: tmux -S /tmp/hnowlan-switchover attach -t switchover
[14:01:28] ty
[14:03:01] attach-session -r everyone please
[14:03:06] yes please ^
[14:03:10] anyway, here we go
[14:04:05] what's your leader key
[14:04:16] I can't detach
[14:04:20] <_joe_> no need to warm up caches btw
[14:04:20] ctrl-a
[14:04:23] to reattach ro
[14:04:42] _joe_: okay, just don't run that cookbook?
[14:04:56] or will it return immediately?
[14:05:09] just skip it
[14:05:10] <_joe_> it should ask you if you really need to
[14:05:14] ack
[14:05:15] <_joe_> I'd just skip it
[14:05:34] confirmation to proceed please? :)
[14:05:41] <_joe_> don't skip the ttl reduction though
[14:05:42] <_joe_> :)
[14:06:06] <_joe_> go, it will make you wait
[14:06:12] can people please stop resizing :|
[14:06:39] I attached read only
[14:06:51] <_joe_> me too
[14:06:52] thanks
[14:08:19] next time we should aggressive-resize off
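For anyone following along, a sketch of the tmux commands behind the exchange above; the socket path and session name come from the log, the Ctrl-a prefix is specific to this session (the tmux default is Ctrl-b), and the aggressive-resize line simply reflects the suggestion made at 14:08:19:

```bash
# Attach read-only to the shared switchover session (-r = read-only),
# as requested at 14:03:01:
tmux -S /tmp/hnowlan-switchover attach-session -r -t switchover

# Detach again with the session's prefix key (Ctrl-a here) followed by d.

# Disable the aggressive-resize window option, as suggested at 14:08:19:
tmux -S /tmp/hnowlan-switchover set-option -wg aggressive-resize off
```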
[14:09:32] I put together a switchover TL;DR dashboard for anyone who wants just the highlights https://grafana-rw.wikimedia.org/d/ef6dcf76-29f3-437d-947a-73c76d9e367d/datacentre-switchover?orgId=1&refresh=1m&from=now-1h&to=now
[14:09:52] nice
[14:09:59] Cool, thanks effie
[14:12:21] okay, safe to proceed?
[14:12:49] hnowlan: good from my side
[14:12:50] go
[14:14:54] root@deploy2002:~# kube-env admin codfw; kubectl -n mw-cron delete cronjobs --all
[14:14:57] cronjob.batch "mediawiki-main-serviceops-version" deleted
[14:14:57] for reference
[14:15:11] alright, okay for read-only?
[14:15:14] let's go
[14:15:17] 👍
[14:15:53] <_joe_> go
[14:15:54] Checked all jobs/pods/cronjobs at 0, go
[14:15:56] 2 to 8 all in a quick succession if successful, as always.
[14:16:07] sounds off
[14:16:29] yep, eswiki is RO
[14:17:23] same for itwiki
[14:17:46] 🤌
[14:17:53] sound on
[14:17:54] sounds
[14:17:55] and back :)
[14:17:58] yaaaaaay
[14:18:00] nice
[14:18:00] <_joe_> yeahh
[14:18:04] RW again
[14:18:05] <_joe_> let's look at metrics
[14:18:12] eswiki getting writes too
[14:18:19] yeah enwiki is back
[14:19:06] mw-web metrics in eqiad looking good
[14:19:14] yeah surprisingly good
[14:19:25] and surprisingly fast
[14:19:27] mw-api int spiking
[14:19:30] <_joe_> metrics look cool
[14:19:41] steady at 60% sat
[14:19:41] <_joe_> claime: spiking because it's getting POST requests?
[14:19:45] jobrunner issues?
[14:20:01] could be restart-related
[14:20:02] claime: it had quite a lot of traffic on codfw ~5k rq/s
[14:20:11] it's just getting post requests from codfw
[14:20:18] well formerly from codfw
[14:20:21] within expectations
[14:20:26] eqiad db masters looking fine
[14:20:29] <_joe_> hnowlan: how's jobrunner affected?
[14:20:36] <_joe_> please share graphs
[14:20:45] <_joe_> anyone looking at mw errors?
[14:20:50] api-ext looking good
[14:20:51] this error spike https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?orgId=1&var-datasource=codfw+prometheus%2Fops&viewPanel=18&from=1742392244926&to=1742394044926
[14:20:55] already over
[14:21:01] eqiad db traffic https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=es6&var-shard=es7&var-shard=s1&var-shard=s2&var-shard=s3&var-shard=s4&var-shard=s5&var-shard=s6&var-shard=s7&var-shard=s8&var-shard=x1&var-shard=x2&var-role=All&from=now-30m&to=now
[14:21:37] <_joe_> hnowlan: probably jobs still running in codfw that failed because of ro database?
[14:21:44] lol the diff
[14:21:45] yeah
[14:21:50] <_joe_> just a hypothesis
[14:21:58] claime: XD
[14:22:20] we should probably inhibit it somehow
[14:22:37] <_joe_> claime: what?
[14:22:52] _joe_: the helmfile diff for mwcron
[14:23:06] test cronjob recreated btw
[14:23:10] <_joe_> ahh ok I was looking at dashboard
[14:23:11] nice
[14:23:14] <_joe_> and logstash
[14:23:23] <_joe_> everything seems... boring?
[14:23:34] boring is good
[14:23:41] I love boring
[14:24:11] _joe_: just as you said that somewhere a JCB driver in virginia just got the strangest urge
[14:24:26] haha
[14:24:27] ahahahha
[14:24:32] updating the master DB records
[14:24:51] k
[14:25:04] RO time this time?
[14:25:15] I think we may have a record
[14:25:36] 02:24.495804 from logged timestamps
[14:25:54] AntiComposite: thank you
[14:26:05] akosiaris: as the official RO timekeeper, do you agree?
[14:26:13] ahahahaha
[14:26:16] when did I get that hat?
[14:26:22] I tend to rely on volans for that every time
[14:26:24] Not a new record
[14:26:30] :D
[14:26:30] <_joe_> switchover time, offered by akosiaris
[14:26:36] but no, it's not a record
[14:26:39] but it's very good
[14:26:42] <_joe_> there is no point in chasing records
[14:26:49] <_joe_> we're also checking more now than in the past
[14:26:50] akosiaris: you may not have noticed, but you are almost always the one announcing the RO time
[14:26:53] I've a UI question
[14:27:09] Oh come on volans, we are celebrating!
[14:27:10] (record is 1:57 afaict)
[14:27:15] <_joe_> volans: you are not allowed any UX opinions after what you've done with the cumin "captcha"
[14:27:27] (but we don't care, anything under 3 minutes is *awesome*)
[14:27:35] marostegui: "what are you even saying, Riccardo??"
[14:27:36] during the RO period I tried to edit itwiki (visual editor) and I got a popup saying: Unable to stash Parsoid HTML (cancel retry)
[14:27:37] _joe_: as the owner of one of the longest RO durations, it is just pure curiosity :p
[14:27:49] <_joe_> effie: I'm the owner of the longest
[14:27:52] while wikitext edit was showing the usual banner
[14:28:00] <_joe_> and no one can take those 40 minutes away from me
[14:28:05] <_joe_> I should add it to my CV
[14:28:09] _joe_: does not count, you had like 10000 less automations
[14:28:14] fewer*
[14:28:16] on record yes. But it's faidon who holds the very first one and we didn't count. It was a very long one.
[14:28:22] effie: that's no way to refer to newer colleagues ;p
[14:28:29] lol
[14:28:33] elukey: phone
[14:28:40] <_joe_> Emperor: ahahahahaha
[14:28:54] Congratulations hnowlan
[14:29:01] well done!
[14:29:02] ty! glad it went well
[14:29:02] very nice! :)
[14:29:02] congrats all, nice switch!
[14:29:10] hnowlan: <3
[14:29:15] congrats! :D
[14:29:21] nice job team!
[14:29:28] <_joe_> ok now we can blow the rest of the error budget in cowboy deployments
[14:29:33] lmao
[14:29:42] * marostegui nervous seeing how hnowlan still has the 09-run-puppet-on-db-masters line ready but not yet done
[14:29:43] very "was that it?" vibes :D
[14:29:45] DB records changed
[14:29:57] TheresNoTime: that's what we're going for :D
[14:30:07] <_joe_> TheresNoTime: that's the sign SRE has done a good job
[14:30:08] marostegui: was just waiting for the authdns run to finish :D
[14:30:17] <_joe_> our best work is not noticed
[14:30:18] absolutely! :)
[14:30:20] hnowlan: Ah ok! Thanks :)
[14:30:57] yeah authdns-update is taking a while these days due to the ever-increasing zones. will put it in our to-do to look into it
[14:31:03] checking puppet failure on deploy1003
[14:31:26] gg hnowlan !
[14:31:28] (thankfully geodns depool is separate from that now :)
[14:35:02] edit to edit on wikidata was 02:02 (20250319141542 - 20250319141744)
[14:36:05] puppet failure on deploy1003 was transient and unrelated, run just finished clean
[14:36:29] yeah, the "official" number is always more than the actual one as we allow for some leeway for edits to finish.
[14:36:57] there was someone a few switchovers ago who was watching the global rcstream and came up with a more accurate number
[14:37:47] it was interesting to see the difference in measurement methodologies (and more importantly the fact that they cared so much for this). Need to dig up that correspondence
[14:39:05] yes the official value from the cookbooks is the worst-case scenario, on average it's slightly quicker
[14:39:15] any reason why we are seeing the "technical maintenance" notice in English? It is already translated
[14:39:27] marostegui: ahahahhahahaahahah
[14:39:45] irc.wikimedia.org was also working as expected during the failover (it's the first switchover since we started using Faidon's new stack instead of the old patched-ratbox crap)
[14:39:55] neat!
[14:40:56] hnowlan: great work! <3
[14:40:57] Nemoralis: if you could create a phab task with a report (if you have a screenshot all the merrier) of what you saw, we can route it to the proper people to look into it.
[14:41:10] elukey: <3
[14:41:14] {◕ ◡ ◕}
[14:41:15] cookbook all done
[14:41:22] unlocking scap
[14:41:26] congrats all, and thanks for your hard work
[14:48:12] thank you, hnowlan
[14:51:55] well done hnowlan :)
[14:57:09] akosiaris: https://phabricator.wikimedia.org/T389371
[14:58:26] Nemoralis: I've mentioned that to movement comms, hopefully they can route appropriately
[14:58:49] Nemoralis: thanks!
[15:38:39] MediaWiki read-only period starts at: 2025-03-19 14:15:30.955779 and MediaWiki read-only period ends at: 2025-03-19 14:17:55.451583. For those that want the absolutely (im)precise to the μicrosecond timestamps :P
[15:39:26] akosiaris: https://xkcd.com/2170/ :P
[15:39:50] lol
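The precise timestamps akosiaris posted at 15:38:39 line up with the 02:24.495804 figure AntiComposite quoted at 14:25:36; a quick sanity check of the arithmetic from a shell (assumes GNU date and bc, both standard on these hosts):

```bash
# Convert both logged timestamps to epoch seconds with microseconds, then subtract.
start=$(date -u -d '2025-03-19 14:15:30.955779' +%s.%6N)
end=$(date -u -d '2025-03-19 14:17:55.451583' +%s.%6N)
echo "$end - $start" | bc   # 144.495804 seconds, i.e. 2m24.495804s
```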
[15:43:07] haha, infrastructure club slack found out about the datacenter switchover and are enthusing about it a little
[15:45:42] anyway, would I be good to run my scripts for T380911 after the product/tech meeting? I'll run all the dry-runs, pop the output in the ticket, review carefully and then do the real deal with a colleague
[15:45:42] T380911: Run Flow migration script at *Phase 2b* wikis - https://phabricator.wikimedia.org/T380911
[15:46:49] zip: yep, should be fine. I'm just running a backport atm, but after that we're back to normal
[15:47:32] I appreciate it
[15:47:49] also it's handy to do this right after the datacenter move, so if I fuck up it'll look like it's not my fault
[15:48:07] * claime frowns
[15:48:10] :p
[15:48:50] :D
[15:58:27] <_joe_> zip: you've been onboarded well
[17:08:05] hmmm puppetmaster_web_frontend spec seems to be upset
[17:08:13] https://www.irccloud.com/pastebin/H7yNZzMb/
[17:08:29] jhathaway ^^
[17:08:43] vgutierrez: thanks
[17:09:33] where did this pop up?
[17:11:09] running ./utils/run_ci_locally.sh while testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129326/
[17:11:26] nod, thanks
[17:11:33] I'm guessing CI will fail due to that for my change
[17:13:23] hmm or not
[17:14:43] strange
[17:17:03] jhathaway: CI ran way fewer tests than my local run
[17:17:27] so the test is broken, but it didn't get executed during that CI test for my CR on gerrit
[17:27:52] I had the same error while testing my change; fixing the (unrelated) bit made it disappear
[17:28:13] I was confused because it wasn't related to my change in any way
[18:58:53] Hi. I wonder if anyone can help. I've got a pcc error against all snapshot servers, unable to find the spiderpig user: https://puppet-compiler.wmflabs.org/output/1129343/5108/
[19:00:54] btullis: that was broken earlier and then fixed -- try rebasing and see if that fixes you?
[19:01:19] Ah, OK. I thought I was branched off the current production. Will try again.
[19:01:20] (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129239 was the fix, as long as you rebase beyond that you should be good)
[19:01:29] I wonder if this would need syncing puppet facts to the compiler instances
[19:01:44] try rebase first, of course
[19:03:56] Yeah, same failure.
[19:05:05] mutante: Is it this one? https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Manually_update_production - Or I could just wait for a timer?
[19:05:09] mutante: I may have conflicted with your authdns-update run, sorry!
[19:05:40] mutante: hm, I don't know offhand why this would need a manual update, but we can try it
[19:06:35] I also don't know enough about the underlying work to know if this is plausibly an actual puppet bug instead of just a PCC artifact
[19:06:37] digging a little
[19:06:38] bblack: I think I am actually already merged and running it a second time was nothing on my side :)
[19:06:55] btullis: maybe, just maybe. ssh pcc-db1002.puppet-diffs.eqiad1.wikimedia.cloud sudo -u jenkins-deploy /usr/local/sbin/pcc_facts_processor
[19:07:01] https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Updating_nodes
[19:07:40] mutante: Trying it now. Thanks.
[19:07:59] Looks hopeful.
[19:09:04] btullis: there is "upload_puppet_facts systemd timer." but not sure how often it runs
[19:09:35] rzl: I can paste the command output, if it would help. But it mentions scap and spiderpig.
[19:10:32] anyone here related to the "jaeger" service by the way? you now have .svc. names in codfw
[19:10:57] so far it was eqiad-only afaict
[19:10:58] But no. My pcc run still fails.
[19:11:29] I think that c.danis is the most closely affiliated with jaeger.
[19:11:58] and is on sabbatical, yeah
[19:12:07] if you need anything in the meantime, ##wikimedia-tracing is the channel
[19:12:40] ok, thanks. I don't think I need something specific, but I will report it there.
[19:13:03] I just ran across this when trying to add a second service to the k8s-aux cluster
[19:16:16] re: spiderpig issue. it seems like the puppetdb (pcc-db1002) in the puppet-diffs project needs to FORGET about the resource User spiderpig
[19:16:33] btullis: this might be too obvious, but there is no user 'spiderpig' on that host? this might be related to the revert I saw in the morning but I know nothing beyond that
[19:16:35] that code in admin just goes through the list of all system users
[19:16:43] mutante: yeah
[19:16:49] and spiderpig isn't one anymore
[19:16:54] but apparently was for a short time
[19:17:07] but that project with the compilers has a local puppetdb
[19:17:25] yeah I think it's an actual puppet bug in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129289/2/modules/admin/data/data.yaml
[19:17:27] what I don't know is how and when it would forget resources
[19:17:48] we add spiderpig to the deployment group everywhere, but spiderpig only exists on the deploy hosts, not e.g. the snapshot hosts
[19:17:57] (even though the deployment group does)
[19:18:02] cc thcipriani in case you're around
[19:18:09] (not your code but you might have some context)
[19:18:35] so, not a pcc issue at all, puppet's just actually broken
[19:19:53] ah yea, this makes sense
[19:20:00] OK, many thanks for looking.
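The diagnosis above (the 'deployment' group is applied everywhere, while the spiderpig user only exists on the deploy hosts) could be confirmed with something along these lines; the `R:Group = deployment` PuppetDB selector is the same one used in the cumin command later in the log, but the rest is an illustrative sketch rather than what was actually run:

```bash
# On a cumin host: for every node that carries the 'deployment' group,
# report whether a local spiderpig user actually exists.
sudo cumin 'R:Group = deployment' \
    'getent passwd spiderpig || echo "deployment group present, no spiderpig user"'
```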
[19:20:36] we could revert https://gerrit.wikimedia.org/r/c/1129289 again, or we could take https://gerrit.wikimedia.org/r/c/1129239 further and create the user everywhere else it's needed -- I'm inclined to revert, I just want to make sure I know what'll happen if we do
[19:21:20] we could just remove the spiderpig user from the deployment group but keep the part where it gets reserved as a system user
[19:21:36] as a quick fix without another full revert
[19:22:05] cc dancy also, in case you're around :)
[19:24:07] the net impact is puppet hasn't run on any of the remaining bare-metal MW hosts in about three hours -- I'm comfortable with just reverting if no one is around with domain knowledge
[19:24:42] we can give it a try in #developer-experience in Slack
[19:25:53] o/
[19:25:57] reading
[19:26:02] that worked :)
[19:26:47] I am thinking that if we just remove the spiderpig user from the "deployment" group but leave the rest as is, it should fix the issue without reverting it all.
[19:27:18] like you would keep the reserved system user, just that it couldn't actually deploy, which it doesn't do yet anyway?
[19:30:10] That sounds good to me mutante
[19:34:22] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129356
[19:36:23] rzl: are you ok with that approach?
[19:38:01] sure
[19:38:31] ack! deploying
[19:39:11] not sure if you also need to drop the spiderpig from the line after that
[19:39:36] ah :) you'll find out I guess
[19:40:20] eh.. I don't think so.. but good point.
[19:40:38] sorry, meeting, reading
[19:41:04] assumption: a user in sudo files that doesn't exist should not break stuff
[19:41:43] re-compiling the puppet change on snapshot and running puppet on deploy1003
[19:42:13] btullis: https://puppet-compiler.wmflabs.org/output/1129343/5111/
[19:42:27] hrm, the privileges we need in the deploy group are the first one, mainly `(www-data,mwdeploy,scap) NOPASSWD: ALL`, is that true, dancy?
[19:42:47] removing from the group for now is preferable
[19:43:07] and we can figure out if there are permissions in that group that are needed and follow up
[19:43:28] nothing is broken in the immediate term by removing spiderpig from the deployment group (afaiu)
[19:45:50] compiling on snapshot hosts works again. running puppet on snapshot1010 as well. it added the spiderpig user to /etc/sudoers.d/deployment even though it does not exist and that does not seem to be a problem
[19:47:53] [cumin1002:~] $ sudo cumin 'R:Group = deployment' 'run-puppet-agent -q --failed-only'
[19:48:15] ^ running this now to fix puppet run on 32 nodes (also see https://puppetboard.wikimedia.org/)
[19:51:13] rzl: I think that's it for now. only 13 failed maps hosts now
[19:52:00] 👍
[19:52:04] unrelated "parameter 'realserver_ips' variant 1 expects a Hash value, got Array" :p