[01:34:07] <andrewbogott>	 ta.avi: turns out that 'helm push' just plain doesn't work with chartmuseum; you have to install a special plugin and then do 'helm cm-push'.  ¯\_(ツ)_/¯
[07:35:00] <Rook>	 Could I get some access to check on PAWS?
[07:52:23] <dcaro>	 hi! \o
[07:52:56] <dcaro>	 I think andrewbogot.t was double checking with legal for NDAs and such to be all in order, on jupyterhub you have admin already (as far as I can see)
[07:53:26] <dcaro>	 hmm, I think it's your WMF account though
[08:22:34] <Rook>	 Cool, good to know. Yeah I don't think it's my current account, and admin on that level doesn't really let you do much admin stuff. It only gives you some interface admin abilities.
[08:37:44] <RhinosF1>	 Anyone want to merge & deploy https://github.com/toolforge/paws/pull/488/ for https://phabricator.wikimedia.org/T394614 while you're talking about PAWS?
[08:41:54] <taavi>	 i'll look after CI passes
[08:45:31] <RhinosF1>	 taavi: it passed
[08:45:47] <dcaro>	 taavi: the dualstack network for VMs is `VXLAN/IPv6-dualstack`, with the IPv6  prefix right?
[08:45:56] <taavi>	 dcaro: yes
[08:46:04] <dcaro>	 okok, I'll fix the cookbook
[08:46:08] <taavi>	 what's broken?
[08:46:19] <dcaro>	 k8s node creation
[08:46:22] <dcaro>	 (for toolforge)
[08:46:23] <taavi>	 RhinosF1: if I say I'm looking at something you don't need to ping me a minute after it happens
[08:47:12] <taavi>	 "This branch cannot be rebased due to conflicts" thanks github
[08:48:05] <taavi>	 aaand now we wait for the pipeline again
[08:48:44] <taavi>	 I also love how github just runs the "build and push container image" pipeline on its own but I specifically need to approve running helm lint
[08:49:08] <RhinosF1>	 I've never found the approve running workflows very useful
[08:57:11] <Rook>	 yinz can enable universal running of pipelines on pr. There is some security concern of intentionally malicious code being run in an attempt to get access to things like tokens for quay.io so I set it to only known people should it run automatically, though that was if I recall a little odd
[09:06:30] <taavi>	 RhinosF1: deployed
[09:06:45] <RhinosF1>	 thanks!
[09:19:00] <dcaro>	 hmm... the cookbook failed to create a new k8s nfs worker, I think it might have gotten confused by the os drive being `sdb` instead of `sda`, and also the puppetmaster did not change and it tried to ssh to it
[09:19:22] <taavi>	 can you paste the full output?
[09:21:51] <dcaro>	 https://www.irccloud.com/pastebin/DtUUxEVz/
[09:27:05] <taavi>	 looking at the VM console in horizon, the puppet issue is also caused by that coming up using /dev/sdb as the root disk instead of sda
[09:27:13] <taavi>	 the kernel does that sometimes for some reason
[09:27:39] <dcaro>	 hmm.... so there's some puppet code depending on sdb
[09:28:01] <dcaro>	 got the fix on the cookbook side (untested) https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1173897
[09:28:45] <dcaro>	 looking into the puppet one
[09:29:19] <taavi>	 the puppet issue is literally just a check that `sda` cannot be a cinder volume: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/cinderutils/manifests/ensure.pp#72
[09:32:37] <dcaro>	 hmmm
[09:33:10] <dcaro>	 the cookbook mounts the volume, so puppet will not get to that point after, but it has to run before I think
[09:33:24] <dcaro>	 let me check/try
[09:33:35] <taavi>	 the initial puppet run runs before that part of the cookbook runs
[09:35:18] <dcaro>	 hmmm.... the cookbook also seems to expect it to be already mounted, so the patch I sent will not help
[09:35:25] <dcaro>	 looking to fix on the puppet side
[09:35:50] <taavi>	 i wonder whether it's easier to just get rid of this vm and try again, if this is the first time out of hundreds of k8s workers that we've seen this happen
[09:36:58] <dcaro>	 the drive flip has happened also with ceph osds and such quite commonly
[09:37:01] <dcaro>	 but sure
[09:37:09] <dhinus>	 I don't think I've ever seen it on openstack vms though
[09:37:32] <dhinus>	 but the reason could be the same
[09:38:59] <taavi>	 this is why you rely on drive/part uuids everywhere you can, not drive letters
[09:39:20] <dhinus>	 I just checked a random VM and /etc/fstab has a UUID for the root drive
[09:39:59] <dhinus>	 does it get created by that file you linked? (ensure.pp)
[09:40:11] <taavi>	 root drive? definitely not
[09:40:23] <taavi>	 that class is for automatic cinder volume mounting
[09:40:40] <dhinus>	 ok, makes sense. I didn't even know it existed :)
[09:40:51] <taavi>	 (it's a wrapper for wmcs-prepare-cinder-volume, which then does use UUIDs for things it adds to fstab)
[09:41:03] <taavi>	 but the root volume is handled somewhere in cloud-init iirc
[09:41:05] <dcaro>	 yep, not sure we can know the uuid in advance
[09:41:41] <dcaro>	 hmm, what about just dropping the `!= sda/vda` restriction? if it's not mounted, that should be enough no?
[09:42:03] <dcaro>	 worst case scenario, I can re-check if it has partitions
[09:42:26] <dhinus>	 I was thinking the same...
[09:42:27] <taavi>	 hmm
[09:42:30] <taavi>	 i guess that's fine
[09:42:38] <taavi>	 although wmcs-prepare-cinder-volume then has that same check
[09:42:47] <dhinus>	 :/
[09:43:22] <dcaro>	 xd
[09:44:15] <dhinus>	 I have a vague memory that this drive ordering issue started with bookworm, but I might be wrong
[09:44:40] <taavi>	 "predictable drive lettering"
[09:44:43] <dcaro>	 hmm.... in which case is the root drive not mounted, and puppet running?
[09:44:59] <dhinus>	 taavi: yes
[09:45:10] <taavi>	 dcaro: i don't think that can happen
[09:45:26] <dcaro>	 so we can safely remove that from the prepare script too
[09:45:34] <dcaro>	 (imo)
[09:46:00] * dhinus agrees, but is also a little scared :P
[09:46:14] <dcaro>	 it already skips if it's mounter, if it has children, or if it's formatted
[09:46:45] <dhinus>	 yes that's true
[09:47:28] <taavi>	 this feels scary but that seems correct
[09:50:36] <dhinus>	 off topic, I would like to deploy https://github.com/toolforge/paws/pull/495 -- should I deploy and merge, or merge and deploy?
[09:51:15] <taavi>	 deploy and merge is what I've been doing
[09:51:22] <dhinus>	 ack
[09:51:28] <taavi>	 sorry, merge and deploy
[09:51:38] <dhinus>	 ah ok, otherwise I had a follow-up question
[09:51:54] <dhinus>	 but if we're all fine with merge and deploy it sounds easier
[09:52:27] <Rook>	 If that is the current preference the docs may need updated, it was built around deploy, verify, merge as things get rebuild on a revert
[09:53:13] <dcaro>	 ^ I think deploy then merge is a nicer workflow
[09:53:28] <dhinus>	 I'm not finding that in the docs anymore, though. I remember it was like that for Quarry
[09:53:51] <Rook>	 https://wikitech.wikimedia.org/wiki/PAWS/Admin mentions it for cluster updates.
[09:54:22] <dhinus>	 Rook: ah yes, "When you are satisfied with this Merge the PR."
[09:54:34] <Rook>	 👍
[09:54:42] <dhinus>	 then my follow-up question is, how do we avoid clashes when checking out branches in the bastion?
[09:55:01] <dhinus>	 i.e. how do we avoid someone checking out a different branch just while I'm deploying?
[09:55:34] <dhinus>	 I guess we could coordinate in this IRC channel but that feels suboptimal
[09:56:01] <Rook>	 In the years that I had been working on it, that had never happened. I had considered it, but the lack of it happening resulted in it being a "I guess we'll look at that if we come to it" kind of problem
[09:56:21] <dhinus>	 I think you were doing 99% of deploys though :)
[09:56:42] <Rook>	 True, maybe it is now an issue and thus deserves attention
[10:09:31] <dhinus>	 ok I'm gonna try deploy and merge to see how that works
[10:12:31] <dhinus>	 I had to "git checkout branchname + git fetch + git reset --hard origin/branchname", which is a bit cumbersome
[10:12:39] <dhinus>	 but that worked
[10:13:30] <dhinus>	 I also have to remember to merge the PR, or the next deploy will undo my change
[10:14:04] <dhinus>	 overall, I think I prefer the merge+apply workflow, but that's just my personal preference
[10:15:57] <Rook>	 `git pull ; git checkout <branchname>` doesn't work? 
[10:16:47] <Rook>	 I would agree merge then apply is easier to get to an apply, though if the apply fails, and much of the verification of success is post apply, reverting is not very pretty from that state
[10:17:18] <Rook>	 Meanwhile if an apply doesn't work on a branch a revert is just `git checkout main; deploy.sh ...`
[10:17:29] <dcaro>	 it also leaves more commits that did not work in the history
[10:19:44] <dcaro>	 you can also `git fetch && git reset --hard origin/your_custom_branch && deploy.sh...` on the main branch directly, and for revert `git reset --hard origin/main && deploy.sh...`, if the merge is fast-forward, the main branch will not need anything after merge (git fetch will just show the main branch up to date)
[10:20:13] <dhinus>	 Rook: yes I think if I did "pull" first, I would have not needed reset
[10:20:50] <dhinus>	 agreed about the ease of revert, but I'm not sure if it's enough of a "pro", against the various "cons"
[10:20:58] <dhinus>	 maybe if we had more automation I could change my mind :)
[10:21:18] <Rook>	 And the revert will rebuild everything in that day's perspective, so it remains possible that an unrelated library would get added and break things. There were limitations in github actions on preventing ci from running when using the pullrequesttarget options
[10:22:04] <Rook>	 I'm confused by the need for the resets. A branch is prevented from merging unless it is updated to main already, so shouldn't everything just be an extension of main when being deployed regardless?
[10:22:32] <dhinus>	 I think that's because I had to rebase that branch
[10:23:01] <dhinus>	 again something that does not happen if you have a single developer, but becomes common when you have multiple people merging&deploying
[10:23:40] <dhinus>	 when I did "checkout branchname", it checked out the tip of the branch _before_ my rebase
[10:24:06] <Rook>	 Oh I see, I think I don't notice because I have a function for `git checkout main ; git branch -D $1 ; git pull ; git checkout $1`
[10:24:45] <dhinus>	 yes I think if we want to use this workflow we need to have some form of automation, at least some simple scripts
[10:25:18] <Rook>	 Which part should have additional automation?
[10:25:48] <dhinus>	 separate topic: my change removing "NodePort" made Helm default to a service of type "LoadBalancer", which interestingly interacted with Octavia and created an actual LoadBalancer in OpenStack :)
[10:26:21] <Rook>	 That's neat
[10:26:41] <dhinus>	 I was pleasantly surprised, but also I don't think we need one in this instance, as it's exposed through the ingress
[10:26:51] <taavi>	 huh
[10:26:55] <dhinus>	 I'll check if I can force the service to be "type: ClusterIP"
[10:26:57] <taavi>	 I think that should be a type: ClusterIP instead
[10:27:16] <dhinus>	 taavi: yes, I just need to figure out how to convince Helm to create it as such
[10:27:44] <dhinus>	 https://z2jh.jupyter.org/en/stable/resources/reference.html#proxy-service-type
[10:28:02] <dhinus>	 this is where I should NOT have merged so I could patch my PR :P
[10:28:24] <taavi>	 well you had it applied to the cluster at some point, so I think it's good that it's in the history
[10:29:09] <dhinus>	 true that, and also a good example why I don't see a big issue with having some more commits in the git history, if they were applied at some point
[10:29:22] <dhinus>	 I'm going for lunch, will open a new PR later to set it to ClusterIP
[11:08:47] <dcaro>	 It's a commit we should not go back to, imo the got history serves better as points we know the cluster worked, not as SAL registry (that fits better somewhere else imo)
[12:38:42] <dcaro>	 fyi creating another tools nfs worker passed now 🎉
[12:49:04] <dhinus>	 can I get a +1 here? https://github.com/toolforge/paws/pull/496
[12:52:00] <taavi>	 yes
[12:53:19] <dcaro>	 and now it failed in the same way
[12:55:34] <dhinus>	 dcaro: it's strange that we never had this problem and now we get it twice in a row... but I don't have an explanation
[13:04:57] <dhinus>	 the paws change worked: the service is now ClusterIP, and the Octavia load balancer was deleted
[14:29:38] <dcaro>	 I was able to create two workers in a row now :/
[14:30:05] <dcaro>	 (~66% success rate)
[15:14:48] <dcaro>	 random note, I've been using vscodium for a few days, so far so good :)
[15:18:50] <dcaro>	 quick review? https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1173890
[15:18:53] <dcaro>	 (tested already)
[15:19:06] <dcaro>	 "server_create: use the right name for the dualstack network"
[15:20:34] <taavi>	 whoops, ship it
[15:22:06] <dcaro>	 thanks!
[15:30:29] <andrewbogott>	 should I restart those stuck tools-nfs workers?
[15:31:32] <dcaro>	 andrewbogott: yes please, I'm waiting for the day that I get too annoyed and do some automation for it
[15:31:42] <andrewbogott>	 ok!
[15:32:08] <dcaro>	 maybe that'd be a good first task, to create a cookbook to restart all stuck workers
[15:51:05] <andrewbogott>	 yeah, could be
[15:57:18] <dcaro>	 it could potentially be the first one to use prometheus to fetch the list of nodes to act on
[16:43:08] * dhinus off
[17:08:16] * dcaro off
[17:08:18] <dcaro>	 cya tomorrow!