[09:22:15] <arturo> hello there, I'm not sure who is online today that can review/approve this:
[09:22:16] <arturo> https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/203
[09:24:27] <taavi> looking
[09:25:49] <taavi> arturo: something seems to have happened to the comments with `tofu plan`, they are very hard to read :/
[09:26:22] <arturo> taavi: I know. The gitlab support for code blocks inside comments, and specifically inside collapsed notes, is very limited
[09:27:02] <taavi> is there a reason why there's a rule for v4-traffic from the dualstack network but not for v6-traffic from it?
[09:27:29] <arturo> no, I think that's a good point, I will add it now
[09:33:27] <arturo> done
[09:37:53] <arturo> the tofu plan diff is so big because rules are reordered in the tofu state array, so instead of seeing +1 rule, it sees a massive array sorting operation
[09:40:07] <taavi> does that mean tofu will delete and re-create those rules? :(
[09:40:40] <taavi> we should maybe give those resources stable names so that does not happen in the future
[09:42:01] <arturo> yes, instead of an array, use a map
[09:42:05] <arturo> that's the way to avoid this
[09:43:45] <arturo> yes, rules will be re-created
[09:43:56] <arturo> also, the openstack API doesn't help either
[09:44:14] <arturo> as a simple description change forces a replacement
[09:44:16] <arturo> https://www.irccloud.com/pastebin/fShXOoJN/
[10:53:57] <arturo> taavi: I need to merge that tofu-infra patch to test a few things
[10:59:43] <taavi> arturo: ack. do you plan to fix https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/203 to use a map or leave that for later?
[11:00:30] <arturo> taavi: We would need to relocate the state objects. There be dragons, we can do it later.
[11:01:24] <taavi> yeah. i'm not super happy about re-creating some of those rules, but guess that it's fine
[11:02:37] <taavi> that diff is really annoying to parse though
[11:02:42] <taavi> anyway, approved
[11:02:57] <arturo> thanks
[11:04:52] <arturo> tofu apply just completed, no issues
[11:10:49] <taavi> anything left to do in T380728?
[11:10:49] <stashbot> T380728: openstack: network problems when introducing new networks - https://phabricator.wikimedia.org/T380728
[11:12:02] <arturo> no, I will close it
[11:14:33] <arturo> taavi: I would like to merge this one next: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/204
[11:15:03] <arturo> I'm undecided if we need to schedule an operation window
[11:16:18] <taavi> maybe not, but i'd also not deploy that on a day when this many people are not around
[11:20:49] <arturo> fair
[11:33:41] <arturo> today is a global holiday in the WMF, so I guess we will introduce this change on Wednesday
[12:03:53] <arturo> topranks: does this look good to you? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/205
[12:14:28] <taavi> i left a comment
[12:25:10] <arturo> thanks
[12:31:38] <andrewbogott> I think today isn't a global holiday but tomorrow is?
[12:31:53] <andrewbogott> But today is probably a holiday in most european countries so it might as well be :)
[12:33:24] <arturo> andrewbogott: if you are online today, then I will definitely reconsider the IPv6 thing :-)
[12:34:12] <andrewbogott> I will be a bit distracted but am definitely working today.
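
(Aside: the state relocation arturo alludes to at 11:00 would look roughly like the sketch below once the rules are keyed by a map (for_each) instead of a list. The resource addresses and map key are hypothetical, not taken from the actual tofu-infra code.)

    # Sketch only: move a list-indexed secgroup rule to a stable map key so a
    # reorder no longer forces delete/re-create. Addresses are made up.
    tofu state mv \
      'openstack_networking_secgroup_rule_v2.rules[3]' \
      'openstack_networking_secgroup_rule_v2.rules["allow-v6-from-dualstack"]'

    # Repeat per rule, then confirm the plan is a no-op:
    tofu plan
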
[12:39:25] <topranks> arturo: hey, I'm not working today
[12:39:41] <topranks> that dns change looks correct though, for the two /64s in question
[12:39:44] <topranks> we need to merge this one:
[12:39:45] <topranks> https://gerrit.wikimedia.org/r/c/operations/dns/+/1113527
[12:40:16] <topranks> which delegates the entire /56, but as I recall from last time designate will only set up zones for the subnets in use. should be fine I think.
[12:42:59] <andrewbogott> We should probably wait until Cathal is also around before we roll things out.
[13:29:27] <arturo> ack
[14:55:39] <arturo> chuckonwu: I just made a small change to https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/2 in order to make the pipeline green; after that it's all yours
[14:58:48] <chuckonwu> 👍 arturo, I'm watching the changes
[15:05:02] <taavi> andrewbogott: I left comments on https://gitlab.wikimedia.org/repos/cloud/cloud-vps/go-cloudvps/-/merge_requests/2. note how that branch was rebased to pick up the new CI pipeline
[15:05:41] <taavi> not touching the tofu-cloudvps patch yet, that will need to be merged/tagged first
[15:05:50] <arturo> chuckonwu: I have now merged https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/2 and I will stop making changes. All yours now
[15:07:50] <chuckonwu> Thanks arturo
[15:07:51] * arturo notices the typo in the commit title too late
[18:36:46] <taavi> andrewbogott: it seems like both quarry workers have now filled their disks :(
[18:39:21] <taavi> let's see what happens when I tell magnum to provision more nodes
[18:44:00] <andrewbogott> did rebooting really help before, or were we fooling ourselves?
[18:45:14] <taavi> it helped temporarily, but did not address the root cause
[18:48:17] <taavi> andrewbogott: unsurprisingly, trying to resize the cluster has done absolutely nothing
[18:48:35] <andrewbogott> that's interesting. No warning or anything?
[18:48:41] <taavi> nothing i can see
[18:48:58] <taavi> i don't really see a way forward except somehow getting shell access to the cluster, or just completely nuking it and creating a fresh one
[18:50:06] <andrewbogott> I can get console access but that only helps if they have a default account + pwd...
[18:50:10] * andrewbogott checks the console just in case
[18:51:10] <andrewbogott> yeah
[18:51:10] <andrewbogott> quarry-127a-g4ndvpkr5sro-master-0 login:
[18:52:58] <andrewbogott> actually magnum thinks it's resizing...
[18:53:12] <andrewbogott> node count 4, update in progress
[18:53:18] <Rook> I can't tell if yinz have kubectl access, but if you do you can launch a debug pod to get host access with chroot /host
[18:55:18] <andrewbogott> I think we do -- taavi, do you?
[18:57:21] <andrewbogott> taavi, the kube config is quarry-bastion.quarry.eqiad1.wikimedia.cloud:/home/rook/quarry/tofu/kube.config
[18:57:24] <taavi> yeah, let me try that
[18:59:03] <taavi> aha
[18:59:06] <taavi> found the issue
[18:59:09] <taavi> quarry is leaking tmp files
[18:59:18] <andrewbogott> so that's why rebooting helped
[18:59:26] <taavi> for reference: $ kubectl debug node/quarry-127a-g4ndvpkr5sro-node-0 -it --image debian:stable
[19:00:16] <taavi> worker-0 should have more free space now
[19:00:19] * andrewbogott thinks there must be newer quarry docs than https://wikitech.wikimedia.org/wiki/Quarry/Quarry_maintenance_and_administration
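
(Aside: the debug-pod approach Rook and taavi use above, spelled out as a rough sequence. The node name is the one from the log; the df/du commands are illustrative assumptions about how one might find what is filling the disk.)

    # Launch a debug pod on the full worker node (command taavi quotes above):
    kubectl debug node/quarry-127a-g4ndvpkr5sro-node-0 -it --image debian:stable

    # Inside the debug pod the node's root filesystem is mounted at /host:
    chroot /host
    df -h /
    du -xsh /tmp/* 2>/dev/null | sort -h | tail    # look for leaked tmp files
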
[19:01:11] <taavi> hmm, apparently i can't launch a debug pod on worker-1 because its disk is too full
[19:02:32] <andrewbogott> want me to reboot it?
[19:02:46] <andrewbogott> Rook: are there admin docs anywhere?
[19:02:51] <taavi> sure, we can give it a try
[19:03:44] <Rook> Probably the readme would be the most up to date
[19:04:33] <andrewbogott> 'k
[19:05:36] <andrewbogott> taavi: did the reboot help? Looks like you already tried a few minutes ago
[19:09:31] <taavi> andrewbogott: no
[19:10:32] <andrewbogott> I can't tell if it actually rebooted
[19:11:45] <andrewbogott> looks like it did
[19:16:54] <andrewbogott> time to redeploy, or do you still have ideas?
[19:19:41] <taavi> nope
[19:21:28] <andrewbogott> do you think we can/should try to fix the leak before we deploy?
[19:21:47] <taavi> nah, we can do it later
[19:25:15] * andrewbogott increases to 4 workers while we're at it
[19:25:58] <taavi> don't think that's really necessary
[19:26:01] <taavi> 3 maybe, but 4 seems overkill
[19:26:51] <andrewbogott> we are not short on compute resources!
[19:27:09] <andrewbogott> but 3 is fine with me, I just want us to have room to maneuver
[19:30:20] <andrewbogott> can you tell why it can't push to quay? Expired token maybe?
[19:31:22] <taavi> what? where?
[19:32:14] <andrewbogott> https://github.com/toolforge/quarry/pull/77
[19:33:17] <taavi> no idea, and i don't seem to be a member of https://quay.io/organization/wikimedia-quarry
[19:33:46] <andrewbogott> me neither, I think
[19:33:56] <andrewbogott> rook, can you add us?
[19:34:35] <Rook> I have no idea if I still have access to that. Isn't it in some larger wiki group?
[19:35:15] <andrewbogott> I thought it would be but seems not
[19:36:23] <Rook> Oh, I do still have access. Let's see...
[19:39:00] <Rook> Ok andrewbogott, taavi, did yinz get an invite link or something?
[19:39:15] <andrewbogott> yes
[19:40:52] <Rook> Excellent
[19:42:26] <andrewbogott> and now I can push!
[19:44:45] <andrewbogott> Rook: What happens if I run deploy.sh? Will it delete and replace the existing deployment? Are we set up to do a proper blue/green in quarry or do you usually just delete/replace?
[19:49:09] * andrewbogott is going to find out
[19:50:22] * andrewbogott predicts that this will do nothing at all
[20:09:54] <andrewbogott> indeed
[20:10:26] <andrewbogott> so now I'm stuck on the question: Is this stateless enough that I can just delete the magnum cluster and start over? I'm pretty sure the answer is 'yes' but I don't like deciding that on my own
[20:11:06] <Rook> Yeah, it will just deploy as usual. It has the same blue/green deploy as paws. You have to set up the new cluster first for a blue/green
[20:12:11] <andrewbogott> how do I tell it to deploy to a new cluster rather than update the existing one?
[20:12:18] <Rook> I believe you can do a usual blue/green without much more than people needing to log back in. The state lives in NFS
[20:13:26] <Rook> Like paws. Duplicate the tf file that deploys the cluster and update the name. Be sure to remove the kube config from the current one
[20:14:41] <andrewbogott> 'k
[20:27:00] <andrewbogott> hm, this is going very poorly so far
[20:27:54] <andrewbogott> network name changes
[20:36:05] <andrewbogott> now the new cluster shows as create_in_progress, which seems hopeful
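
(Aside: Rook's blue/green recipe from 20:13, spelled out as a rough sequence. The file names, cluster names, and the KUBECONFIG handling are assumptions for illustration; the actual quarry repo may differ.)

    # Rough sketch of the blue/green cluster swap; names below are made up.
    cd quarry/tofu
    cp cluster-127a.tf cluster-127b.tf   # duplicate the tf file that deploys the cluster
    $EDITOR cluster-127b.tf              # give the new cluster its own name
    tofu apply                           # creates the new cluster alongside the old one

    # Fetch a kubeconfig for the new cluster from magnum and deploy into it:
    openstack coe cluster config quarry-127b --dir ./new-cluster
    KUBECONFIG=./new-cluster/config ./deploy.sh

    # Once it looks healthy, repoint quarry.wmcloud.org at the new cluster and
    # tear down the old one.
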
[21:14:17] <andrewbogott> taavi: I've deployed the new three-node cluster and pointed quarry.wmcloud.org at it. It seems... fine? If it stays fine for a day or two
[21:14:30] <andrewbogott> I'll tear down the old one and get these (minor) changes merged.
[21:14:54] <andrewbogott> bd808: I'm also interested in your opinion about the current state since you were first to notice last time Quarry broke
[21:19:35] <bd808> I noticed because I watch the Phabricator feed for cloud things.
[21:22:09] <bd808> It looks like stuff is happening at https://quarry.wmcloud.org/query/runs/all
[21:32:15] <andrewbogott> Rook: predictably, all of your deployment code worked like magic once I caught up with the new network name. I'd really appreciate it if you read through my hurriedly-written docs about blue/green deployment in https://github.com/toolforge/quarry/pull/79 and comment if any of what I'm saying sounds wrong. No rush on that though!
[21:33:36] <andrewbogott> I'm especially interested in if I have defied convention in my understanding of which is blue and which is green
[21:34:36] <andrewbogott> thank you for the after-hours work taavi!
[21:34:51] <andrewbogott> I'm going to go take a walk if things don't crash in the next 3-4 minutes