[09:22:15] <arturo> hello there, I'm not sure who is online today that can review/approve this:
[09:22:16] <arturo> https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/203
[09:24:27] <taavi> looking
[09:25:49] <taavi> arturo: something seems to have happened to the comments with `tofu plan`, they are very hard to read :/
[09:26:22] <arturo> taavi: I know. The gitlab support for code blocks inside comments, and specifically inside collapsed notes, is very limited
[09:27:02] <taavi> is there a reason why there's a rule for v4-traffic from the dualstack network but not for v6-traffic from it?
[09:27:29] <arturo> no, I think that's a good point, I will add it now
[09:33:27] <arturo> done
[09:37:53] <arturo> the tofu plan diff is so big because rules are reordered in the tofu state array, so instead of seeing +1 rule, it sees a massive array sorting operation
[09:40:07] <taavi> does that mean tofu will delete and re-create those rules? :(
[09:40:40] <taavi> we should maybe give those resources stable names so that does not happen in the future
[09:42:01] <arturo> yes, instead of an array, use a map
[09:42:05] <arturo> that's the way to avoid this
[09:43:45] <arturo> yes, rules will be re-created
[09:43:56] <arturo> also, the openstack API doesn't help either
[09:44:14] <arturo> as a simple description change forces a replacement
[09:44:16] <arturo> https://www.irccloud.com/pastebin/fShXOoJN/
[10:53:57] <arturo> taavi: I need to merge that tofu-infra patch to test a few things
[10:59:43] <taavi> arturo: ack. do you plan to fix https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/203 to use a map or leave that for later?
[11:00:30] <arturo> taavi: We would need to relocate the state objects. There be dragons, we can do it later.
[11:01:24] <taavi> yeah. i'm not super happy about re-creating some of those rules, but guess that it's fine
[11:02:37] <taavi> that diff is really annoying to parse though
[11:02:42] <taavi> anyway, approved
[11:02:57] <arturo> thanks
[11:04:52] <arturo> tofu apply just completed, no issues
[11:10:49] <taavi> anything left to do in T380728?
[11:10:49] <stashbot> T380728: openstack: network problems when introducing new networks - https://phabricator.wikimedia.org/T380728
[11:12:02] <arturo> no, I will close it
[11:14:33] <arturo> taavi: I would like to merge this one next: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/204
[11:15:03] <arturo> I'm undecided if we need to schedule an operation window
[11:16:18] <taavi> maybe not, but i'd also not deploy that on a day when this many people are not around
[11:20:49] <arturo> fair
[11:33:41] <arturo> today is a global holiday in the WMF, so I guess we will introduce this change on Wednesday
[12:03:53] <arturo> topranks: does this look good to you? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/205
[12:14:28] <taavi> i left a comment
[12:25:10] <arturo> thanks
[12:31:38] <andrewbogott> I think today isn't a global holiday but tomorrow is?
[12:31:53] <andrewbogott> But today is probably a holiday in most european countries so it might as well be :)
[12:33:24] <arturo> andrewbogott: if you are online today, then I will definitely reconsider the IPv6 thing :-)
[12:34:12] <andrewbogott> I will be a bit distracted but am definitely working today.
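
(Aside: the state relocation arturo alludes to at 11:00 would look roughly like the sketch below once the rules are keyed by a map (for_each) instead of a list. The resource addresses and map key are hypothetical, not taken from the actual tofu-infra code.)

    # Sketch only: move a list-indexed secgroup rule to a stable map key so a
    # reorder no longer forces delete/re-create. Addresses are made up.
    tofu state mv \
      'openstack_networking_secgroup_rule_v2.rules[3]' \
      'openstack_networking_secgroup_rule_v2.rules["allow-v6-from-dualstack"]'

    # Repeat per rule, then confirm the plan is a no-op:
    tofu plan
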
[12:39:25] <topranks> arturo: hey, I'm not working today
[12:39:41] <topranks> that dns change looks correct though, for the two /64s in question
[12:39:44] <topranks> we need to merge this one:
[12:39:45] <topranks> https://gerrit.wikimedia.org/r/c/operations/dns/+/1113527
[12:40:16] <topranks> which delegates the entire /56, but as I recall from last time designate will only set up zones for the subnets in use. should be fine I think.
[12:42:59] <andrewbogott> We should probably wait until Cathal is also around before we roll things out.
[13:29:27] <arturo> ack
[14:55:39] <arturo> chuckonwu: I just made a small change to https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/2 in order to make the pipeline green; after that it's all yours
[14:58:48] <chuckonwu> 👍 arturo, I'm watching the changes
[15:05:02] <taavi> andrewbogott: I left comments on https://gitlab.wikimedia.org/repos/cloud/cloud-vps/go-cloudvps/-/merge_requests/2. note how that branch was rebased to pick up the new CI pipeline
[15:05:41] <taavi> not touching the tofu-cloudvps patch yet, that will need to be merged/tagged first
[15:05:50] <arturo> chuckonwu: I have now merged https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/2 and I will stop making changes. All yours now
[15:07:50] <chuckonwu> Thanks arturo
[15:07:51] * arturo notices the typo in the commit title too late
[18:36:46] <taavi> andrewbogott: it seems like both quarry workers have now filled their disks :(
[18:39:21] <taavi> let's see what happens when I tell magnum to provision more nodes
[18:44:00] <andrewbogott> did rebooting really help before, or were we fooling ourselves?
[18:45:14] <taavi> it helped temporarily, but did not address the root cause
[18:48:17] <taavi> andrewbogott: unsurprisingly, trying to resize the cluster has done absolutely nothing
[18:48:35] <andrewbogott> that's interesting. No warning or anything?
[18:48:41] <taavi> nothing i can see
[18:48:58] <taavi> i don't really see a way forward except somehow getting shell access to the cluster, or just completely nuking it and creating a fresh one
[18:50:06] <andrewbogott> I can get console access but that only helps if they have a default account + pwd...
[18:50:10] * andrewbogott checks the console just in case
[18:51:10] <andrewbogott> yeah
[18:51:10] <andrewbogott> quarry-127a-g4ndvpkr5sro-master-0 login:
[18:52:58] <andrewbogott> actually magnum thinks it's resizing...
[18:53:12] <andrewbogott> node count 4, update in progress
[18:53:18] <Rook> I can't tell if yinz have kubectl access, but if you do you can launch a debug pod to get host access with chroot /host
[18:55:18] <andrewbogott> I think we do -- taavi, do you?
[18:57:21] <andrewbogott> taavi, the kube config is quarry-bastion.quarry.eqiad1.wikimedia.cloud:/home/rook/quarry/tofu/kube.config
[18:57:24] <taavi> yeah, let me try that
[18:59:03] <taavi> aha
[18:59:06] <taavi> found the issue
[18:59:09] <taavi> quarry is leaking tmp files
[18:59:18] <andrewbogott> so that's why rebooting helped
[18:59:26] <taavi> for reference: $ kubectl debug node/quarry-127a-g4ndvpkr5sro-node-0 -it --image debian:stable
[19:00:16] <taavi> worker-0 should have more free space now
[19:00:19] * andrewbogott thinks there must be newer quarry docs than https://wikitech.wikimedia.org/wiki/Quarry/Quarry_maintenance_and_administration
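
(Aside: the debug-pod approach Rook and taavi use above, spelled out as a rough sequence. The node name is the one from the log; the df/du commands are illustrative assumptions about how one might find what is filling the disk.)

    # Launch a debug pod on the full worker node (command taavi quotes above):
    kubectl debug node/quarry-127a-g4ndvpkr5sro-node-0 -it --image debian:stable

    # Inside the debug pod the node's root filesystem is mounted at /host:
    chroot /host
    df -h /
    du -xsh /tmp/* 2>/dev/null | sort -h | tail    # look for leaked tmp files
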
[19:01:11] <taavi> hmm, apparently i can't launch a debug pod on worker-1 because its disk is too full
[19:02:32] <andrewbogott> want me to reboot it?
[19:02:46] <andrewbogott> Rook: are there admin docs anywhere?
[19:02:51] <taavi> sure, we can give it a try
[19:03:44] <Rook> Probably the readme would be the most up to date
[19:04:33] <andrewbogott> 'k
[19:05:36] <andrewbogott> taavi: did the reboot help? Looks like you already tried a few minutes ago
[19:09:31] <taavi> andrewbogott: no
[19:10:32] <andrewbogott> I can't tell if it actually rebooted
[19:11:45] <andrewbogott> looks like it did
[19:16:54] <andrewbogott> time to redeploy, or do you still have ideas?
[19:19:41] <taavi> nope
[19:21:28] <andrewbogott> do you think we can/should try to fix the leak before we deploy?
[19:21:47] <taavi> nah, we can do it later
[19:25:15] * andrewbogott increases to 4 workers while we're at it
[19:25:58] <taavi> don't think that's really necessary
[19:26:01] <taavi> 3 maybe, but 4 seems overkill
[19:26:51] <andrewbogott> we are not short on compute resources!
[19:27:09] <andrewbogott> but 3 is fine with me, I just want us to have room to maneuver
[19:30:20] <andrewbogott> can you tell why it can't push to quay? Expired token maybe?
[19:31:22] <taavi> what? where?
[19:32:14] <andrewbogott> https://github.com/toolforge/quarry/pull/77
[19:33:17] <taavi> no idea, and i don't seem to be a member of https://quay.io/organization/wikimedia-quarry
[19:33:46] <andrewbogott> me neither, I think
[19:33:56] <andrewbogott> rook, can you add us?
[19:34:35] <Rook> I have no idea if I still have access to that. Isn't it in some larger wiki group?
[19:35:15] <andrewbogott> I thought it would be but seems not
[19:36:23] <Rook> Oh, I do still have access. Let's see...
[19:39:00] <Rook> Ok andrewbogott, taavi, did yinz get an invite link or something?
[19:39:15] <andrewbogott> yes
[19:40:52] <Rook> Excellent
[19:42:26] <andrewbogott> and now I can push!
[19:44:45] <andrewbogott> Rook: What happens if I run deploy.sh? Will it delete and replace the existing deployment? Are we set up to do a proper blue/green in quarry or do you usually just delete/replace?
[19:49:09] * andrewbogott is going to find out
[19:50:22] * andrewbogott predicts that this will do nothing at all
[20:09:54] <andrewbogott> indeed
[20:10:26] <andrewbogott> so now I'm stuck on the question: Is this stateless enough that I can just delete the magnum cluster and start over? I'm pretty sure the answer is 'yes' but I don't like deciding that on my own
[20:11:06] <Rook> Yeah, it will just deploy as usual. It has the same blue/green deploy as paws. You have to set up the new cluster first for a blue/green
[20:12:11] <andrewbogott> how do I tell it to deploy to a new cluster rather than update the existing one?
[20:12:18] <Rook> I believe you can do a usual blue/green without much more than people needing to log back in. The state lives in NFS
[20:13:26] <Rook> Like paws. Duplicate the tf file that deploys the cluster and update the name. Be sure to remove the kube config from the current one
[20:14:41] <andrewbogott> 'k
[20:27:00] <andrewbogott> hm, this is going very poorly so far
[20:27:54] <andrewbogott> network name changes
[20:36:05] <andrewbogott> now the new cluster shows as create_in_progress, which seems hopeful
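
(Aside: Rook's blue/green recipe from 20:13, spelled out as a rough sequence. The file names, cluster names, and the KUBECONFIG handling are assumptions for illustration; the actual quarry repo may differ.)

    # Rough sketch of the blue/green cluster swap; names below are made up.
    cd quarry/tofu
    cp cluster-127a.tf cluster-127b.tf   # duplicate the tf file that deploys the cluster
    $EDITOR cluster-127b.tf              # give the new cluster its own name
    tofu apply                           # creates the new cluster alongside the old one

    # Fetch a kubeconfig for the new cluster from magnum and deploy into it:
    openstack coe cluster config quarry-127b --dir ./new-cluster
    KUBECONFIG=./new-cluster/config ./deploy.sh

    # Once it looks healthy, repoint quarry.wmcloud.org at the new cluster and
    # tear down the old one.
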
[21:14:17] <andrewbogott> taavi: I've deployed the new three-node cluster and pointed quarry.wmcloud.org at it. It seems... fine? If it stays fine for a day or two
[21:14:30] <andrewbogott> I'll tear down the old one and get these (minor) changes merged.
[21:14:54] <andrewbogott> bd808: I'm also interested in your opinion about the current state since you were first to notice last time Quarry broke
[21:19:35] <bd808> I noticed because I watch the Phabricator feed for cloud things.
[21:22:09] <bd808> It looks like stuff is happening at https://quarry.wmcloud.org/query/runs/all
[21:32:15] <andrewbogott> Rook: predictably, all of your deployment code worked like magic once I caught up with the new network name. I'd really appreciate it if you read through my hurriedly-written docs about blue/green deployment in https://github.com/toolforge/quarry/pull/79 and comment if any of what I'm saying sounds wrong. No rush on that though!
[21:33:36] <andrewbogott> I'm especially interested in if I have defied convention in my understanding of which is blue and which is green
[21:34:36] <andrewbogott> thank you for the after-hours work taavi!
[21:34:51] <andrewbogott> I'm going to go take a walk if things don't crash in the next 3-4 minutes