[07:09:02] morning!
[07:10:12] klausman: I think it is a performance issue hardcoded in the kubelet, maybe they don't want to risk being swapped out of memory? (without locking, the memory wouldn't be unswappable, etc.)
[07:10:35] anyway, if we want to reimage the ctrl nodes we could spend a little time trying to make Bullseye work
[07:11:01] there should be a few packages to copy to bullseye-wikimedia, and nothing more
[07:11:07] I can quickly check the puppet code
[07:11:17] so we'd be the first ones running bullseye
[07:59:14] o/ good morning!
[08:01:01] morning!
[08:04:33] so I see
[08:04:34] kubernetes-master | 1.16.15-4 | bullseye-wikimedia | component/kubernetes116 | amd64
[08:04:44] that should be what we need for the k8s master daemons
[08:05:14] yeah and puppet is updated
[08:05:24] so in theory the reimage to bullseye should just work, klausman
[09:51:02] Noted.
[09:51:20] I'll do that in a hot minute (once my brain is fully booted). Also, morning :)
[09:56:41] o/
[10:07:05] Got a bit carried away last night on my gaming group's music chat and only got to bed at 2am
[10:07:27] Turns out, I'm not 20 anymore, when doing that was nbd
[10:07:44] But now I have tea and the neurons are starting to fire
[10:08:16] elukey: so just doing the usual cumin reimage cookbook would do the trick (after setting them to boot the bullseye installer)
[10:09:24] klausman: no no, these are VMs and it is more manual, IIRC the cookbook doesn't support them
[10:09:32] but I could be wrong
[10:09:35] Ok, I'll dig on WT then
[10:09:53] I mean I just did the install a few days ago, I should remember :D
[10:10:48] https://wikitech.wikimedia.org/wiki/Ganeti#Reinstall_/_Reimage_a_VM There we go
[10:12:54] yes this works for sure
[10:13:16] one more thing, if we want bullseye - we'll need to change the DHCP config in puppet for the staging ctrl VMs
[10:13:37] Actually, they already are on bullseye
[10:14:19] ah perfect!
[10:14:24] then it is just a reimage
[10:14:42] caveat - their puppet certs need to be cleaned/signed
[10:14:55] cleaned before the reimage, signed after it
[10:15:13] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200#871
[10:15:20] basically the node generates a client cert after the first puppet run, and sends a CSR to the puppet master
[10:15:27] Yeah, the puppet cert dance I am aware of.
[10:15:32] perfect
[10:21:29] Is there an easy way to see (during the install) if the no-swap part of the partman recipe was picked up?
[10:22:01] in theory no, but I have tested the recipe in several VMs
[10:22:57] the first chance to see it is when running install_console
[10:24:31] Ah, actually, there is a way
[10:25:04] as soon as the partitioning has run, you can open one of the shells on the installer, and run `blkid`, which shows you what partitions there are.
[10:27:06] https://phabricator.wikimedia.org/P22828 <- before and after partitioning/mkfs
[10:35:17] mmm without stopping d-i?
[10:35:27] sure
[10:35:36] no idea how to do it!
[10:35:39] ctrl-a 2 will get you the second console
[10:35:46] TIL, thanks!
[10:35:47] Like a physical Alt-2 would
[10:37:04] Ok, reinstall done, certs signed, doing initial puppet runs
[10:38:24] I think the WMF netinstall image basically wraps d-i in screen(1) to enable this kind of stuff
[10:38:56] I've done _magical_ things this way with partition layouts and the like for private servers. Granted, *black* magic, but still.
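To make the blkid check above concrete, here is a minimal sketch of how one could confirm that the no-swap partman recipe took effect, assuming `blkid` is on the PATH. The script and its function names are illustrative only; in the conversation the check was simply reading blkid output on the second d-i console (ctrl-a 2).

```python
#!/usr/bin/env python3
"""Illustrative check for the no-swap partman recipe discussed above.

Assumes blkid is available; in practice the check was done by hand by
switching to the second d-i console and reading the blkid output.
"""
import subprocess
import sys


def has_swap_partition() -> bool:
    """Return True if blkid reports any block device with TYPE="swap"."""
    # blkid prints one line per device, e.g.
    #   /dev/vda1: UUID="..." TYPE="ext4" PARTUUID="..."
    out = subprocess.run(["blkid"], capture_output=True, text=True,
                         check=False).stdout
    return any('TYPE="swap"' in line for line in out.splitlines())


if __name__ == "__main__":
    if has_swap_partition():
        print("swap partition present - recipe probably not applied")
        sys.exit(1)
    print("no swap partition found - no-swap recipe looks applied")
```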
[10:46:00] Ok, all done, including a post-puppet-run reboot
[10:46:22] Now to make the patch for actually putting k8s stuff on the machines
[11:22:23] elukey: Looking at https://netbox.wikimedia.org/ipam/prefixes/377/prefixes/ - would we use a /24 each (i.e. 80, 81) for the staging service IPs and pod IPs?
[11:23:26] 77 and 78-79 are going to go away with the renumbering, so I am not sure if we should leave a gap there for the future
[11:24:14] OTOH, re-doing the IPs for the staging cluster after the above-mentioned renumbering should not be too disruptive because it's just staging.
[11:32:26] klausman: for the moment we are not ready to reuse the subnets, so I'd choose the first ones that are free
[11:32:56] Ack, that would be 80/81. Think /24 is enough for pods and services?
[11:38:57] maybe /23 for services, just in case we want to do extra testing etc.?
[11:39:04] (many revisions etc.)
[11:40:30] going out for lunch!
[11:52:52] buon appetito
[11:53:13] and yes, /23 sounds good. Will do the NB thing after lunch
[14:01:39] elukey: akosiaris mentioned that we could just use a slice out of the /16 Arzhel signed off on and keep all our IP ranges "together" that way
[14:02:02] (from T302701)
[14:04:54] could be an option yes
[14:08:54] I'm a bit confused about the /16 though. It sounds like they want us to pick one /18 inside of it and then slice that up however we want. Am I reading that right?
[14:09:33] I think so yes, more than a /18 would mean a ton of IPs
[14:11:09] So a /18 is 4 /20s or 8 /21s. We could use the first /20 and the /21 after it for prod, and the last /23 and the preceding /24 for staging. We'll only be able to dodge fragmentation for so long, though.
[14:18:23] https://phabricator.wikimedia.org/P22831 basically like this
[14:19:46] (I'll check in a bit)
[14:19:55] disregard, there be a mistake there
[14:22:50] and fixed
[14:23:02] the other thing to keep in mind is that we'll soon have the train/dse cluster to build, which will be shared with Data Engineering
[14:23:17] so it would be great if we find some space for it as well
[14:23:35] Yeah, I think there would still be plenty of space. Would it be in codfw or in eqiad?
[14:24:47] eqiad
[14:25:32] Would we allocate prod-sized ranges or staging-sized ones? I'd presume the former?
[14:26:11] yes definitely, we'll need to run full Kubeflow on it
[14:26:19] plus possibly other DE-related things
[14:26:24] Ok, I'll update the paste with a plan for that
[14:33:16] And updated, including a rationale
[15:06:18] reviewing the paste :)
[15:11:06] calculating subnet ranges on a Friday afternoon is not the best
[15:15:38] klausman: overall it seems good, maybe try to implement it in Netbox's IPAM so we can ask the netops/serviceops people for a final +1?
[15:17:01] Does it have a review mode? I thought allocation edits were immediately active
[15:17:54] (calculating IP ranges: use ipcalc or ipcalc-ng :))
[15:18:38] sure, but it doesn't really trigger any side effects, it just allocates IPs in a dedicated /18 in Netbox
[15:18:56] Ah, sure, I can do that
[15:19:02] super
[15:27:45] https://netbox.wikimedia.org/ipam/prefixes/530/prefixes/ and https://netbox.wikimedia.org/ipam/prefixes/535/prefixes/
[15:31:15] klausman: looks good! Let's wait for feedback in the task!
[15:31:15] Also added to T302701
[16:00:25] ah I just noticed that we lost wikibugs (so no phab updates in here)
[16:04:33] Is it a bad sign that I didn't notice?
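To make the /18 arithmetic from the 14:11 message concrete, here is a small sketch using Python's stdlib ipaddress module (instead of ipcalc). The 10.0.0.0/18 prefix is a placeholder, not the actual allocation from T302701, and which prod range serves pods versus services is an assumption; only the "/23 for staging services" choice is stated in the chat.

```python
#!/usr/bin/env python3
"""Sketch of carving a /18 into the ranges discussed around 14:11.

10.0.0.0/18 is a placeholder, not the real allocation from T302701;
the point is only the subnet arithmetic.
"""
import ipaddress

supernet = ipaddress.ip_network("10.0.0.0/18")

# A /18 holds 4 /20s or 8 /21s, as noted in the chat.
s20 = list(supernet.subnets(new_prefix=20))
s21 = list(supernet.subnets(new_prefix=21))
assert len(s20) == 4 and len(s21) == 8

# Prod: the first /20 plus the /21 immediately after it
# (the pod/service assignment of these two is not fixed in the chat).
prod_large = s20[0]          # first /20
prod_small = s21[2]          # the /21 right after that /20

# Staging: the last /23, plus the /24 just before it.
s23 = list(supernet.subnets(new_prefix=23))
s24 = list(supernet.subnets(new_prefix=24))
staging_services = s23[-1]   # /23 for services, per the chat
staging_pods = s24[-3]       # the /24 preceding the last /23 (assumed pods)

for name, net in [("prod /20", prod_large), ("prod /21", prod_small),
                  ("staging /23", staging_services),
                  ("staging /24", staging_pods)]:
    print(f"{name:12s} {net}  ({net.num_addresses} addresses)")
```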
[16:07:01] no no, I was wondering why I didn't see my last phab update, and I noticed that the bot is down everywhere
[16:07:14] no idea how to respawn it, there is documentation but I've never done it
[16:20:46] Morning all! Today is a US holiday but I'm around if you need anything. Actually I'm just sitting here doing my taxes
[16:23:02] morning!
[16:29:36] \o Heya Chris
[16:29:55] I need to do my 2021 taxes soon, but... motivation
[16:40:48] started https://gerrit.wikimedia.org/r/c/operations/puppet/+/771947 - the preliminary work on ORES to support buster
[16:58:39] lmk if/when you want a review on that
[17:02:51] Moritz is going to add python3-scipy to the component, after that I think we can spin up a cloud VM and test it
[17:03:02] (something like the beta VM that we have)
[17:09:34] :+1:
[18:15:05] * elukey afk!
[18:15:08] have a good weekend folks!