[00:44:42] * bd808 off
[07:59:01] this is great https://kubernetes.io/blog/2024/04/17/kubernetes-v1-30-release/
[08:00:20] UwU
[08:16:15] seems like an easy upgrade for us (when we get there anyway)
[08:31:12] uwu good mwoning sirs
[08:31:30] https://www.irccloud.com/pastebin/pFNTL4Mm/
[08:38:09] when you have an "upgrade to UwU" task, you don't want to postpone it :D
[08:45:35] kubectl should be renamed to uwu: `uwu get pods`
[08:50:41] please approve https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/262
[08:51:09] approved
[08:51:38] thanks
[08:53:11] hmm, why are we leaking fullstackd instance dns records?
[08:53:23] blancadesal: and kubecon to uwucon? :D
[08:53:51] kuwucon?
[08:53:55] xd
[08:58:39] taavi: I also noticed a DNS alert yesterday (assuming it was the same thing), but now I don't see any?
[08:59:44] dhinus: if you scroll up in the prometheus-node-textfile-wmcs-dnsleaks.service service logs you'll see the previous run did flag a fullstackd instance. now I don't know if that was just a delayed automatic delete or whether someone manually deleted it
[09:06:46] * dhinus is happy that logstash is finally using the idp auth instead of basic auth
[09:07:16] I ran the dns leak cleanup script yesterday
[09:07:21] I couldn't find that logline in logstash for some reason, but I found it with journald
[09:09:15] in logstash I also found that "puppetserver ca clean" failed around the same time with "remote host identification has changed" for host fullstackd-20240418081442.admin-monitoring.eqiad.wmflabs
[09:09:30] taavi: any idea why this could be happening? https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/262#note_79201
[09:09:45] which is odd because I don't expect we reuse a hostname like that one, with a datetime in it?
[09:09:49] taavi: from this change https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/77/diffs
[09:09:59] old version (or new version) of toolforge-weld
[09:10:57] right
[09:11:01] arturo: ApiData is new in toolforge-weld 1.5.0, but the poetry lock in there only has 1.4.0
[09:11:05] we should probably pin it to a minimum/maximum, and create a pipeline to create an MR to update it on all the projects that need it
[09:11:12] now the more interesting question is why CI did not fail with that
[09:11:17] ^ /me got excited with pipelines last week
[09:11:53] the mypy pre-commit hook should probably run with the versions of dependencies in the poetry lockfile and not just the latest as it does now
[09:12:34] please review https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/78
[09:12:49] taavi: that's an issue yes
[09:12:58] arturo: -1, please run `poetry update` afterwards
[09:13:17] or `poetry add toolforge-weld@^1.5.0` or similar
[09:14:14] ack
[09:14:46] taavi: updated the MR
[09:16:18] re: the "puppetserver ca clean" failure, it seems unrelated to the DNS leak, because it is logged every day and not just today
[09:18:40] dcaro: thanks
[09:22:32] np 👍
[09:24:27] I really like how k8s did not deploy the new pods because they were crashlooping, and left the api online with the older pods
[09:25:37] :)
[09:27:26] do we set `--wait` in the helmfile command? that should help detect when that happens
[09:40:57] I think we do not
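As a side note on the `poetry add toolforge-weld@^1.5.0` suggestion at [09:13:17]: in Poetry, a caret constraint like `^1.5.0` is shorthand for ">=1.5.0,<2.0.0". Below is a minimal sketch of what that constraint accepts, using the `packaging` library; the version numbers are the ones mentioned in the discussion, and nothing here is copied from the actual jobs-api pyproject.toml.

```python
# Sketch of what Poetry's caret constraint ^1.5.0 means, using `packaging`.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

caret_1_5 = SpecifierSet(">=1.5.0,<2.0.0")  # equivalent of ^1.5.0

print(Version("1.4.0") in caret_1_5)  # False: the locked version that lacks ApiData
print(Version("1.5.0") in caret_1_5)  # True: first version with ApiData
print(Version("2.0.0") in caret_1_5)  # False: a new major needs an explicit bump
```

Changing the constraint in pyproject.toml only helps once the lockfile is regenerated, which is what the `poetry update` request above is about.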
[09:45:59] dcaro: will you be at collab? I might need some help with the oapi-server liveness probe/healthz thing. I skipped it for now to incorporate the merging logic
[09:46:13] blancadesal: I can make it yes
[09:50:27] please approve https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/263
[09:58:04] approved
[10:01:55] thanks
[10:02:04] BTW I'm still having hardware problems :-(
[10:02:24] they replaced the graphics card of my laptop, but apparently it's not enough
[10:03:19] oh no :(
[10:06:57] I think they are going to replace the motherboard next
[10:07:45] if they keep replacing parts, it will eventually be a fully new laptop :-P
[10:08:30] Spanish Lenovo consumer hardware support seems more efficient than Dell :-P
[10:09:21] https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/37
[10:09:33] ^ scheduled pipeline to update poetry deps
[10:09:53] oooh yes xd
[10:12:41] dcaro: LGTM
[10:12:47] no newline at the end of file though
[10:13:11] we don't have pre-commit there or something? I'll add that too
[10:13:52] btw. /me curious, what's the issue with the lack of newline? as in, why is that a best practice?
[10:14:48] I believe it's from the POSIX standard
[10:16:31] https://stackoverflow.com/questions/729692/why-should-text-files-end-with-a-newline has some additional context
[10:16:31] Yeah, I think it's historical reasons, and tools like awk, sed, diff etc. are designed to expect a newline
[10:18:14] found https://www.baeldung.com/linux/files-end-with-newlines#when-it-matters-and-why
[10:18:22] same xd
[10:28:18] this adds pre-commit (it also required a couple of fixes) https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/38
[10:29:39] we got 3 poetry autoupdate mrs \o/ https://gitlab.wikimedia.org/groups/repos/cloud/-/merge_requests?scope=all&state=opened&label_name[]=Needs%20review&approved_by_usernames[]=None
[10:29:43] (running it once a month)
[10:30:39] it seems that you can't choose the pipeline to run directly on the scheduled pipeline, it will run everything, so we might want to use some variable to choose one or the other if running both autoupdate MR scripts feels like too much at once
[11:37:35] I just created another decision request ticket: T362872
[11:37:36] T362872: Decision Request - Toolforge policy agent enforcement model - https://phabricator.wikimedia.org/T362872
[11:41:26] 👍
[11:41:35] quick review https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/38 (adding pre-commit to the cicd repo)
[11:42:22] LGTM
[13:06:49] andrewbogott: do we have an unpuppetized debian image in codfw1dev?
[13:08:32] I'm not sure, I'll check in a bit
[13:09:02] context: messing with openvswitch, I need something that doesn't complain about the lack of metadata/internet connectivity during boot
[13:16:47] hm, I had a single VM schedule properly but now I'm back at the `Stderr: iptables-restore v1.8.9 (nf_tables): interface name `74ab55ca-0cb1-4669-998c-3c86912a3e32' must be shorter than IFNAMSIZ (15)` issue I was at previously
[13:35:27] Oh man, my laptop just snow crashed! Except instead of the whole screen it was just the mouse cursor buffer, so for a few seconds I had a little 1" patch of snow that I could drag around the screen until the whole system locked up
[13:37:19] taavi: bullseye-raw is unpuppetized but it's pretty old, I'll upload a modern bookworm image
[13:49:46] taavi: now there's debian-12.0-nopuppet
[13:53:10] andrewbogott: thanks! in the meantime I found the issue with security groups I think
[13:57:24] great! We needed that image in codfw1dev anyway.
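Going back to the "why is a trailing newline a best practice?" question at [10:13:52]: POSIX defines a line as text terminated by a newline, so line-oriented tools (wc, diff, sed, shell read loops) can drop or misreport a final unterminated line. Below is a minimal sketch of the kind of check a pre-commit hook runs for this; it is only an illustration, since pre-commit's built-in `end-of-file-fixer` hook is what actually handles (and fixes) it.

```python
# check_trailing_newline.py: illustrative only; pre-commit's end-of-file-fixer
# hook does this check (and the fix) for real.
import sys
from pathlib import Path


def ends_with_newline(path: Path) -> bool:
    data = path.read_bytes()
    # Empty files are fine; a non-empty text file should end with b"\n",
    # which is what POSIX means by every line ending with a newline.
    return not data or data.endswith(b"\n")


def main(argv: list[str]) -> int:
    bad = [p for p in map(Path, argv) if not ends_with_newline(p)]
    for p in bad:
        print(f"{p}: no newline at end of file")
    return 1 if bad else 0


if __name__ == "__main__":
    # pre-commit passes the staged file names as arguments.
    sys.exit(main(sys.argv[1:]))
```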
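On the iptables-restore error pasted at [13:16:47]: Linux limits interface names to IFNAMSIZ (15 usable characters, as the error itself says), so a full 36-character Neutron port UUID can never be handed to the kernel as a device name. Below is a rough sketch of the length problem; the "tap" prefix and the truncation length are assumptions about the usual Neutron-style naming, not taken from the driver code involved here.

```python
# Why the raw port UUID is rejected: it is more than twice the kernel limit
# quoted in the iptables-restore error above.
IFNAMSIZ_USABLE = 15  # usable characters, per the error message

port_id = "74ab55ca-0cb1-4669-998c-3c86912a3e32"  # UUID from the pasted error
print(len(port_id), len(port_id) <= IFNAMSIZ_USABLE)  # 36, False

# What a Neutron-style driver normally generates instead: a short prefix plus
# a truncated slice of the UUID (prefix and length are illustrative).
tap_name = ("tap" + port_id)[: IFNAMSIZ_USABLE - 1]
print(tap_name, len(tap_name))  # 'tap74ab55ca-0c', 14
```

As the later follow-up at [15:53:27] notes, the actual fix turned out to be a config/driver setting; the snippet only shows why the unshortened UUID fails.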
[13:57:57] andrewbogott: do you know if it's possible to log in on the serial console for that image? I'm getting a password prompt :/
[13:58:26] you need to launch with a keypair, the pub key will get associated with username 'debian'
[13:58:51] (using the horizon keypair dialog which you are likely in the habit of ignoring as I am)
[13:59:00] the issue is that I don't have proper networking connectivity on these instances yet
[13:59:07] aaah I see
[13:59:21] um... I don't know if it comes with a default console password.
[13:59:59] quick google suggests that they don't have a default password, just cloud-init
[14:00:08] ok, I'll figure out something. getting some network connectivity on it is next up on the list regardless
[14:00:28] I can upload a 'nocloud' image just for you which will have a default password
[14:00:37] want me to do that or are you already past this problem?
[14:00:53] if that's not too much effort, yes please
[14:01:12] I can't promise that it'll work but it's easy to try
[14:01:16] what project are you using?
[14:01:39] `taavitestproject`
[14:02:27] the magic command to create an instance with all the required network settings is this: https://phabricator.wikimedia.org/P60933
[14:02:57] and the firewall fix is this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021484
[14:08:02] taavi: ok, now you can use debian-12.0-nocloud, login as 'root', no password required.
[14:11:11] andrewbogott: thanks! I'll go eat something and will try it after that
[14:11:26] sounds good, I'm about to go eat breakfast too
[14:52:58] blancadesal: an asynchronous alternative to python requests might be https://www.python-httpx.org/, my partner uses it for their work
[14:53:04] they seem happy with it
[14:58:21] dcaro: good idea! thanks :)
[15:00:25] dcaro: can you follow up on T361366? I just want to make sure that three big osd nodes are what you need to test what you want to test in codfw1dev
[15:02:34] andrewbogott: ack, looking
[15:46:04] andrewbogott: I'm a bit confused about the task sorry xd
[15:46:14] is that for 3 big servers on eqiad1?
[15:46:48] (8 drives for osds of ~3.8T each looks like)
[15:47:18] there's definitely too much info.
[15:47:49] I was going to order two replacement osd nodes for codfw1dev and you asked me to get more so you could test HA and other real world things in codfw1dev
[15:47:52] https://maelvls.dev/docker-proxy-registry-kind/ <-- an idea about caching with kind
[15:47:59] (for lima-kilo)
[15:48:18] So now the order is for 3 of the same nodes we're getting for eqiad. My question is if that's the right call, or if we should get e.g. more smaller servers.
[15:48:43] That's clearly a lot more space than we'll actually use but it seems nice to have them the same as in eqiad1
[15:49:41] I see, so those servers will end up in codfw?
[15:50:22] hmmmm
[15:50:26] arturo: FYI https://phabricator.wikimedia.org/T358761#9727060
[15:50:34] I clearly linked the wrong task! Hang on while I dig
[15:50:48] xd
[15:51:18] the comment in the task though seems to indicate that those are meant to be in codfw
[15:51:23] taavi: thanks. Amazing to read in 1 minute a summary of your many days of research and hard work :-P
[15:52:06] yeah, I'm confused or the task title is confused
[15:52:56] ok, it's me who is confused... we do have procurement coming up for codfw1dev but not until 2025.
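A small sketch of what the python-httpx suggestion at [14:52:58] could look like in practice; the URLs and request count are placeholders for illustration, not anything from an actual Toolforge service.

```python
import asyncio

import httpx


async def fetch_all(urls: list[str]) -> list[httpx.Response]:
    # AsyncClient is roughly the async counterpart of requests.Session:
    # pooled connections and shared defaults, but awaitable, so slow
    # requests overlap instead of running one after another.
    async with httpx.AsyncClient(timeout=10.0) as client:
        return list(await asyncio.gather(*(client.get(u) for u in urls)))


if __name__ == "__main__":
    urls = [f"https://example.org/item/{i}" for i in range(5)]  # placeholders
    for resp in asyncio.run(fetch_all(urls)):
        print(resp.status_code, resp.url)
```

httpx also keeps a requests-like synchronous API (httpx.Client, httpx.get), so code can move over gradually.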
[15:53:08] in any case, for codfw we need either 1 or 5 hosts (for 4 HA zones), and they can be way way smaller (2 OS + 2 OSD hard drives of <1T is enough)
[15:53:17] shoot :/
[15:53:27] arturo: thanks :D it took me an embarrassingly long time and way too much neutron code reading to figure out it was a single config setting that I had not updated
[15:53:28] So I guess now I have a different question, which is: is refreshing only three of those old eqiad nodes going to be a problem?
[15:53:45] I think it's three because that's how many we started with for our initial pilot so they're aging out together.
[15:54:02] taavi: honestly, upstream could add an "if" statement somewhere to at least log something if the wrong driver is used
[15:54:36] andrewbogott: it's going to be ok, it would be better to have 4 to grow all the HA zones at the same time, but as long as we keep an eye on the imbalance having a bit of it should be ok
[15:54:46] ok
[15:55:00] taavi: this category of things (what would be the name?) is the kind of thing that I dislike about openstack
[15:55:21] dcaro: I commented on the task, sorry for the incoherent ping
[15:55:37] not being robust and explicit regarding configuration combinations, I guess
[15:55:56] maybe that's related to the huge number of combinations that can exist
[15:56:29] andrewbogott: the task seems to say that they have 1 network connection only though
[15:57:15] arturo: isn't k8s an even wider possible combination of different swappable components and configs? Or has the k8s world gravitated to a small known set of practices?
[15:57:19] dcaro: looking
[15:57:49] andrewbogott: yes, but it's way more robust when reporting invalid config combinations in my opinion
[15:57:59] I can believe that
[15:58:02] I guess that in k8s every component is smaller, so its individual config is simpler
[15:58:19] and also, way more robust in reporting what is an alpha/beta/stable feature
[15:58:37] yeah, that's for sure
[15:58:41] they have a clear api deprecation procedure yes
[15:59:10] dcaro: are you talking about "# of Connections:1/2 - Speed:1G/10G."? That's just the task waiting for me to fill in details
[15:59:36] andrewbogott: yes, as one of the options is bolded, I thought it was chosen already (I think it's bolded? not clear)
[15:59:51] https://usercontent.irccloud-cdn.com/file/lUTjm9W5/image.png
[15:59:59] I see that too, don't know what it's about, but I'm about to copy racking details from https://phabricator.wikimedia.org/T351332
[16:00:05] 👍
[16:01:55] thanks for taking care of all that btw.
[16:02:20] sure thing. Still bullseye right?
[16:02:52] yep, I really hope to start upgrading stuff soon, I don't think we can wait for Dell to sort things out first
[16:03:11] s/can/should
[16:04:23] ok, I updated racking details, you can check my work if you like.
[16:04:37] Things will be unbalanced for a bit since 100[1-3] are all in the same rack
[16:05:31] that should be ok
[16:06:41] hmm, yep, that's not so nice actually
[16:06:54] it would not only be imbalanced, but also unused
[16:07:16] as ceph would not be able to use the extra space in one single rack, we should try to spread them across 3 racks
[16:07:31] andrewbogott: how much time until we get the next round of new servers?
[16:08:08] * arturo offline
[16:09:42] dcaro: We're getting four soon (probably at the same time as those three we were just discussing)
[16:09:51] Then a whole lot more in Q2.
[16:10:33] andrewbogott: ack, are those replacements also?
(/me thinking of trying to spread them around so they don't sit unused for long)
[16:12:05] these? https://phabricator.wikimedia.org/T351332
[16:12:30] Despite the task title, those are expansion and not replacement
[16:13:17] So... this is messy. Right now: 3 replacements, 4 new. Q2, four more new, Q4, a bunch more replacements.
[16:13:36] hahaha, okok, nice, I want to keep an eye on the imbalance
[16:13:38] At each of these stages we'll probably get fewer but larger servers than we're currently budgeted for.
[16:13:50] so the 4 new can be racked anywhere right?
[16:13:54] Because the costs work out better that way and it's more storage anyway.
[16:14:13] Yep! Right now I have the request for one in each rack, but if we want to balance out with those three replacements...
[16:14:23] yep, that's what I was thinking :)
[16:14:25] well, we can put all 7 of them wherever.
[16:14:58] we are losing 3 hosts on E4 right? (the 3 being replaced)
[16:15:02] Right.
[16:15:12] hmpf... I need some maths xd
[16:15:30] Are we trying to balance # of nodes, # of drives, or capacity?
[16:15:58] capacity
[16:16:17] as the drives are not the same size (and thus the nodes have different capacity)
[16:16:21] right
[16:16:42] so we lose 1.7T*8*3 in one rack
[16:16:50] so there's no way to really balance with 7 coming in. We want for sure 2 in E3, one each in F4, C8, D5
[16:16:58] and that leaves us with 2 more to put someplace
[16:17:42] we will lose some capacity due to not being exact multiples yes, that's what I want to minimize
[16:19:01] okok, interesting, so we can replace the 3 hosts on E4 with 1.5 new hosts xd
[16:19:15] (12 3.82T hard drives)
[16:19:16] yep, it's 2:1 so the math is pretty easy
[16:19:29] except for... modular math with a prime number of servers :p
[16:19:30] that's nice
[16:19:36] yep xd
[16:20:07] so yes, we add 3 on E3, 2 on F4, 1 on C8, 1 on D5
[16:20:25] yeah, and then try to remember when the new order comes in in Q2
[16:20:29] yep
[16:20:47] next goes in C8 -> D5 -> ....
[16:20:53] well, we have to take into account replacements though
[16:20:54] :/
[16:21:22] but until Q4 we don't have any, so we can rethink then
[16:21:45] (and shuffle some hosts around if needed)
[16:22:15] We can also see about switching to supermicro for Q2/Q4 which means it's unpredictable how many servers we'll get.
[16:22:28] xd
[16:22:35] less to worry about then \o/
[16:22:37] hahaha
[16:22:42] yep
[16:23:04] Anyway, I updated the racking tasks so we'll get 3, 2, 1, 1. And we'll have the conversation again in a few months.
[16:23:14] * dcaro sweating a little bit
[16:23:26] sounds good to me yes 👍 thanks
[16:23:42] Mostly the trend is good, we're getting fewer, bigger drives which helps our rack space problems a LOT
[16:24:18] although I guess it means that the roundoff for HA is bigger, that way?
[16:24:24] * andrewbogott going to stop thinking about this for now!
[16:24:30] not really, as HA is per-rack
[16:24:40] (well, if a host dies yes, more data will need shuffling)
[16:25:03] but rack-wise the capacity is the same
[16:25:12] (kinda, a bit higher overall)
[16:25:15] ok, that's good
[16:26:01] there are even bigger SSDs available now so with price changes + a new vendor in the mix all bets are off for next year :)
[16:27:17] keeping it interesting ;)
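Writing out the capacity arithmetic from the rack-balancing discussion above: the drive counts and sizes (8 OSD drives per host, 1.7T on the outgoing hosts, 3.82T on the incoming ones) are the figures quoted in the conversation, and the raw terabytes ignore Ceph replication and overhead.

```python
# Capacity figures quoted above: 8 x 1.7T per outgoing host, 8 x 3.82T per
# incoming host, with the three outgoing hosts all in rack E4.
old_host_tb = 8 * 1.7    # 13.6T per outgoing host
new_host_tb = 8 * 3.82   # 30.56T per incoming host

lost_in_e4_tb = 3 * old_host_tb
print(lost_in_e4_tb)                # 40.8T leaving E4 ("1.7T*8*3")

print(new_host_tb / old_host_tb)    # ~2.25, the "roughly 2:1" drive-size ratio
print(lost_in_e4_tb / new_host_tb)  # ~1.33, i.e. the "1.5 new hosts" (12 drives)
                                    # once rounded up to a half host

# Placement of the 7 incoming hosts agreed on above, to keep per-rack
# capacity roughly even for the rack-level HA/replication rules:
placement = {"E3": 3, "F4": 2, "C8": 1, "D5": 1}
print({rack: n * new_host_tb for rack, n in placement.items()})
```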
[16:55:42] I got a proposal page for the coding styles decision: https://phabricator.wikimedia.org/T361804#9727360 I'll leave it there for a bit; if anyone has any comments, please add them there/edit it directly
[17:07:02] * dcaro off
[17:26:10] Anyone know what's up with the 'new file: hieradata/role/common/idmcloud.yaml'
[17:26:16] on the toolforge puppetserver?
[17:28:34] anything in the git-sync-upstream service logs?
[17:28:39] sounds like something went wrong during rebase
[17:28:57] yes, probably. Should I just remove the file to get a clean sync?
[17:29:29] the log error is
[17:29:31] https://www.irccloud.com/pastebin/EyZ4Ln3w/
[17:29:46] which I can't make a lot of sense out of
[17:30:53] I mean, I understand it, but it's clearly a symptom and not the cause
[17:33:17] taavi: did you fix something or did it heal on its own?
[17:33:22] no
[17:34:27] huh
[17:34:34] welp
[17:34:42] the cron must've re-run and fixed whatever it was
[17:51:49] * bd808 lunch