[07:11:23] bd80.8: nice :), we should definitely have it on the radar, and help with it
[07:17:12] artur.o: I have returned to using terminator xd
[07:28:36] :-)
[07:52:51] it's pretty easy to create url handlers :)
[07:52:53] https://usercontent.irccloud-cdn.com/file/5FPOvlm5/image.png
[07:53:29] https://www.irccloud.com/pastebin/Q8L2jryh/
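For context, a minimal sketch of what such a Terminator URL-handler plugin can look like, assuming the standard terminatorlib plugin API; the Phabricator-task pattern and all names here are made up for illustration, not the plugin from the pastebin above:

```python
# Drop into ~/.config/terminator/plugins/ and enable in preferences.
# Hypothetical example: make bare Phabricator task IDs (e.g. T368516)
# clickable links in the terminal.
import terminatorlib.plugin as plugin

AVAILABLE = ['PhabTaskHandler']  # Terminator discovers plugin classes via this list

class PhabTaskHandler(plugin.URLHandler):
    capabilities = ['url_handler']
    handler_name = 'phab_task'
    match = r'\bT[0-9]{4,7}\b'  # regex Terminator scans terminal text with

    def callback(self, url):
        # 'url' is the matched text; return the full URL to open on click
        return 'https://phabricator.wikimedia.org/' + url
```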
[07:55:06] the stuck procs check is flapping on tools-k8s-worker-nfs-29, does anyone want to have a closer look or should I just reboot?
[07:56:18] it went down already, is it recovering?
[07:57:39] oh wow, I was not seeing it in the graph because the color for it that was auto-selected was dark brown on a black background xd
[07:58:03] it looks steadily high yep, probably something got stuck
[07:58:48] since yesterday morning more or less (been piling up D processes)
[07:59:38] anyhow, feel free to reboot
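A minimal sketch of what a "stuck procs" check can look at: counting processes in uninterruptible sleep (state 'D', typically blocked on NFS or disk I/O). This is an illustration, not the actual check running on the workers:

```python
from pathlib import Path

def d_state_pids():
    """Return PIDs of processes currently in uninterruptible sleep."""
    pids = []
    for stat in Path('/proc').glob('[0-9]*/stat'):
        try:
            # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain
            # spaces, so split on the last ')' to find the state field
            state = stat.read_text().rsplit(')', 1)[1].split()[0]
        except (OSError, IndexError):
            continue  # the process exited while we were reading
        if state == 'D':
            pids.append(int(stat.parent.name))
    return pids

print(f"{len(d_state_pids())} processes in D state")
```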
[08:28:20] dcaro: I might copy that plugin!
[08:29:34] feel free :), and share any you might come up with too! ;)
[08:30:20] I'm using layouts a lot lately:
[08:30:22] https://usercontent.irccloud-cdn.com/file/Fwv1ZEjA/image.png
[08:32:00] for example the reimage ones open a couple of ssh sessions to cumin servers AND open the mgmt password
[08:41:35] I'm seeing policyReports similar to yesterday's, but related to runAsGroup this time
[08:49:23] hmm, maybe it just stops at the first issue, so getting rid of one just lets it get to the next?
[08:49:41] blancadesal: have you deployed the 'tool in the path only' version of the envvars on toolsbeta?
[08:50:34] (/me getting things like Jun 26 08:48:19 toolsbeta-nfs-3 uwsgi-toolsdb-replica-cnf-web[246473]: requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://api.svc.toolsbeta.eqiad1.wikimedia.cloud:30003/envvars/v1/envvar/TOOL_TOOLSDB_USER)
[08:51:01] dcaro: if you mean https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/33, then no
[08:51:19] yep that, okok, looking
[08:55:11] oooohhhh, some change in toolforge-weld broke the replicacnf code (it expects to catch an HTTPError when an envvar is not there)
[08:55:32] this might not be breaking in tools because we might not upgrade the toolforge-weld package there
[08:55:53] I think I know what is happening with the policyReports
[08:56:33] when we create a policy for Pods, kyverno autogenerates policies for Deployments and other pod-generators
[08:56:44] wait no, scratch that, looking
[08:57:03] (for the replicacnf)
[08:59:20] blancadesal: hm... I think that the post endpoint for envvars might already be broken
[09:00:15] wait no, the issue is that the interface changed
[09:00:38] now to create envvars you need to pass `{"value": ..., "name": ...}` and not put the name on the path :/
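A hedged sketch of the breaking change being described: the envvar name used to go in the URL path, and 0.0.50 moves it into the request body. The endpoint shapes are inferred from the 404 URL quoted above, not from the envvars-api source, and the payloads are illustrative; these calls only work from inside the project network:

```python
import requests

API = 'https://api.svc.toolsbeta.eqiad1.wikimedia.cloud:30003/envvars/v1'

# old interface (0.0.49): name in the path (assumed shape)
requests.post(f'{API}/envvar/TOOL_TOOLSDB_USER', json={'value': 'example'})

# new interface (0.0.50): name in the body, per the message above
requests.post(f'{API}/envvar', json={'name': 'TOOL_TOOLSDB_USER', 'value': 'example'})

# what the replica_cnf code relies on when an envvar is not set:
resp = requests.get(f'{API}/envvar/TOOL_TOOLSDB_USER')
try:
    resp.raise_for_status()  # raises requests.exceptions.HTTPError on a 404
except requests.exceptions.HTTPError:
    pass  # the "not set yet" path that the toolforge-weld change broke
```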
[09:06:32] dcaro: do you mean maintain-dbusers should have been updated at some earlier change of the envvars service but wasn't?
[09:06:54] dcaro: could you please +1 here? https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/46
[09:07:01] blancadesal: I think so, looking in tools, it might have been broken for a while
[09:08:45] thanks!
[09:12:41] blancadesal: hmm... on tools using the envvar name in the path works :/
[09:14:20] possibly stupid question: what's calling this in toolsbeta? I thought toolsbeta users don't have replica access
[09:14:48] taavi: nothing, just the tests
[09:15:19] blancadesal: in toolsbeta we have version 0.0.50, in tools 0.0.49 of envvars-api
[09:16:27] yep, the difference is https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/commit/42829b67a8c43dd119ec4ed247bacd964ffc5b01
[09:16:37] that's the commit that removes the endpoints I think
[09:17:14] heads up, setting kyverno policies in tools to Enforce
[09:18:11] dcaro: not sure why raymond didn't deploy that in tools. also, I think I remember there was some issue with that patch to begin with. It was rolled back the first time
[09:19:00] related: T367961
[09:19:00] T367961: envvars-api 0.0.50 depends on unreleased envvars-cli changes - https://phabricator.wikimedia.org/T367961
[09:19:56] I see, yep, that's what the replicacnf is having on toolsbeta, it needs changing
[09:20:51] toolforge-deploy says 50 is the version, I'll revert that while we fix the replica_cnf code to work with the new endpoints
[09:23:07] re-opened the task
[09:23:46] oh, just created a new one xd
[09:23:49] it's slightly different
[09:23:50] T368516
[09:23:50] T368516: [envvars-api] version 0.0.50 introduces breaking changes that need adapting for replica_cnf service - https://phabricator.wikimedia.org/T368516
[09:23:58] happy to merge if you prefer
[09:24:08] revert mr https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/352
[09:24:15] those are two slightly different bugs I think
[09:24:34] so I marked yours as a subtask since that needs to be fixed before rolling forward
[09:25:10] ack
[09:25:42] xd
[09:25:47] blancadesal: do you want to work on this with me?
[09:26:11] (feel free to say no)
[09:26:29] dcaro: yes :)
[09:26:48] dhinus: is T368066 on your radar? just needs the add-wiki cookbook to be run, I saw someone ping about that on -data-persistence
[09:26:49] T368066: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066
[09:27:02] 👍 we need to make some changes to the replica_cnf, can be rolled up with the changes to use the `tools` in the url too
[09:27:08] mmmm
[09:27:11] dcaro: I'd like to finish up a couple things I was doing first. do you want to pair after lunch?
[09:27:18] now we have functional tests in toolsbeta so that's nice to test :)
[09:27:28] blancadesal: sure, works for me
[09:27:28] just noticed my announcement email about today's operation never made it to the mailing list because it required approval?
[09:27:37] arturo: I approved it yesterday
[09:27:41] arturo: I got it
[09:27:45] oh, ok, thanks
[09:27:49] arturo: I got it too yep
[09:27:52] I think we should figure out my permissions
[09:28:00] and I think I fixed the permissions to not hold your messages again
[09:28:09] taavi: ok, thank you
[09:28:10] but it's certainly visible at https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/
[09:28:56] great
[09:29:13] dcaro: sent a meeting invite
[09:29:21] blancadesal: thanks!
[09:29:30] 👍
[09:39:19] arturo: maintain-kubeusers hang alert fired, are you doing anything with it?
[09:46:40] oh, it's updating all the policies
[09:48:05] yeah
[09:48:08] it is kind of expected
[09:48:29] run just finished
[09:48:30] 2024-06-26T09:47:41.454079741Z finished run, reconciled 22 admins, 3311 tool accounts
[09:49:20] taavi: no I didn't see T368066, but I randomly noticed there was a new wiki because it triggered the "private data" alert a few days ago
[09:49:20] T368066: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066
[09:49:35] dcaro: taavi: any concern with me roll-rebooting k8s workers to try to surface pod policy problems?
[09:49:56] arturo: you can first try with a single tool you control
[09:50:13] taavi: I think I can claim T368066 and run the cookbook. thanks for the pointer
[09:50:13] to avoid breaking many
[09:50:14] yeah, let's start smaller scale please. individual tools, maybe an individual worker, etc
[09:50:17] (in the worst case)
[09:50:42] ok
[09:51:06] you can use wm-lol, if you want (I have that one, not sure if it has a continuous job though, it has a webservice)
[09:53:45] I will rollout-restart the pods for openstack-browser
[09:53:55] the deployment resource was created from an old webservice-cli version
[09:56:32] done, no problems
[09:56:50] this confirms that new pods generated by old deployments can work with policies in Enforcement mode
[09:57:57] to reduce the noise of the PolicyReports, I would like to merge this
[09:57:58] https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/51
[10:03:13] dcaro: could you please approve?
[10:04:11] will that allow creating bad deployments though?
[10:04:54] kyverno will not enforce anything on the deployments
[10:05:40] don't we want that?
[10:05:47] it should not matter for deployments created via toolforge-jobs or webservice-cli
[10:05:56] because they already contain the right pod templates
[10:06:14] (not taking bugs into account I guess)
[10:07:11] kyverno will still enforce pods normally. Not having the autogen rules is actually more closely resembling what PSP was doing
[10:08:27] so any deployment that is not passing the validation, will generate failing pods?
[10:08:52] (but the users will only notice when the deployment generates the pods, not when trying to create/change the deployment?)
[10:08:53] no, because kyverno is mutating pods resources, then validating them
[10:10:51] hmm, that's not clear to me, so there's no way a deployment can generate a pod that does not pass the validation after the mutation itself?
[10:11:08] correct
[10:11:13] this is all a bit redundant
[10:11:34] I corrected the templates on toolforge-jobs/webservice-cli to generate the right attributes
[10:11:44] then we have a kyverno policy to mutate and inject the same if not present
[10:11:57] then we have a kyverno policy to validate the config is correct
[10:12:07] if we remove the first step, the latter 2 still apply
[10:12:12] I'm a bit worried about cronjobs
[10:12:22] because they have an intermediary indirection
[10:12:25] what happens if the fields are already present?
[10:12:28] in step 1
[10:12:29] cronjob -> generates a job -> generates a pod
[10:12:47] if the fields are already present, nothing happens, mutation will do nothing
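An illustrative sketch (plain Python, not Kyverno itself) of the point being made: in the admission chain, mutating rules run before validating ones, so any pod spawned by a deployment or cronjob gets defaults injected before validation sees it, and mutation is a no-op when the field is already set. The field name and default value here are made up for the example:

```python
def mutate(pod):
    # inject a default only if the field is missing (no-op otherwise)
    pod.setdefault('securityContext', {}).setdefault('runAsGroup', 1234)
    return pod

def validate(pod):
    # reject pods that reach validation without the required field
    if 'runAsGroup' not in pod.get('securityContext', {}):
        raise ValueError('policy violation')

bare_pod = {'name': 'from-old-deployment'}  # template lacking the field
validate(mutate(bare_pod))  # passes: mutation ran first and filled it in
```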
[10:12:50] * arturo brb
[10:13:13] * arturo back
[10:15:07] as long as the pods get validated we should be ok security-wise I guess
[10:16:02] exactly
[10:18:42] I don't fully grasp how that will look from the user's perspective though, but we will get rid of kyverno anyhow in the next upgrade, so maybe not worth the effort on making it supernice
[10:20:07] I don't think the patch has any user-visible effect
[10:20:23] but it will greatly reduce the PolicyReport noise that we are currently getting
[10:22:48] well, to be fair, it's "real" noise, as in those objects are not passing the validations so they should be changed
[10:23:03] but I see your point
[10:28:56] thanks, deploying
[10:51:52] I would like to merge https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/49 next
[10:52:11] i.e., drop the PSP config for each account
[11:08:26] maybe wait for a day?
[11:08:33] just to make sure nothing breaks with the new stuff
[11:08:49] ok
[11:54:41] thanks
[12:23:20] opinions? https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/52
[12:25:00] hmm, I wonder if that should be a replicaset instead of a deployment
[12:25:20] if not that seems ok
[12:27:24] all upstream docs about replicaset refer to using deployment instead
[12:47:26] hmpf my reimage of cloudcontrol1006 failed, can I just use `--new` to retry?
[12:48:05] why are you reimaging cloudcontrols?
[12:48:37] ooops that was cloudcephosd1006
[12:48:39] not cloudcontrol xd
[12:48:46] phew
[12:49:06] for --new: usually yes, but it depends on how it failed
[12:50:39] yeah
[12:54:52] it got hung on the install screen, when I connected it just showed the install menu on the top and a grey screen
[12:57:10] * arturo food
[12:57:19] oh, now it showed something, network configuration failing (dhcp not found)
[12:59:27] a good hack is to try to switch to another screen window after attaching to the console of a running reimage
[13:00:12] that forces it to re-draw the entire screen
[13:01:46] dcaro: hmm, which part was complaining about no dhcp leases? the bios or debian-installer?
[13:02:20] if the former, I'd check it's configured to boot off of the correct interface, as both of its ports are showing as down on the switch side https://phabricator.wikimedia.org/P65477
[13:03:20] looking
[13:03:24] https://usercontent.irccloud-cdn.com/file/vnqxacfD/image.png
[13:05:12] I think that the interface changed names after the reinstall
[13:05:55] the linux-side interface names changing is not a problem
[13:06:32] the installer sees enp175s0f0np0 and enp175s0f0np0, netbox has a different name
[13:06:42] if you switch to a shell in the installer, is it showing all the interfaces? it should have two built-in 1G nics and then two additional ones on the 10G cards
[13:07:57] yep, sounds like it
[13:07:59] https://www.irccloud.com/pastebin/JTTVBdFg/
[13:08:29] yeah, that looks correct
[13:14:25] hmm, topranks do you remember if there's any special config/manual change needed to reimage cloudcephosd nodes?
[13:15:10] dcaro: no
[13:15:20] not that I recall
[13:16:00] newer OS might use different interface naming but the cookbooks will just use the "primary_ip" attribute for the DHCP part regardless of name
[13:16:11] and then should import the interfaces (with new names) once installed
[13:16:28] dhcp failed i see hmm
[13:16:32] so I wonder if it's trying to DHCP off of the "secondary" port
[13:16:36] what host is it?
[13:16:46] cloudcephosd1006
[13:17:05] taavi: with the name change I'm not sure which one would be the secondary port
[13:17:53] hmm I wonder
[13:18:04] dcaro: I assume the cookbook has now failed?
[13:18:06] probably enp175s0f0np0 would be ens3f0np0 in netbox, but not sure, is there a way I can check which link a port is connected to, from the machine itself?
[13:18:18] topranks: I have not retried
[13:18:20] should I?
[13:18:41] perhaps, I was just checking, I see on the install server there is no specific config for that host right now
[13:18:56] but that makes sense if the cookbook has failed/finished
[13:18:56] okok, let me rerun with `--new`
[13:19:30] in terms of the names you are right, enp175s0f0np0 == ens3f0np0, just look at the numbers at the end, f0, p0 or f1,p1
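A hedged helper sketch for that naming point: different predictable-name schemes (enp175s0f0np0 vs ens3f0np0) still encode the same function/port numbers in the suffix, so those can be used to match interfaces across the two names:

```python
import re

def nic_key(name):
    """'enp175s0f0np0' -> ('0', '0'): the f<N>/p<N> suffix numbers."""
    m = re.search(r'f(\d+)n?p(\d+)$', name)
    return m.groups() if m else None

assert nic_key('enp175s0f0np0') == nic_key('ens3f0np0')  # same physical port
assert nic_key('enp175s0f1np1') == nic_key('ens3f1np1')
```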
[13:19:32] oh, new prompt, puppet version should be 7 right?
[13:20:09] that depends on whether the role has been updated to work with puppet 7
[13:20:16] most have
[13:21:47] should be ok yes (cloudcephosd1001 is running puppet 7)
[13:21:53] the NIC firmware on that host is on 21.60.2
[13:22:18] that is slightly lower than preferred, 21.85, but afaik does not have the driver issue which causes the link to not work inside the debian installer (which we have in 22.x)
[13:22:31] dcaro: yeah if others are puppet 7 go with the same
[13:23:25] okok, running
[13:23:36] dhcp config passed now
[13:24:08] topranks: is there anything I can do to upgrade it at the same time?
[13:24:35] dcaro: the firmware cookbook can be used but let's leave it if you have a reimage running
[13:24:56] okok, I might want to do that as part of the reimages for the rest of osds too
[13:24:58] 21.6 is good afaik so I don't think urgent. but just to say I checked and it's not 22.x, which causes a problem very like what you had
[13:27:29] it's now trying to configure the network with dhcp
[13:27:35] (does not say which interface it's using :/)
[13:27:50] hmm yeah, the speculation it's using the wrong one could be right
[13:28:18] and failed
[13:29:18] when retrying it says it's 'detecting link on enp175s0f0np0', is that the correct one? (it's the one in nb with the tagged interface + ip)
[13:29:36] hmm... why is it tagged?
[13:29:48] are we configuring vlans at the OS level?
[13:29:49] that's the right one
[13:29:55] yes
[13:30:03] but the issue here is that the link is down....
[13:30:07] okok
[13:30:07] https://www.irccloud.com/pastebin/x8izv4H6/
[13:30:16] taavi said the same yes
[13:30:27] should I try to upgrade the firmware?
[13:30:40] I can also try configuring it manually, see if it comes up
[13:30:51] oh, someone is connected to the console too?
[13:31:36] topranks: does it show up now?
[13:32:20] that's me
[13:32:30] yeah it shows up after "ip link set up"
[13:33:22] https://www.irccloud.com/pastebin/WyB6Cgns/
[13:33:26] * dcaro snooping on your debugging :)
[13:33:37] I am really not sure what is going wrong in that case......
[13:33:49] you'd think the debian-installer is doing something very like those commands
[13:33:55] bring link up, run dhcp client
[13:34:08] ip a
[13:34:11] oops
[13:34:24] at least it's not the root password :P
[13:34:32] hahahaha
[13:34:58] that's better yep
[13:35:29] this AI stuff is amazing
[13:35:35] xd
[13:35:35] it's helping me fix it on the console :P
[13:36:08] yep, network looks good now no?
[13:36:11] I've two ideas, but neither are very scientific
[13:36:28] I'm open to trying things
[13:36:42] oh, now it passed
[13:36:52] ugh what the hell
[13:36:56] my two ideas were
[13:37:14] 1) shut down the second port on the switch side, to force it to use the right one, just in case that is what's causing it
[13:37:23] (but it doesn't seem like it as it tries all 4 ports)
[13:37:29] 2) upgrade firmware
[13:37:44] (but seeing as it came up when we forced from shell doesn't seem like firmware/driver incompatibility)
[13:37:58] 3) leave this for now see if it completes?
[13:38:03] sure
[13:38:24] I'll have to reimage a bunch more, so if it's a common issue it will pop up again
[13:38:28] oops, another error xd
[13:38:46] hmm
[13:39:00] this one's different
[13:39:03] https://www.irccloud.com/pastebin/i0IdKGOT/
[13:39:26] I'm lost on that one
[13:40:16] I'll keep digging :) thanks for the debugging session ;)
[13:40:22] maybe something in the DHCP stuff wasn't passed properly cos I ran it manually
[13:40:32] you could try if it happens again
[13:40:36] drop to a shell
[13:40:39] it seems to get there, but it's getting the wrong content
[13:40:47] do the "ip link set dev up"
[13:40:53] same as I did
[13:41:08] but don't run the DHCP from the shell, exit once it's up and see if it continues
[13:41:16] but tbh none of that should be needed
[13:42:52] hmmm, maybe dns
[13:42:55] yep
[13:42:59] wget: unable to resolve host address 'mirrors.wikimedia.org'
[13:43:21] should the dns servers be pingable?
[13:43:23] https://www.irccloud.com/pastebin/Z1fHwEfX/
[13:48:36] topranks: ^?
[13:49:08] from other nodes it seems it should
[13:49:10] https://www.irccloud.com/pastebin/zeG64BcM/
[13:49:18] maybe vlans are off somehow?
[13:53:46] oh, no, something happened, now it does not ping to the gw anymore
[13:53:49] https://www.irccloud.com/pastebin/2wBSKlFO/
[13:54:50] and going back and reconfiguring the network on the install screen seemed to work :/
[13:54:56] * dcaro confused
[13:55:59] passed the debian mirror step now
[14:00:31] perhaps the busybox dhcp client didn't set the dns
[14:00:53] I got it from the resolv.conf file, it was set there :/
[14:44:51] reimage of cloudcephosd1006 finished \o/
[15:01:28] dcaro: puppet versions fixed
[15:01:42] taavi: thanks!
[15:01:51] now it's failing with 'Notice: /Stage[main]/Profile::Cloudceph::Osd/Interface::Ip[osd-cluster-ip]/Exec[ip addr add 192.168.4.6/24 dev ens3f1np1]/returns: Cannot find device "ens3f1np1"', which sounds like the hiera file needing an update for the new interfaces
[15:01:52] just downgrading the puppet client?
[15:02:02] that, and then refreshing the certs for the puppet 5 ca
[15:02:07] taavi: yep, I have to fix that on hiera as the interfaces changed
[15:18:07] dcaro: crazy, not sure what went wrong there with the reimage/dhcp
[15:18:21] I guess we'll see on the others if the same happens, or if it was just some one-off ghost in the machine
[15:34:58] yep, it was pretty weird :/
[15:35:02] fyi. the ceph alerts are mine
[15:52:35] * arturo off
[16:20:08] grafana.wmcloud.org only has 30 days of data nowadays, doesn't it?
[16:21:43] looks like it (for cloudvps stats at least), I think we had set it up to limit by space, not date, but maybe we changed it
[16:22:40] not a big deal
[16:22:50] we just lost the ability to compare the state of things over a long trend ;]
[16:23:24] yeah, retention there is 730h it seems
[16:23:46] it seems we have that yep
[16:23:46] --storage.tsdb.retention 730h
[16:24:07] iirc the prometheus option to do reliable disk usage-based retention is a relatively new thing, and the metricsinfra setup predates that
[16:24:12] ah that explains it
[16:24:33] maybe I did it in cloudmetrics yep
[16:24:35] I came asking cause we had a regression which I tracked to May 13th (more than 30 days)
[16:24:44] and we thus don't have a baseline to compare against
[16:24:47] too bad :]
[16:24:50] but not a big issue
[16:25:07] I guess 30 days is fine. Thank you!
[16:25:15] also currently that 730h means about 70G of metrics, so could relatively easily bump that number to be higher (but not retroactively for already deleted bits of course :-)
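A back-of-the-envelope from the two figures above (730h of retention is about 70G on disk): roughly what bumping the retention would cost, assuming the ingest rate stays flat:

```python
current_hours, current_gb = 730, 70
gb_per_hour = current_gb / current_hours  # ~0.096 GB/h

for days in (30, 90, 180, 365):
    print(f"{days:>3} days -> ~{days * 24 * gb_per_hour:.0f} GB")
# prints: 30 -> ~69 GB, 90 -> ~207 GB, 180 -> ~414 GB, 365 -> ~840 GB
```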
[16:32:35] those emails/alerts about cloudvirt1063 are vriley moving it to a different rack -- no VMs there.
[16:37:51] ack
[16:38:00] I was just going to ask xd
[17:08:18] * dcaro off
[17:08:22] cya tomorrow
[17:37:45] everyone: I just rebuilt the codfw1dev bastion, so expect host key warnings next time you ssh there