[07:12:13] greetings
[09:05:02] morning
[09:17:40] morning
[09:23:40] * godog errand
[09:24:24] today I woke up and decided to move back to sway, so my meetings/screenshots and screen sharing might be a bit flaky today xd
[09:24:34] (while setting everything up again... things change quickly)
[09:31:12] still seeking a +1 on https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1011
[09:32:13] taavi: +1
[09:34:30] thanks
[11:23:09] while I catch up on all backlogs, by any chance has anyone taken over my work on the tracing loki?
[11:26:35] xd, not that I know of
[11:27:32] * dcaro lunch
[11:28:25] ack, thx
[11:41:46] * godog lunch too
[11:50:34] I created T408543 following the discussion in last week's team meeting
[11:50:35] T408543: MTU setting in IPv6 VMs causes issues with Docker - https://phabricator.wikimedia.org/T408543
[11:50:53] I also added it to the agenda for the Network Sync meeting tomorrow
[11:53:21] * dhinus lunch
[12:48:37] neat
[14:25:13] random: has anything changed with the Toolforge Kubernetes ingress in the last 24 hours or so? The iw tool (redirector for 'toolforge:' interwiki links) is not working all of a sudden. It's an ingress-only tool that configures the nginx ingress to redirect to the proper *.toolforge.org hostname.
[14:25:46] https://iw.toolforge.org/bd808-test is my quick test that is failing. It should redirect to https://bd808-test.toolforge.org/
[14:26:29] bd808: hmm, I upgraded ingress-nginx earlier today
[14:27:04] taavi: that sounds related
[14:27:36] agreed. is there a version of tool-iw in toolsbeta or is it deployed in tools only?
[14:28:02] I don't think I ever put one in toolsbeta
[14:28:43] hmh. the ingress-nginx logs in tools are too noisy to be of any use for these kinds of things
[14:28:44] https://wikitech.wikimedia.org/wiki/Tool:Iw has the ingress config on it.
[14:30:47] I managed to get a not-very-helpful validation error out of it: https://phabricator.wikimedia.org/P84313
[14:31:25] ok, it's not liking `kubernetes.io/ingress.class: nginx` in the annotations
[14:31:41] that too, but that doesn't seem to be the main issue?
[14:31:48] > W1028 14:30:15.324067 7 validators.go:243] validation error on ingress tool-iw/iw-domain: annotation temporal-redirect contains invalid value https://$1.toolforge.org/$3$is_args$args
[14:32:10] bd808: sorry I missed this during testing. unfortunately the upstream ingress-nginx changelogs aren't particularly user-friendly.
[14:34:38] This at least points me in the direction of figuring out how to make nginx-ingress-controller:v1.13.3 work. I think it's going to be down for a bit though. I need to take care of morning chore stuff around the house first.
[14:36:00] bd808: would you like me to roll back the ingress update for now?
[14:36:49] taavi: if that is a reasonable option from your side it would help me feel less stressed.
[14:39:12] argh, gitlab's apparently logged me out again
[14:39:14] but yes.
[14:42:16] {{done}}
[14:42:33] also T408571
[14:42:34] T408571: Add functional test for tool-iw handling for new ingress-nginx releases - https://phabricator.wikimedia.org/T408571
[14:46:44] taavi: thanks! I will try to figure out what needs to change today. Worst case solution would just be me making the tool have a webservice that does the redirect instead of asking the nginx ingress to do all the work.
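(A minimal sketch of the fallback bd808 describes above: a tiny Flask webservice that does the redirect itself instead of leaning on the nginx ingress annotations. This is not the actual tool-iw code; the path handling is a guess from the bd808-test example, assuming the first path segment names the target tool and the rest is passed through.)

```python
from flask import Flask, redirect, request

app = Flask(__name__)


@app.route("/<target>", defaults={"rest": ""})
@app.route("/<target>/<path:rest>")
def interwiki(target, rest):
    # e.g. /bd808-test         -> https://bd808-test.toolforge.org/
    #      /bd808-test/foo/bar -> https://bd808-test.toolforge.org/foo/bar
    url = f"https://{target}.toolforge.org/{rest}"
    if request.query_string:
        # preserve any query string on the redirect
        url = url + "?" + request.query_string.decode()
    return redirect(url, code=302)
```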
[14:50:14] bd808: yeah. ingress-nginx upstream has been working on fixing the various associated security footguns, which is good in theory but might indeed break some of our edge cases. and please poke me if I can be helpful with making it work on the new versions.
[16:08:14] andrewbogott: is there an easy way to detect unused security groups in a project?
[16:08:40] not that I know of, you might need to write a script
[16:08:58] I might do that, since I think like half of the toolsbeta groups are unused
[16:24:36] would it be reasonable to make that tool run a small app instead of directly hooking into the k8s ingress?
[16:25:09] even if it's less performant, it decouples the functionality from the k8s infrastructure for future upgrades
[16:47:07] So looking at T408583, part of the issue is https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/233 but part of the issue is that the default for `mount` is set before checking whether the image is a buildpack... so buildpack images now mount nfs by default :'(
[16:47:08] T408583: filelog in buildservice image is no longer relative to $TOOL_DATA_DIR on newly deployed tools - https://phabricator.wikimedia.org/T408583
[16:47:35] I'm not 100% up to speed on what the goal was re setting the defaults, is it safe to make that `None` and then set the default in the validator like it used to be, or is that going to break some client-generation thing?
[16:48:38] buildpacks should not mount by default no
[16:48:41] wait
[16:48:42] depends
[16:48:56] e.g. https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/blob/main/tjf/core/models.py?ref_type=heads#L122 means skipping https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/blob/main/tjf/runtimes/k8s/runtime.py?ref_type=heads#L150
[16:49:09] So I have buildpacks in prod that now have `toolforge.org/mount-storage: all`
[16:50:34] the interaction is a bit more nuanced, the create_job already comes with the job created, the point at which the mount property should be handled comes earlier, in the api layer, if I'm not mistaken
[16:51:40] the validation I think is the one that should be setting the mount property correctly, and removing it if needed from the model set_fields
[16:52:28] The validation checks it but it's already set at that point
[16:52:48] https://pastebin.com/KNgDf2vN so all of these (ok except 2) are buildpacks deployed from components with no mounts specified
[16:53:17] yep, but it can change its value depending on the rest of the properties of the job, and whether it was explicitly passed or not
[16:53:19] I'll poke at the code a bit more, but I'm 93.5% sure this is broken
[16:53:37] I'm not saying it's not broken though, I'm almost sure it is
[16:53:37] xd
[16:53:38] yeah for sure it gets set in a few places
[16:53:57] dynamic defaults are quite hard to manage
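(A generic illustration of the "dynamic default" problem being discussed, not the actual jobs-api models: with pydantic v2 you can use `model_fields_set` to tell an explicit `mount=None` apart from "the caller never sent `mount`", and only then apply an image-dependent default. `is_buildpack_image()` and the `"none"`/`"all"` values are hypothetical stand-ins for the real image-type check and mount options.)

```python
from typing import Optional

from pydantic import BaseModel, model_validator


def is_buildpack_image(image: str) -> bool:
    # stand-in for the real image-type lookup (image_by_container_url etc.)
    return image.startswith("tool-")


class Job(BaseModel):
    image: str
    mount: Optional[str] = None

    @model_validator(mode="after")
    def apply_mount_default(self) -> "Job":
        # Only pick a default when the caller did not pass `mount` at all;
        # buildpack images should not get the NFS mount by default.
        if "mount" not in self.model_fields_set:
            self.mount = "none" if is_buildpack_image(self.image) else "all"
        return self


# Job(image="python3.11:latest").mount  -> "all"
# Job(image="tool-example:latest").mount -> "none"
```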
[16:56:02] yeah.... there are definitely ways but it's a moving target
[16:58:48] hmmm, I think it should be set to the right value at the edges (and the set_fields changed correctly if needed), meaning the validation at the api layer (for input from the users), and when fetching the job from k8s
[17:03:55] Damianz: let me know if you stop working on that, and I can take over if needed, so we get it fixed quickly
[17:04:23] https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/234 I think
[17:04:47] Not tested yet, but that seems to make sense in my head
[17:05:16] I guess this is not noticeable in the workers getting drained as it only happens on a re-deploy, otherwise it would have been a good test for those cpu alerts
[17:13:40] Damianz: added a couple comments, I think we might also have to tweak the way we get the job from k8s, as it should be flagging the default value as unset, but maybe it already does
[17:14:30] let me have a look
[17:17:38] the bit of code is https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/blob/main/tjf/runtimes/k8s/jobs.py?ref_type=heads#L524, double checking
[17:54:02] Having a bit of an argument with the test, essentially because get_dummy_job uses standard images and get_continuous_job_fixture_as_job uses buildpack images
[17:56:41] ack
[17:57:58] hey guys I'm actually off tomorrow so I won't make the network catch-up meeting. we could maybe re-schedule for next week if that works better?
[18:00:31] topranks: I think next week is fine, let me try rescheduling
[18:00:40] topranks: sounds good to me
[18:00:43] dhinus: thanks!
[18:00:53] did it work?
[18:01:05] Damianz: it's not passing the tests, looking (mount: all is still being set for buildpack images when not passing it)
[18:01:07] it moved in my calendar, but I'm not sure I moved it for everyone :)
[18:01:24] yep moved for me!
[18:01:43] https://www.irccloud.com/pastebin/PZPZ4dLj/
[18:02:28] Is that the current rev, or the one where I pushed with the if missing?
[18:03:45] I just pushed what I think is working (sorry, been iterating a bit), the issue I'm having is the `Strips off filelogs if there` test
[18:04:16] let me recheck
[18:04:21] That is specified as 'buildpack' (shared with other tests) so the mount is getting into the diff, but it's actually really 'standard'
[18:04:59] hmm, there were some mocks iirc to return the type of image in some tests, not sure that's the issue, I can give it a look
[18:05:21] there are some ifs in some places
[18:06:21] hmm...
[18:11:55] I think the issue is that it's missing overriding the image in the patch
[18:13:13] I found the line of code doing it - we strip `mount` in `get_continuous_job_fixture_as_job`
[18:16:47] hmm... the test is also always returning a buildpack image (by the mocked image_by_container_url)
[18:18:41] yeah, some of the data in the tests seems to work a bit by chance
[18:25:36] So if I make a test that sends a job asking for logging but no mount with a buildpack-type image to the api endpoint, then it fails properly
[18:26:04] But in the tests we are not setting the image type but just the name in most places, so the type is not consistent, as you say, in other places
[18:27:15] dcaro: I'm going to go for dinner and come back to this later, feel free to get there first... my brain is running warm
[18:28:23] Damianz: heheh, np, thanks!
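(Continuing the hypothetical `Job` sketch further up, not the real jobs-api test suite: roughly the shape of the regression test being described, minus the filelog part — a buildpack image with no mount passed must not default to the NFS mount, while an explicitly passed mount is left alone.)

```python
def test_buildpack_image_does_not_mount_by_default():
    # no `mount` passed: the image type decides the default
    assert Job(image="tool-example:latest").mount == "none"
    assert Job(image="python3.11:latest").mount == "all"


def test_explicit_mount_is_preserved():
    # `mount` passed explicitly: the validator must not override it
    assert Job(image="tool-example:latest", mount="all").mount == "all"
```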
[18:37:04] got that test passing with a little conditional... ugly but works. added a test that hits the api and shows things failing as expected
[18:37:17] code needs cleaning up and testing but I think it's ok.... really afk now
[18:37:23] just did the same sorry, might have messed up, just force-pushed
[18:37:46] https://www.irccloud.com/pastebin/3K01P7W1/
[18:37:51] no worries, I did say I was going lol
[18:38:31] need to stop tricking myself into fixing these things and go fix things like the heater not working... it's getting chilly now xd
[18:39:32] ack, I'll fix the stuff here, bonapp!
[18:42:22] * dhinus off
[19:52:44] * dcaro off
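(An afterthought on taavi's earlier question about detecting unused security groups: a sketch of the kind of script andrewbogott suggested, using openstacksdk. The cloud name is a placeholder, and "unused" here only means "not attached to any port in the project", so groups referenced from other groups' rules would still show up as unused.)

```python
import openstack


def unused_security_groups(cloud_name: str = "toolsbeta"):
    conn = openstack.connect(cloud=cloud_name)

    # all security groups visible in the current project scope
    all_groups = {sg.id: sg for sg in conn.network.security_groups()}

    # collect every group that is attached to at least one port
    used_ids = set()
    for port in conn.network.ports():
        used_ids.update(port.security_group_ids or [])

    return [sg for sg_id, sg in all_groups.items() if sg_id not in used_ids]


if __name__ == "__main__":
    for sg in unused_security_groups():
        print(f"{sg.id}  {sg.name}")
```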