[07:16:31] cdanis: very nice! thank you for the link, cc volans
[07:52:28] wikitech-static icinga checks have been flapping as per -cloud-feed, I've opened T409029 though I can't immediately follow up/investigate further
[07:52:30] T409029: Flapping wikitech-static icinga alert - https://phabricator.wikimedia.org/T409029
[09:21:10] morning
[09:21:20] cdanis: awesome 👀
[09:29:38] Damianz: that is something that has been on my mind for a very long time! nice! though running long-term stuff on the bastion is not allowed (will be killed at some point)
[10:04:24] quick review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1201011 adding elasticsearch metrics gathering
[10:04:43] (we have the exporters set up, but not pulling from prometheus)
[10:27:21] jq is really useful for mangling dashboard jsons xd, this is to change the datasource of all panels in a dashboard: `cat mydasboard.json | jq '(.. | objects | select(has("datasource"))) |= (.datasource = { "uid": "$datasource" })'`
[10:27:26] (for future reference)
[11:10:23] Another quick review, to the deploy cookbook, not sure how it was running before :/ https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1201029
[11:11:39] * dhinus reading the backscroll...
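For reference, the jq filter shared at 10:27 can be sketched in Python for cases where jq is not at hand. This is a minimal sketch, not the method used in the channel: the function name `set_datasource` and the sample dashboard are hypothetical, and it assumes the dashboard is plain JSON in which any object carrying a "datasource" key should be repointed.

```python
import json


def set_datasource(node, uid="$datasource"):
    """Recursively set "datasource" to {"uid": uid} on every JSON object that has that key."""
    if isinstance(node, dict):
        if "datasource" in node:
            node["datasource"] = {"uid": uid}
        for value in node.values():
            set_datasource(value, uid)
    elif isinstance(node, list):
        for item in node:
            set_datasource(item, uid)
    return node


# Hypothetical dashboard fragment, only for illustration.
dashboard = {
    "title": "demo",
    "panels": [
        {
            "id": 1,
            "datasource": "prometheus",
            "targets": [
                {"datasource": {"type": "prometheus", "uid": "abc"}, "expr": "up"}
            ],
        },
        {"id": 2, "type": "row"},
    ],
}

updated = set_datasource(dashboard)
print(json.dumps(updated, indent=2))
```

Like the jq version, this touches nested objects (for example per-target datasources) as well as the panels themselves, and leaves objects without a "datasource" key alone.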
[11:11:53] dcaro: I also bumped into metal3.io because it's listed (among other tools) in the k8s SIG gdoc
[11:11:56] https://docs.google.com/document/d/18-dlIEOJjWqOXQ7doW7izYoa7-gLDAkQCVwQJwfyEqk/edit?tab=t.0#heading=h.n3asovhcktib
[11:13:34] oh, I found it on the list of presentations for the kubernetes community days suisse romande (that happens at CERN in december, would not be able to attend though)
[11:15:49] https://community.cncf.io/events/details/cncf-kcd-suisse-romande-presents-kcd-suisse-romande/ if you are interested :)
[11:19:51] nice, thanks
[11:42:41] volans: I created an mr during the weekend to split the logs components, I thought it was going to be harder :) https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1038
[11:43:20] it creates a 'common' dir with the loki stuff, and then each has its own values and directory, so they can be deployed separately and use different versions/values for loki stuff
[11:43:38] (without having to create *Tracing, extra variables)
[11:45:13] dcaro: I saw it, not sure if it makes sense to have a "common" component though, I see there is no existing use case for that. If we want to split, having some duplication might not even be the worst of it, given that maybe with time they will diverge even more. That said having them both in the same component is also not too bad I guess?
[11:46:00] I've just sent what I have in MRs so that people can have a look. there is still a corner case that doesn't work locally on lima-kilo and seems to be a pretty annoying rabbit hole, but maybe is not a blocker?
[11:46:36] once the feedback for the existing MRs is in I can totally adjust it to split fully, partially split or leave as is based on what's the general consensus
[11:49:23] dcaro: yeah... I'm thinking about having a container ssh there and run tests (I do this from CI), which also sees if the bastion is working. Would need to make another account just for that to avoid any cross tool privilege escalation. I was more interested to see how hard it was and apparently it was 3 coffee and a sandwich worth
[11:54:34] volans: the main issue with the same component is that they are deployed at the same time, and tracked in the same code, when they are two different instances, managed by different people with different goals, they only happen to use the same software (loki)
[11:56:49] agree, but I would like to have a final answer that is shared by everyone because my first attempt was with a new component, then I was told to use the existing one, now there is a third option to have it half shared. I'm happy to change it but I'd like it to be a shared decision and not go back and forth each time.
[11:57:21] would the "common" approach support the fact that the two instances might use a different upstream version of the chart?
[11:58:03] jobs-api:231, jobs-api:232 need the needs review label adding if someone feels like being the gitlab bot (others also, but they are less interesting)
[12:15:44] volans: yep, the common is just a library of values, not really a component (can't be deployed, but other components include values from there, and can override any independently)
[12:15:56] * dcaro lunch
[13:56:03] semi-random question: what's the code/repo that pushes out the "bump" MRs? I'd like to tweak the description to link to the specific releases, e.g. https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1023
[14:22:39] godog: that's in cicd https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/blob/main/toolforge-cd/create_toolforge_deploy_mr.yaml?ref_type=heads
[14:23:08] dcaro: neat, thank you!
[14:43:11] dcaro: I'm pretty sure that got the numbers in https://phabricator.wikimedia.org/T405283#11294911 from you; a couple of posts down Alex is coming up with very different numbers (from https://grafana.wmcloud.org/d/8GiwHDL4k/infra-kubernetes-cluster-overview?orgId=1&from=now-2d&to=now&timezone=utc&var-cluster=P8433460076D33992) . Do you have time to take a look and comment on the difference?
[14:54:18] andrewbogott: sure
[15:30:24] hmmm... now I'm not sure about the memory measurements
[15:30:32] should we use free memory or available memory?
[15:30:35] https://www.irccloud.com/pastebin/AVlRbiL1/
[15:30:41] there's a huge difference
[15:30:47] (more than I expected)
[15:32:22] I think kubernetes uses available to schedule things
[15:32:31] so maybe that's the right one to use?
[16:34:16] The thread about Ceph on Trixie starts with a question from
[16:34:25] 'Andrew' and I spent a while worrying that it was me and I forgot...
[16:34:25] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2I6ZQLING7XKWL2ED2CQC47YQJDAME7W/#D4FAH2IYLBN55UXW7IA3HMJAUC2Z6JNT
[16:34:32] but I'm pretty sure it wasn't me :)
[16:37:00] LOL
[16:43:49] quick review https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/77
[16:44:02] hahahaha
[17:09:41] dhinus: fixed your comment, can you double check? I need to merge to test properly
[17:11:23] dcaro: +1d
[17:11:32] thanks!
[17:11:47] I didn't check line-by-line but I couldn't find anything else that was wrong :)
[17:12:23] using the gitlab-ci-local to test and check all that :)
[17:12:37] * dhinus crosses fingers
[17:14:42] maybe you can also use gitlab-ci-local to test a branch of the gitlab-ci repo? as in, "including" from a branch that is not merged yet?
[17:14:59] (not a big issue to merge first and commit a fix if needed...)
[17:21:19] yep, the issue is that there are extra indirections, as in the one I include includes another, so I have to add the extra 'ref: ....' to each one that does an include, and that means having a branch full of extra 'ref' that are not meant to be merged
[17:21:38] so merging allows me to not have to add all those to the branch, and as it's not used anywhere yet, there's no problem
[17:21:50] (as in, it can't break anything)
[17:22:25] this is working both locally and remotely :) https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/98
[17:23:26] ah ok there's a double include
[17:23:50] yep, we do that a lot :}
[17:23:53] yep agreed there's no risk in merging
[17:28:22] ansible-lint takes forever to set up...
[17:31:22] ohhh, dave holland is one of the founders of chick corea, that's why he rang a bell...
[17:32:23] sorry, founder of circle, with chick corea
[17:54:15] another quick one, moving the toolforge-ci stuff to trixie https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/78
[18:01:27] Raymond_Ndibe: ^ I'm clocking off, if you are working on toolforge, can you please take care of moving all to trixie? (ex. reviewing + merging that and creating all the MRs on the other repos if not there already, did a couple)
[18:05:21] dcaro: +1d
[18:05:28] thanks!
[18:06:59] now I'm off xd
[18:07:01] * dcaro off
[18:07:02] cya!
[19:22:52] * dhinus off
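On the free vs. available memory question raised at 15:30: on Linux, /proc/meminfo reports both `MemFree` (memory not used for anything) and `MemAvailable` (the kernel's estimate of what can be handed to new workloads without swapping, which counts reclaimable page cache). A host with a large page cache can therefore show a big gap between the two. The sketch below only illustrates that gap; the sample figures are made up, the helper names are hypothetical, and whether the pasted numbers came from /proc/meminfo or from a metrics exporter is an assumption.

```python
# Made-up sample in /proc/meminfo format, only for illustration.
SAMPLE_MEMINFO = """\
MemTotal:       16384000 kB
MemFree:          512000 kB
MemAvailable:    9216000 kB
Buffers:          256000 kB
Cached:          8192000 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def free_vs_available(text):
    """Return (MemFree, MemAvailable) in kB from /proc/meminfo-style text."""
    info = parse_meminfo(text)
    return info["MemFree"], info["MemAvailable"]


free_kb, available_kb = free_vs_available(SAMPLE_MEMINFO)
print(f"free: {free_kb} kB, available: {available_kb} kB, gap: {available_kb - free_kb} kB")
```

On a real host the same function can be fed the contents of /proc/meminfo, e.g. `free_vs_available(open("/proc/meminfo").read())`.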