[08:52:18] Hello folks
[08:52:46] I am rolling out the new istio bullseye images
[08:52:49] - dse done
[08:52:53] - aux done
[08:53:09] - staging + staging-codfw done (thanks to Janis for the rollout)
[08:53:16] - ml-staging-codfw done
[09:09:56] elukey: <3
[09:18:46] jayme: o/ are you going to take care of wikikube?
[09:33:21] yeah, will do in a bit
[09:34:35] <3
[09:34:43] - ml-serve done
[09:36:39] will also file the changes for cert-manager
[11:33:10] jayme: Updated the cert-manager code review for staging only with proper docker tags: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/978640
[11:33:13] (not urgent)
[11:34:45] elukey: +1ed ;)
[11:34:59] <3
[11:35:07] ok if I deploy them?
[11:35:13] absolutely
[11:35:39] we should leave them running for ~3 days to catch a full cert rotation before going to prod
[11:35:46] definitely
[11:43:30] deployed to all staging envs
[11:54:03] nice, thanks
[12:19:26] FYI, I'm deploying kube-state-metrics to the dse cluster.
[13:23:30] I filed a change for the ml-serve clusters as well :)
[14:06:46] kamila_: o/
[14:07:09] when you have a moment I'd have some questions about the kube-state-metrics dashboard
[14:07:15] (trying to understand its values now)
[14:07:36] sure, now works for example
[14:08:20] yep! I am checking the k8s-mlserve codfw cluster
[14:08:53] I don't get why memory used is all red, same for cpu requested
[14:10:15] uh, probably a bug in my queries? checking...
[14:11:09] ew, what? I thought I'd fixed that
[14:12:01] ahh okok, so I don't have to worry :D
[14:13:01] (I mean about all the red signs, I thought mlserve was already under capacity etc..)
[14:13:18] lemme know if I can help
[14:14:09] it's grafana being annoying and taking revenge on me for wanting to be fancy and have a dynamic max :D
[14:14:41] the max-from-query value isn't being applied correctly, which I apparently fixed in some places and not others :D thanks :D
[14:17:28] should be fixed now, I hope
[14:19:48] yep, thanks! One nit that would be really useful: expand the "i" info boxes to explain a little what the metrics represent.. e.g. cpu/memory usage is the containers' total over the k8s allocatable (so nobody has to go check which metrics are used)
[14:21:19] ok, will do, thanks
[14:21:38] CPU used is also something that I don't fully grasp right now - is it the ratio between how many cpus are used by containers vs the allocatable ones?
[14:21:50] yeah
[14:22:09] I kinda tried to explain that in the info boxes, but I may have done a bad job at it
[14:22:37] nono please, I am trying to ask all the questions so it will be Luca-proof (hence everybody will get it at first)
[14:22:51] :D
[14:23:13] what I am trying to get is what a value like 1.76 represents.. is it like I am oversubscribing?
[14:23:25] oh
[14:23:46] no, it's an absolute value
[14:24:15] sum of all usage in what I'm hoping is k8s CPU units :D
[14:24:20] (but I may or may not be correct :D)
[14:24:42] ahhh ok, then I am very puzzled, I am pretty sure the usage is higher right now
[14:25:30] I am far from sure about the CPU usage queries tbh, I need to double-check them, I'm just not sure how to do that
[14:25:36] btw, feel free to edit the dashboard, I won't mind :D I don't really have a lot of experience with k8s, so my attempts to make the dashboards useful aren't actually based on experience
[14:25:52] okok, no problem, going to keep exploring :)
[14:25:54] (and also I may have written the queries wrong, which is a separate problem but also a problem :D)
[14:25:57] thanks a lot for this work!
[14:26:26] happy to be of some use :-)
[14:27:57] tbh the CPU usage gauge is probably not useful
[14:28:14] CPU usage is way too spiky for a single value to actually mean anything
[14:28:33] the graph underneath is probably more meaningful
[14:28:46] maybe we can plot it if it is spiky
[14:29:44] yeah, it's plotted in the "CPU usage by namespace" thing (that graph is stacked, so the total can be eyeballed)
[14:29:58] * elukey nds
[14:29:59] *nods
[14:30:02] but the numbers on that graph are small too, so if you think that it should be higher, then my query may be wrong
[14:33:38] need to double-check, I thought it was more but I may be wrong
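
[editor's note: for readers following the dashboard discussion above, a minimal PromQL sketch of the kind of queries involved. These are illustrative only, not the dashboard's actual queries; the metric names are the standard cAdvisor and kube-state-metrics ones, but the label filters and rate window are assumptions.]

    # CPU used by all containers, in k8s CPU units (cores),
    # from the standard cAdvisor counter (container!="" drops pod-level series):
    sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))

    # the same usage as a ratio of the cluster's allocatable CPUs,
    # using the kube-state-metrics kube_node_status_allocatable gauge:
    sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
      / sum(kube_node_status_allocatable{resource="cpu"})

    # per-namespace breakdown, suitable for a stacked
    # "CPU usage by namespace" graph whose total can be eyeballed:
    sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))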