[03:54:49] (InfServiceHighMemoryUsage) firing: (2) High Memory usage detected in Inference Service - https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Alerts#Inference_Services_High_Memory_Usage_-_InfServiceHighMemoryUsage_alert - https://alerts.wikimedia.org/?q=alertname%3DInfServiceHighMemoryUsage
[07:54:49] (InfServiceHighMemoryUsage) firing: (2) High Memory usage detected in Inference Service - https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Alerts#Inference_Services_High_Memory_Usage_-_InfServiceHighMemoryUsage_alert - https://alerts.wikimedia.org/?q=alertname%3DInfServiceHighMemoryUsage
[09:56:24] morning :)
[10:38:54] (PS1) MPGuy2824: Migrate usage of Database::delete, insert, update and upsert to QueryBuilder [extensions/ORES] - https://gerrit.wikimedia.org/r/1007862 (https://phabricator.wikimedia.org/T358831)
[11:21:35] (PS4) MPGuy2824: Migrate usage of Database::delete, insert, update and upsert to QueryBuilder [extensions/ORES] - https://gerrit.wikimedia.org/r/1007862 (https://phabricator.wikimedia.org/T358831)
[11:40:22] (CR) Ladsgroup: Migrate usage of Database::select to SelectQueryBuilder in ORES (5 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/1007755 (https://phabricator.wikimedia.org/T312454) (owner: MPGuy2824)
[11:54:49] (InfServiceHighMemoryUsage) firing: (2) High Memory usage detected in Inference Service - https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Alerts#Inference_Services_High_Memory_Usage_-_InfServiceHighMemoryUsage_alert - https://alerts.wikimedia.org/?q=alertname%3DInfServiceHighMemoryUsage
[12:32:10] hello folks!
[12:32:24] hey Luca
[13:09:40] Morning all
[13:10:38] hello hello
[13:10:50] heya
[13:48:35] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1007903
[13:48:55] my soul is sad and lost in tears and yaml
[13:51:04] jokes aside, what happened - I tested the new control plane (kserve 0.11) in staging by killing an articlequality pod, and the new one didn't come up. In the logs I found out that the kube api tried to call the mutate webhook for kserve, that answered with "NOOOOOOOO"
[13:51:26] and in the kserve control plane logs I found out that the yaml parsing code didn't like the missing comma
[13:54:44] That... should have been more obvious :)
[13:54:59] As in: the software should make it so, not that you should've caught it sooner.
[13:58:58] you are still a believer
[13:59:10] :D
[13:59:11] I'd say optimist :)
[14:01:13] elukey: Can I merge the alert patch? I've deleted our setup and added a README with pointers
[14:02:01] +1ed!
[14:05:57] I have also https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1007907
[14:24:15] +1d right back :)
[14:27:28] <3
[14:36:06] I see k8s is still sending us love
[14:43:02] YAML, the "love" that keeps on giving.
[14:44:58] it is getting better over time, kserve updates have been smooth so far
[14:45:06] (also they just released 0.12)
[15:18:10] Probably because they saw you working on 0.11.2
[15:25:39] ahahah yes
[15:26:05] jokes aside, I think they have moved to a quarter-ish release cadence
[15:26:20] so we may think about upgrading every 6 months or so?
[15:29:46] Sounds doable.
[16:08:27] ok finally kserve 0.11's control plane fully works on staging
[16:09:05] buuut I found a race condition that will require some extra work before the next deployment
[16:09:49] klausman: (if you have time) - do you recall that before my paternity leave I changed the default bundle used by the storage initializer to use the PKI+Puppet crt, and not the puppet one?
[16:10:06] Vaguely, yes
[16:10:22] in puppet private, that eventually translated into setting AWS_CA_BUNDLE for the storage-init container
[16:11:00] now that change was never rolled out to all namespaces, so the corresponding setting (that is stored in a secret) still points to the puppet ca crt
[16:11:38] so if we deploy the new storage init now it fails, since it can't find the old file (wmf-certificates changed, now the Puppet_CA file crt is renamed to a versioned one etc..)
[16:11:47] Oh, damn
[16:12:02] Why was it never rolled out?
[16:12:04] so I'd need to roll out all the new Secret updates, they will not trigger a deployment of pods
[16:12:50] So the only way to trigger a redeploy would be to delete them to force a restart?
[16:12:52] I think it is not a problem if we don't change the docker image of the storage init, so at the time I probably thought that it would have been ok to go out with any deployment
[16:13:28] so before we keep going with kserve 0.11, we need to roll out the secret update everywhere
[16:13:41] shouldn't take much but I'll do it on Monday
[16:13:48] going to update all staging namespaces now
[16:13:58] yeah, sounds good. If you want any help, ping me.
[16:14:08] really subtle side effect
[16:15:43] It's always a bit annoying when changes like that can linger undeployed
[16:18:04] ok staging should be good
[16:19:35] :+1:
[16:20:41] also, not related https://gitlab.wikimedia.org/repos/releng/blubber/-/merge_requests/61
[16:20:57] so blubber always uses --break-system-packages for bookworm
[16:21:07] (when installing pip packages)
[16:21:16] Mh. Interesting choice
[16:25:01] yeah indeed
[16:37:20] Thinking that rationale to the end, why use the distro Python at all? Install something to /usr/local/python3.x and use that for whatever is running in the container.
[16:48:09] logging off, have a nice weekend folks o/
[16:48:40] I think it is probably the meaning of --break-system-packages, if no collision happens by chance etc.. (like you apt-install something non-python and python packages are pulled in)
[16:48:44] aiko: o/
[16:48:45] you too!
[16:52:42] I'm heading out as well, splendid weekend to y'all
[16:53:05] o/
[17:02:54] heading out as well o/
[19:34:09] night all!
[19:34:13] have a great weekend
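
Note on the 16:12 exchange about Secret updates: Kubernetes does not restart pods just because a Secret they reference has changed, which is how an update like the AWS_CA_BUNDLE one can sit undeployed. One common workaround (nothing in the log says this team uses it) is to put a hash of the Secret's contents into a pod template annotation, so that changing the Secret also changes the template and triggers a normal rolling update. A minimal Python sketch of the idea, where the annotation key "checksum/secret", the function names and the file paths are all hypothetical:

```python
import hashlib
import json
from typing import Mapping


def secret_checksum(secret_data: Mapping[str, str]) -> str:
    """Return a stable SHA-256 hex digest of a Secret's key/value data."""
    # Sort keys so the digest does not depend on key ordering.
    canonical = json.dumps(dict(secret_data), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def pod_template_annotations(secret_data: Mapping[str, str]) -> dict:
    """Build the annotation that ties the pod template to the Secret contents.

    "checksum/secret" is a hypothetical annotation key chosen for this sketch.
    """
    return {"checksum/secret": secret_checksum(secret_data)}


# Hypothetical values: placeholder paths, not the real certificate locations.
old = {"AWS_CA_BUNDLE": "/placeholder/old-puppet-ca.crt"}
new = {"AWS_CA_BUNDLE": "/placeholder/new-bundle.crt"}

old_annotations = pod_template_annotations(old)
new_annotations = pod_template_annotations(new)

# A differing annotation means a differing pod template, hence a new rollout.
print("rollout needed:", old_annotations != new_annotations)
```

With something like this rendered at deploy time, updating only the Secret is enough to replace the pods, so there is no need to delete them by hand.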