[07:50:06] hello folks
[07:55:28] I am publishing the kserve 0.9 docker images for the control plane
[07:55:40] and https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/880897 should be ok to review (helm chart's upgrade)
[07:56:01] in theory after the above steps we could upgrade staging and see how it goes
[07:58:56] isaranto: o/ not sure how you want to proceed with kserve 0.9 but if it is not too horrible we could update the py3.9 change for revscoring as well
[08:00:13] Hey! Lemme finish up with the monkey patch thingy. I am testing some models that failed
[08:00:24] I hope there aren't many more..
[08:00:59] Machine-Learning-Team, artificial-intelligence, revscoring: Update revscoring dependencies to fix security reports - https://phabricator.wikimedia.org/T325366 (elukey) Open→Stalled Waiting for T325528 to be completed before proceeding.
[08:05:27] ah yes yes
[08:41:46] Since the packages for the language dictionaries change in bullseye I updated this file that loads german dicts https://github.com/wikimedia/revscoring/pull/531/commits/bacdab3bcf8ae5617c98e7e9f5946d17ffb32aeb and then noticed some failing tests on the serbian language, so I am investigating that as well https://github.com/wikimedia/revscoring/actions/runs/3913044227/jobs/6688451263
[08:43:36] IIRC I got some serbian lang-related failures as well at the time
[08:43:40] when I was testing
[09:01:14] you mean in the tests or runtime (e.g. checking the model)?
[09:01:41] the funny thing is that the tests passed before I switched the dictionaries for the correct ones
[09:10:46] tests tests
[09:14:12] αψκ
[09:14:20] i mean ack
[09:14:33] 😄
[09:18:55] isaranto: if you are ok I'd proceed with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/880897 in staging, to see how it goes
[09:19:10] (basically upgrade the control plane)
[09:19:16] I was just reviewing this
[09:19:21] ah okok super
[09:19:32] do all the changes in kserve.yaml come from the new CRD?
[09:20:14] there are new CRDs and changes in the old ones, especially the open api specs for webhooks (I think to keep multiple api versions)
[09:20:34] I keep all the custom things in the README to re-apply them every time
[09:20:58] for example we need to inject stuff like
[09:20:59] image: {{ $.Values.docker.registry }}/{{ $.Values.kserve.controller.image }}:{{ $.Values.kserve.controller.version }}
[09:21:25] the rest is basically as close as possible to the horror that upstream provides
[09:22:23] I may be wrong here or misunderstand things but my question is the following: could we add the official kserve chart as a dependency on our chart? I mean instead of committing the code in our repo
[09:24:08] I mean adding
[09:24:08] ```dependencies:
[09:24:08]   - name: kserve
[09:24:08]     version: 0.9.0
[09:24:08]     repository: REPO_URL``` in our Chart.yaml
[09:25:15] dependencies work a little weird in our set up, we had to move away from them recently (Janis found some corner cases with them). Upstream seems to provide a good chart now that I see (it wasn't there last time that I checked), so we could simply replace our chart with it
[09:25:23] namely importing it every time etc..
[09:25:47] I am sure there will be some custom things to add, and we could slowly / over-time send patches to them etc..
[09:26:01] I can open a task, but for the moment I'd prefer to keep our version if possible
[09:27:19] (for example, we don't use the kserve-proxy thing that they deploy for fetching metrics over https, and in their chart it is mandatory)
[09:33:30] Machine-Learning-Team: Move the kserve custom helm chart to the upstream one - https://phabricator.wikimedia.org/T327241 (elukey)
[09:33:40] created --^
[09:35:02] ok, clear
[09:35:23] I mean this is my understanding of the whole thing, please let me know if it seems completely off
[09:35:32] I am open to change things :)
[09:39:31] Machine-Learning-Team, Patch-For-Review: Upgrade ml clusters to kserve 0.9 - https://phabricator.wikimedia.org/T325528 (elukey)
[09:45:13] isaranto: ok to proceed then?
[09:45:51] yes!
[09:46:29] super thanks <3
[09:49:17] all right trying to upgrade staging
[09:54:34] ok so the new controller is up
[09:54:42] I tried to delete a pod and it came back nicely
[09:55:56] I see some errors like
[09:55:57] {"level":"error","ts":1674035667.0866048,"logger":"controller.inferenceservice","msg":"Reconciler error","reconciler group":"serving.kserve.io","reconciler kind":"InferenceService","name":"ruwiki-articlequality","namespace":"revscoring-articlequality","error":"fails to create IngressConfig: Unable to parse ingress config json: invalid character '\"' after object key:value pair"
[09:56:06] not sure if they are new or not
[10:02:44] do u want me to take a look?
[10:09:28] nono I'll check if there is anything wrong
[10:17:41] * isaranto afk lunch and errand
[10:21:34] Machine-Learning-Team, Patch-For-Review: Upgrade ml clusters to kserve 0.9 - https://phabricator.wikimedia.org/T325528 (elukey) Noticed the following error: ` { "level": "error", "ts": 1674036814.1732135, "logger": "controller.inferenceservice", "msg": "Reconciler error", "reconciler group": "...
[11:35:15] * elukey lunch!
[12:18:44] I found some incompatibilities when we upgrade numpy. np.float has been deprecated - it is used in some checks. We can use np.floating instead
[12:21:55] shall we do first kserve upgrade or python upgrade?
[12:23:09] I would suggest to do kserve first since it is a smaller one (fingers crossed) in case the python upgrade becomes blocking
[12:23:25] The python upgrade seems well at the moment. I am testing the models
[13:33:14] at the moment I am working on fixing the errors with the serbian dictionaries. They have to do with the updated spell-check/dictionary packages in bullseye. for some of the languages we are switching from myspell to hunspell and there seem to be mismatches between them which cause the tests to fail
[13:36:00] there are some pods with `CrashLoopBackOff` errors in the namespace `revscoring-editquality-goodfaith` on ml-serve-codfw. in eqiad the same pods are ok, so could it be related to yesterday's outage?
[14:25:19] back :)
[14:25:33] checking ml-serve-codfw
[14:29:36] thanks!
[14:30:00] deleted some pods, they should get back to normal soon
[14:30:13] there were some calico issues in codfw yesterday, pretty sure it was outage-related
[14:31:43] all running now!
[14:38:52] isaranto: re numpy - are you referring to bumping it to 1.2x right?
[14:46:40] yes. numpy upgrade to 1.2x goes with kserve, all the other stuff are independent
[14:51:17] okok then we can definitely focus on it before 3.9 if you want
[14:51:43] but I don't want to derail your work on 3.9, so whatever you prefer of course
[15:52:37] so with the comma fix now it seems that the kserve controller works fine
[15:52:45] I'll do more checks but it looks good
[16:31:16] Machine-Learning-Team: Investigate if the mediawiki.revision-score stream can be broken down into multiple ones with ChangeProp - https://phabricator.wikimedia.org/T327302 (elukey)
[16:31:24] task opened --^
[16:35:04] Machine-Learning-Team, Data-Engineering-Planning, Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (elukey) Hi @diego! I opened T327302 to investigate a way to provide streams...
[16:38:49] * elukey afk for a bit
[17:50:01] logging off for the evening folks, ttl!
[19:50:40] Machine-Learning-Team, Data-Services, Wikilabels, cloud-services-team: postgresql on wikilabels-db-* needs some kind of puppet management of pg_hba.conf - https://phabricator.wikimedia.org/T209396 (fnegri)
[21:44:40] Machine-Learning-Team, Cloud-VPS, cloud-services-team, Documentation: Document recommended process for installing vendor provided package upgrades in Wikimedia VPS - https://phabricator.wikimedia.org/T169247 (fnegri)
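
Re the np.float note at 12:18:44: a minimal sketch of what swapping np.float for np.floating in a type check could look like. The helper names `is_float_value` and `is_float_array` are hypothetical and are not taken from revscoring; the actual checks there may be shaped differently.

```python
import numpy as np


def is_float_value(value):
    """Return True for Python floats and any NumPy floating-point scalar.

    Hypothetical helper: np.floating is the abstract base class of NumPy's
    float scalar types (float16/32/64), so it also covers np.float32, which
    is not a subclass of the built-in float.
    """
    return isinstance(value, (float, np.floating))


def is_float_array(values):
    """Return True if the array-like has a floating-point dtype."""
    return np.issubdtype(np.asarray(values).dtype, np.floating)


if __name__ == "__main__":
    print(is_float_value(np.float32(1.5)))  # True
    print(is_float_value(3))                # False
    print(is_float_array([1.0, 2.0]))       # True
    print(is_float_array([1, 2]))           # False
```

np.float was only an alias for the built-in float, so where a concrete type is needed (rather than a check), plain `float` or `np.float64` is probably the closer replacement.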
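
Re the "Unable to parse ingress config json" error at 09:55:57 and the comma fix mentioned at 15:52:37: a small sketch of how a missing comma between two key/value pairs makes a JSON document unparseable, and how it could be caught before deploying. The config strings and their keys below are made-up placeholders, not the real KServe ingress config, and `check_json` is a hypothetical helper.

```python
import json

# Placeholder fragments (assumption): the real ingress config is not shown in the log.
BROKEN = '{"gateway": "example-gateway" "service": "example-service"}'
FIXED = '{"gateway": "example-gateway", "service": "example-service"}'


def check_json(text):
    """Return None if text parses as JSON, otherwise a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"{err.msg} (line {err.lineno}, column {err.colno})"
    return None


if __name__ == "__main__":
    print(check_json(BROKEN))  # reports a missing ',' delimiter
    print(check_json(FIXED))   # None
```

Running a check like this against the rendered config before applying the chart would likely surface the problem earlier than the controller's reconciler logs.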