[05:43:50] hello folks!
[05:44:12] I have a doctor appt this morning, will join later on
[08:42:12] hello folks, I am mostly back, got some eye drops from the doctor so I'll be on and off :)
[08:42:15] https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&var-datasource=thanos&var-site=codfw&var-cluster=k8s-mlstaging
[08:42:18] metrics are flowing :)
[08:42:25] (staging cluster)
[08:46:05] 10Machine-Learning-Team, 10ORES: ORES gives internal error on an invalid model_info parameter - https://phabricator.wikimedia.org/T279271 (10elukey) Pull request merged, thanks a lot for the work! Let's keep the task open until we deploy ORES.
[08:55:26] \o
[08:55:44] Office: rearranged. Body: sore all over.
[08:56:07] o/
[08:56:08] :)
[08:56:46] aiko: o/ I found out my commit to fix the logging issue in ORES - https://github.com/wikimedia/ores/commit/a00ab76d9b2755842ad832e1298474b0867fb475 - we didn't see it applied since it is a commit for the ores submodule, not the wheels one (that we modified)
[08:56:54] makes sense
[08:58:18] Ah, the joys of git submodules. I have been bitten by that in the past as well. Luckily, my current fave language (go) handles things differently, so you rarely if ever have submodules.
[08:58:48] submodules also really don't integrate all that well with having separate branches, IMO
[09:00:49] I hate them as well :)
[09:03:39] Ah I see.. there are too many submodules haha
[09:07:32] I just merged a pull request from a member of the community for the ORES submodule, I'll try to deploy the changes later on in the week
[09:08:12] aiko: I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/803457/2/modules/admin/data/data.yaml to grant you access to the ORES nodes, it will take a bit but things are in progress :)
[09:10:08] aiko: about Ray workers - I was reading the upstream doc yesterday, and I noticed that they mentioned having multiple/different models in the same pod
[09:10:37] so I am now wondering if we could group some small wikis into the same pod, using a transformer or similar to route the request to the right model
[09:10:51] maybe something for the future, but it could be interesting to check
[09:11:09] (of course if we can use a transformer etc.., otherwise it is fine predictor only)
[09:17:10] How much de-dupe would that mean for memory and disk?
[09:17:37] (I mean over the non-ML/LW-specific bits we would dedupe)
[09:19:17] in theory the models will be loaded in separate python processes
[09:19:25] in theory, this is the bit that I have no idea about
[09:20:00] How much data do we ship with each model? I vaguely remember some chunky aspell stuff.
[09:20:00] so the new pod would need a little more resources
[09:20:41] IIRC aspell debian packages, but we install all in the docker image, and the language processing module
[09:21:34] I feel like especially R/O data should be shared as much as possible. Docker layering takes care of some of that.
[09:22:31] in theory all the aspell and nltk libs should be sharable between processes
[09:23:12] I need to set some time aside and examine the running pods in eqiad, to see what resources they actually use in the quiescent state, and how much of that is shared between pods
[09:26:25] sure makes sense
[09:44:22] elukey: about the ticket, thanks for that :)
[09:44:39] About ray worker, yeah I have also read about the use case to have different models in the same pod. The idea sounds good. Maybe we could discuss it more in-depth.
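
To make the "group small wikis into one pod" idea above a bit more concrete: a minimal sketch of how several per-wiki models could be registered with a single KServe ModelServer, with the caller routing by model name in the URL path. This is not the team's actual predictor; the wiki list, model paths, and the stubbed predict() body are made-up placeholders.

```python
# Minimal sketch: several small-wiki revscoring models registered with one
# KServe ModelServer, so a single pod serves all of them and clients route via
# /v1/models/{name}:predict. Wiki list and model paths are placeholders.
import kserve
from revscoring import Model


class WikiGoodfaithModel(kserve.Model):
    def __init__(self, name: str, model_path: str):
        super().__init__(name)
        self.model_path = model_path
        self.scorer = None
        self.ready = False

    def load(self):
        # Each model is loaded once at startup; grouping wikis in one pod
        # means all of these binaries ship in the same image/volume.
        with open(self.model_path) as f:
            self.scorer = Model.load(f)
        self.ready = True

    def predict(self, payload: dict, headers: dict = None) -> dict:
        # Feature extraction is omitted here; this only shows the routing.
        rev_id = payload.get("rev_id")
        return {"model": self.name, "rev_id": rev_id, "prediction": None}


if __name__ == "__main__":
    small_wikis = ["cswiki", "etwiki", "lvwiki"]  # hypothetical grouping
    models = [
        WikiGoodfaithModel(f"{w}-goodfaith", f"/mnt/models/{w}/model.bin")
        for w in small_wikis
    ]
    for m in models:
        m.load()
    # One server hosts all of them; clients pick the model via the URL path.
    kserve.ModelServer().start(models)
```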
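On the related question of how much of the read-only data (aspell dictionaries, nltk corpora, shared libraries) is actually shared between the Python processes in a quiescent pod, one rough way to check from inside a running container is to compare Rss and Pss per process: Pss counts shared pages only once per sharer, so a large Rss-Pss gap means the data really is shared. A sketch, with the "python" process-name filter just as an example:

```python
# Rough sketch: compare Rss vs Pss for the Python worker processes inside a
# pod using /proc/<pid>/smaps_rollup. A big gap between Rss and Pss means a
# lot of the resident data (aspell, nltk, .so files) is shared between them.
import os
import re


def mem_kb(pid: str) -> dict:
    values = {}
    try:
        with open(f"/proc/{pid}/smaps_rollup") as f:
            for line in f:
                m = re.match(r"(Rss|Pss|Shared_Clean|Private_Clean):\s+(\d+) kB", line)
                if m:
                    values[m.group(1)] = int(m.group(2))
    except (FileNotFoundError, PermissionError):
        pass
    return values


for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/comm") as f:
            comm = f.read().strip()
    except FileNotFoundError:
        continue
    if "python" not in comm:  # arbitrary filter for the worker processes
        continue
    stats = mem_kb(pid)
    if stats:
        print(pid, comm, stats)
```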
About the models, I think you're right, they will be loaded in separate python processes.
[10:06:11] I had a chat with Riccardo and a possible hack for revscoring could be to use https://requests-mock.readthedocs.io/en/latest/ in our code
[10:06:51] basically we'd mock urllib/request calls in a special object, that in turn would use async/await with the kserve http client
[10:07:13] it is dangerous but it could be a way forward
[10:07:41] I am going to send an email to Aaron to ask if the "caches" parameter in revscoring could be used instead
[10:07:48] if not, we can try with request mock
[10:07:57] (please shout if this is a terrible idea :D)
[10:26:29] sent an email to Aaron to follow up on the 'caches' parameter
[10:26:37] going to lunch now but I'll keep digging into it
[12:24:40] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (10Ottomata) eventgate is fine! Hm, we should probably have an eventgate-test instance that produces to kafka test-eqiad, eh? Unless, you can test in beta / depl...
[12:33:39] mmm from my tests mocking revscoring, the extractor makes 4 HTTP calls to the mw api
[12:33:59] some of them seem similar
[13:03:16] aiko: interview
[13:53:26] 10Machine-Learning-Team: Investigate high latencies registered by the ml-serve api control plane - https://phabricator.wikimedia.org/T310073 (10elukey)
[13:57:08] "it is dangerous but it could be a way forward" - Luca Toscano, June 7th, 2022
[13:58:01] ahahaha
[13:59:12] I thought I heard a "Duun duuun duuuuuun" musical cue in the distance
[14:38:05] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (10elukey) a:03elukey
[14:55:14] chrisalbon: o/ if/when you have a moment could you review/approve https://phabricator.wikimedia.org/T310044?
[14:56:53] kevinbazira_: gate+submit for the AQ ukwiki and wikidatawiki is running now
[15:07:34] thanks for the merge klausman.
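
Going back to the requests-mock idea from earlier in the log, a rough sketch of what the hack could look like: do the MediaWiki API round-trips up front with an async client, then let the blocking revscoring extractor run against canned responses so it never does real HTTP inside the event loop. The URLs, query parameters, and payload shapes are placeholders, and a real version would need matchers keyed on query params; Mocker.request_history is also a convenient way to confirm the "4 HTTP calls, some of them similar" observation above.

```python
# Rough sketch of the "prefetch async, replay with requests-mock" hack.
# Placeholders throughout; the extractor call itself is stubbed out.
import asyncio

import aiohttp
import requests
import requests_mock

MW_API = "https://en.wikipedia.org/w/api.php"


async def prefetch(params_list):
    # Fetch everything the extractor will ask for, concurrently.
    async with aiohttp.ClientSession() as session:
        async def one(params):
            async with session.get(MW_API, params=params) as resp:
                return await resp.json()
        return await asyncio.gather(*(one(p) for p in params_list))


def extract_with_canned_responses(payloads):
    # requests-mock intercepts what the sync extractor sends via `requests`
    # and replays the prefetched payloads in order.
    with requests_mock.Mocker() as m:
        m.get(MW_API, [{"json": p} for p in payloads])
        # ... here we would call extractor.extract(rev_id, model.features) ...
        result = requests.get(MW_API).json()  # stand-in for the extractor
        # request_history shows exactly which calls the extractor made.
        print(len(m.request_history), "mocked call(s)")
        return result


params = [{"action": "query", "prop": "revisions", "revids": "12345", "format": "json"}]
print(extract_with_canned_responses(asyncio.run(prefetch(params))))
```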
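And for the alternative being asked about in the email to Aaron: a heavily hedged sketch of what we hope the 'caches'/'cache' parameter allows, i.e. injecting pre-fetched values so extract() never calls the MediaWiki API itself. Whether extract() actually accepts a cache keyed by datasources like this is exactly the open question; the datasource names, model filename, and keyword argument below are assumptions, not confirmed API.

```python
# Heavily hedged sketch: pre-populate revscoring's dependency cache so the
# extractor does not need to hit the MediaWiki API. The `cache=` usage and the
# datasource keys are assumptions pending Aaron's reply; filenames are fake.
import mwapi
from revscoring import Model
from revscoring.datasources import revision_oriented
from revscoring.extractors import api

rev = revision_oriented.revision

with open("enwiki.goodfaith.model") as f:  # placeholder model file
    model = Model.load(f)

session = mwapi.Session("https://en.wikipedia.org", user_agent="lw-cache-test")
extractor = api.Extractor(session)

# Values we fetched ourselves (e.g. asynchronously, before calling extract).
prefetched = {
    rev.text: "== Some article text fetched elsewhere ==",
    rev.comment: "copyedit",
}

# Assumed behaviour: datasources already present in the cache should not
# trigger API calls; anything missing would still be fetched.
feature_values = list(extractor.extract(12345678, model.features, cache=prefetched))
print(model.score(feature_values))
```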