[01:31:09] Machine-Learning-Team, Add-Link, Growth-Team: Fix Armenian sentence tokenization bug in the link recommendation algorithm - https://phabricator.wikimedia.org/T327371 (kevinbazira) @kostajh, yes @MGerlach and I will work on this.
[01:33:29] Machine-Learning-Team, Add-Link, Growth-Team, Research: Establish process for periodically refreshing link recommendation models - https://phabricator.wikimedia.org/T327212 (kevinbazira) a:kevinbazira Thank you for filing this @kostajh. I will work on periodically updating the link recommenda...
[01:35:35] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (kevinbazira)
[08:00:27] good morning folks
[08:22:56] o/
[08:40:02] elukey: can I try to upgrade kserve charts to 0.10? What do you think about this, is it too soon?
[08:40:46] At the moment I can't think of any other solution to proceed. Otherwise we revert the changes and go with python 3.7 for now
[08:47:35] isaranto: we can leave the control plane at 0.9 in theory, and upgrade kserve's dep to 0.10 only for revscoring. It should work fine, let's try it
[08:47:42] then we upgrade all images to 0.9 etc..
[08:48:01] and after the k8s migration we can probably check 0.10 as well
[08:48:07] does it sound good?
[08:58:22] not sure if I understand 100%. u mean just update the python package of kserve to 0.10?
[08:58:36] or the chart as well?
[08:58:46] yes only the python package
[08:58:54] the kserve control plan is basically the chart
[08:58:57] *plane
[08:59:09] it deploys only the k8s controller
[08:59:35] in theory we can run different versions (like we do now in staging, control plane 0.9 and docker images of model servers 0.8)
[08:59:49] I would not attempt a control plane update now if possible
[08:59:54] let's keep some variables fixed
[09:00:06] ok going to try that
[09:01:08] lemme know if I can help with the k8s upgrade
[09:02:25] you have enough on your plate don't worry :)
[09:02:46] I am only prepping it for the moment, it is best if we solve the current issues first to avoid fixing things
[09:06:57] ack!
[09:41:47] (PS3) Ilias Sarantopoulos: (WIP) - feat: revscoring kserve upgrade to 0.10 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/882689 (https://phabricator.wikimedia.org/T325528)
[09:42:06] Machine-Learning-Team, Patch-For-Review: Upgrade ml clusters to kserve 0.9 - https://phabricator.wikimedia.org/T325528 (isarantopoulos) There is a breaking change in kserve 0.10. as the headers object is made available in functions like `preprocess` and `predict` and we get the following error `TypeErro...
[10:03:19] Machine-Learning-Team, Patch-For-Review: Investigate if the mediawiki.revision-score stream can be broken down into multiple ones with ChangeProp - https://phabricator.wikimedia.org/T327302 (elukey) Still not able to see an HTTP request on the Lift Wing side (triggered by ChangeProp). I did the following...
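The breaking change described in the T325528 comment can be illustrated without kserve itself. This is a minimal sketch, assuming only what the comment says: the 0.10 caller hands a headers object to functions like `preprocess` and `predict`. `OldStyleModel`, `NewStyleModel`, `call_with_headers` and the `rev_id` payload key are hypothetical stand-ins, not kserve classes or the real revscoring code.

```python
from typing import Any, Dict, Optional


class OldStyleModel:
    """Hypothetical model written for a caller that passes only the payload."""

    def preprocess(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        return inputs


class NewStyleModel:
    """Hypothetical model that also accepts headers.

    The default value keeps callers that pass only the payload working.
    """

    def preprocess(
        self, inputs: Dict[str, Any], headers: Optional[Dict[str, str]] = None
    ) -> Dict[str, Any]:
        return inputs


def call_with_headers(model: Any, payload: Dict[str, Any]) -> str:
    """Stand-in for a serving framework that forwards request headers."""
    try:
        model.preprocess(payload, {"content-type": "application/json"})
        return "ok"
    except TypeError as exc:
        # An old-style signature cannot take the extra argument.
        return f"TypeError: {exc}"


payload = {"rev_id": 12345}  # assumed payload shape, for illustration only
print(call_with_headers(OldStyleModel(), payload))
print(call_with_headers(NewStyleModel(), payload))
```

Giving `headers` a default of `None` is one way to let the same model code run under both the old and the new calling convention while images are being upgraded.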
[10:22:35] (CR) Ilias Sarantopoulos: Deployment script examples (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/881899 (owner: Ilias Sarantopoulos)
[10:37:32] 7~/12
[10:37:35] uff sorry :)
[10:50:38] (PS4) Ilias Sarantopoulos: feat: revscoring kserve upgrade to 0.10 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/882689 (https://phabricator.wikimedia.org/T325528)
[10:52:16] (PS5) Ilias Sarantopoulos: feat: revscoring kserve upgrade to 0.10 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/882689 (https://phabricator.wikimedia.org/T325528)
[10:52:31] the checks in the above patch will fail until revscoring 2.11.10 is published
[10:57:00] isaranto: approved the revscoring pull request
[10:57:08] thaaank uuu
[10:57:12] (CR) CI reject: [V: -1] feat: revscoring kserve upgrade to 0.10 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/882689 (https://phabricator.wikimedia.org/T325528) (owner: Ilias Sarantopoulos)
[10:57:36] (CR) Elukey: [C: +1] feat: revscoring kserve upgrade to 0.10 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/882689 (https://phabricator.wikimedia.org/T325528) (owner: Ilias Sarantopoulos)
[11:00:44] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/882689 (https://phabricator.wikimedia.org/T325528) (owner: Ilias Sarantopoulos)
[11:02:41] my battle with changeprop keeps going, no idea why it doesn't contact liftwing
[11:15:28] Machine-Learning-Team: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 (elukey)
[11:24:17] * elukey lunch!
[12:34:34] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/882689 (https://phabricator.wikimedia.org/T325528) (owner: Ilias Sarantopoulos)
[12:42:49] (CR) Kevin Bazira: [C: +1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/882658 (https://phabricator.wikimedia.org/T325198) (owner: Ilias Sarantopoulos)
[12:43:49] kevinbazira: o/ do you have any idea about https://phabricator.wikimedia.org/T327212 ?
[12:44:00] we can discuss it later on during the team meeting of course
[12:44:11] yes, I do.
[12:44:22] No problem :)
[12:44:30] sure we can also do it in here :)
[12:44:44] what would be your suggestion? Otherwise it seems a neverending task
[12:44:56] is there any chance of automation etc..?
[12:45:17] (CR) Ilias Sarantopoulos: [C: +2] feat: revscoring kserve upgrade to 0.10 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/882689 (https://phabricator.wikimedia.org/T325528) (owner: Ilias Sarantopoulos)
[12:45:18] It is a recurring task that will be fully automated by TrainWing.
[12:46:01] kevinbazira: sure but TrainWing is probably going to be online a year from now (we have too many things to do)
[12:46:20] Happy to hear suggestions.
[12:46:23] I meant if we have an intermediate solution for the moment
[12:47:18] yeah this is why I am asking you, since you have the most knowledge in the project :) Do you have ideas about automating what you are currently doing? (I mean if it is possible and/or you already tried etc..)
[12:48:32] The intermediate plan was to update the models every 6 months using the same pipeline we use to train them.
[12:52:18] (Merged) jenkins-bot: feat: revscoring kserve upgrade to 0.10 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/882689 (https://phabricator.wikimedia.org/T325528) (owner: Ilias Sarantopoulos)
[12:52:24] kevinbazira: okok, but would it require the same manual work?
[12:54:48] I am trying to figure out if we need to set up some dev time on this etc..
[12:56:47] dev time? to set up a TrainWing of sorts before TrainWing itself?
[12:57:10] yes we could think about using Airflow from Data Engineering or similar
[12:57:36] it could also tell us what we need in trainwing, kubeflow's pipelines are good but we don't know much about them yet
[12:57:58] (also fetching data from Data Engineering land may be difficult etc.. on k8s)
[12:58:46] anyway, we could also simply create a python script that every 6 months runs the pipeline scripts in batches etc.. notifying users when ready
[13:01:09] a starter could be to see if https://wikitech.wikimedia.org/wiki/Add_Link#Dataset_pipeline could be run for multiple models in parallel
[13:01:26] That could be something to look into ... we'll discuss this in today's meeting so that we can get insights from the Sr. Engineer - Ilias and Chris on the initial stages of architecting TrainWing.
[13:01:29] (I am very ignorant about the procedure so this is why I am asking if you have ideas)
[13:05:14] ah ok :)
[13:11:21] Airflow (especially since it is already supported) is a good idea. But if we put effort in TrainWing we shouldn't put that much effort in each separate project. Also it has to do with the process we follow after training e.g. evaluate model training results etc. before deployment.
[13:12:13] All in all I think it boils down to this: how much time does/will it take to retrain whenever we need vs how much time does it take to create a pipeline that runs training
[13:12:22] isaranto: I mentioned Airflow but we could probably avoid it, it is something that is meant to be run every 6 months.
[13:12:54] so even a quick script for the moment is fine, my aim is to avoid having Kevin spend months training models every time :)
[13:13:03] yes I agree
[13:13:24] TrainWing is in my opinion too far in the future, we should start automating our use cases now and then adapt them for TrainWing
[13:14:00] if you sum all the time that Kevin has been working on the model training I am pretty sure that there is enough justification to spend time in automation
[13:14:17] or just scope out in a spike what can be done
[13:14:37] even improving Kevin's life by cutting down some hours is worth it in my opinion
[13:15:21] +1 for just creating some scripts. "scripting before automating"
[13:16:02] IIUC we already have scripts, see https://wikitech.wikimedia.org/wiki/Add_Link#Dataset_pipeline, but they are not organized in a way in which we can run multiple of them at the same time
[13:21:08] * elukey out for a little walk!
[13:43:24] Machine-Learning-Team, Patch-For-Review: Upgrade python from 3.7 to 3.9 in docker images - https://phabricator.wikimedia.org/T325657 (isarantopoulos) This task has an overlap with https://phabricator.wikimedia.org/T325528. In order to solve the errors mentioned previously we need to upgrade numpy to 1.22...
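The "quick script" idea from the discussion above (run the existing pipeline scripts for several models at the same time, then report back) could look roughly like this. It is only a sketch under stated assumptions: the real pipeline entry point is not given in the log, so `demo_cmd` is a placeholder command, and the wiki names are example values.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_batch(
    wikis: List[str],
    build_cmd: Callable[[str], List[str]],
    max_workers: int = 4,
) -> Dict[str, int]:
    """Run one pipeline command per wiki in parallel; return exit codes by wiki."""

    def run_one(wiki: str) -> int:
        # check=False: one failing wiki should not abort the rest of the batch.
        return subprocess.run(build_cmd(wiki), check=False).returncode

    # Threads are enough here because each worker only waits on a subprocess.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(run_one, wikis))
    return dict(zip(wikis, codes))


def demo_cmd(wiki: str) -> List[str]:
    # Placeholder: stands in for the real (unspecified) pipeline script.
    return [sys.executable, "-c", f"print('would run the pipeline for {wiki}')"]


if __name__ == "__main__":
    results = run_batch(["hywiki", "fawiki"], demo_cmd, max_workers=2)
    failed = [wiki for wiki, code in results.items() if code != 0]
    print("failed:", failed)
```

Collecting exit codes per wiki instead of stopping at the first error keeps a long batch going and gives a short list of wikis to rerun, which is also the natural place to hook in a "notify users when ready" step.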
[13:44:18] Updated the images to test on staging with the new changes https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/883171
[13:58:15] Machine-Learning-Team: [Liftwing testing] - Post deployment testing - https://phabricator.wikimedia.org/T327787 (isarantopoulos)
[14:04:29] (PS9) Ilias Sarantopoulos: ci: add pre-commit checks in all images [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/882658 (https://phabricator.wikimedia.org/T325198)
[14:10:49] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (kevinbazira)
[14:11:23] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (kevinbazira) @kostajh, we published datasets for all 19/21 models that passed the evaluation in this round.
[14:12:04] (CR) Ilias Sarantopoulos: [C: +2] ci: add pre-commit checks in all images [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/882658 (https://phabricator.wikimedia.org/T325198) (owner: Ilias Sarantopoulos)
[14:15:33] (PS4) Ilias Sarantopoulos: Deployment script examples [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/881899
[14:18:12] (Merged) jenkins-bot: ci: add pre-commit checks in all images [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/882658 (https://phabricator.wikimedia.org/T325198) (owner: Ilias Sarantopoulos)
[14:21:17] isaranto: green light to deploy :)
[14:27:59] thanks Luca!
[14:35:28] here we go...
[14:47:29] good news?
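For the post-deployment testing tracked in T327787, a per-model check can be a single POST with the Host header set independently of the URL. A minimal sketch using only the standard library follows; the base URL, Host value, model name and `rev_id` payload are made-up placeholders, and the `/v1/models/<name>:predict` path assumes the KServe v1 inference protocol rather than anything confirmed in this log.

```python
import json
import urllib.request
from typing import Any, Dict


def build_predict_request(
    base_url: str, host_header: str, model_name: str, payload: Dict[str, Any]
) -> urllib.request.Request:
    """Build a POST to a v1-style predict path with an explicit Host header."""
    return urllib.request.Request(
        url=f"{base_url}/v1/models/{model_name}:predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Host": host_header, "Content-Type": "application/json"},
        method="POST",
    )


def check(req: urllib.request.Request, timeout: float = 10.0) -> bool:
    """Send the request; True on HTTP 200 with a JSON body that has no 'error'."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
            return resp.status == 200 and "error" not in body
    except (OSError, ValueError):
        # OSError covers URLError/HTTPError and timeouts; ValueError covers bad JSON.
        return False


if __name__ == "__main__":
    # All values below are placeholders, not real Lift Wing endpoints.
    request = build_predict_request(
        "https://inference.example.internal:30443",
        "enwiki-goodfaith.example.org",
        "enwiki-goodfaith",
        {"rev_id": 12345},
    )
    print(request.get_method(), request.full_url, request.get_header("Host"))
```

Keeping request construction separate from sending makes it easy to loop over a table of models and to inspect exactly what would be sent before pointing the script at staging.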
[15:12:02] Machine-Learning-Team: [Liftwing testing] - Post deployment testing - https://phabricator.wikimedia.org/T327787 (calbon) a:isarantopoulos
[15:13:46] Machine-Learning-Team: [Liftwing testing] - Post deployment testing - https://phabricator.wikimedia.org/T327787 (isarantopoulos) Try to use https://wikitech.wikimedia.org/wiki/Httpbb instead of python/bash scripts.
[15:15:56] Machine-Learning-Team: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (calbon) a:klausman
[15:16:15] Machine-Learning-Team: Investigate if the mediawiki.revision-score stream can be broken down into multiple ones with ChangeProp - https://phabricator.wikimedia.org/T327302 (calbon) a:elukey
[15:16:31] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 (calbon) a:elukey
[15:33:50] Machine-Learning-Team: Automate publishing python packages to PyPI - https://phabricator.wikimedia.org/T325561 (calbon) Open→Resolved
[15:34:15] Machine-Learning-Team, ContentTranslation, Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (calbon) Open→Resolved
[15:34:26] Lift-Wing, Machine-Learning-Team, Epic: Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (calbon)
[15:34:28] Lift-Wing, Machine-Learning-Team, Epic: API Gateway Integration - https://phabricator.wikimedia.org/T288789 (calbon) Open→Resolved
[15:34:50] Machine-Learning-Team, Patch-For-Review: Reduce number of published docker images for revscoring models - https://phabricator.wikimedia.org/T323586 (calbon) Open→Resolved
[15:35:27] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Configure LW Inference services on API GW config - https://phabricator.wikimedia.org/T323916 (calbon) Open→Resolved
[15:35:29] Lift-Wing, Machine-Learning-Team, Epic: API Gateway Integration - https://phabricator.wikimedia.org/T288789 (calbon)
[15:36:24] Lift-Wing, Machine-Learning-Team, ORES, artificial-intelligence, and 2 others: Create transclusion markup for ORES model card classes. - https://phabricator.wikimedia.org/T324448 (calbon) Open→Resolved
[15:36:29] Lift-Wing, Machine-Learning-Team, ORES, artificial-intelligence, and 2 others: Developing the `algo-accountability` repository - https://phabricator.wikimedia.org/T290746 (calbon)
[15:37:08] Lift-Wing, Machine-Learning-Team: Decide external URL scheme (on API GW) for models on Lift Wing - https://phabricator.wikimedia.org/T319178 (calbon) Open→Resolved
[15:37:11] Lift-Wing, Machine-Learning-Team, Epic: API Gateway Integration - https://phabricator.wikimedia.org/T288789 (calbon)
[15:39:27] Lift-Wing, Machine-Learning-Team: No healthy upstream and upstream connect error in Lift Wing - https://phabricator.wikimedia.org/T322196 (calbon) Open→Resolved
[15:40:05] Machine-Learning-Team: Update existing fawiki-articlequality isvc with new model on LiftWing - https://phabricator.wikimedia.org/T322614 (calbon) Open→Resolved
[15:40:07] Machine-Learning-Team: Deploy new fawiki articlequality model to ORES and LiftWing - https://phabricator.wikimedia.org/T319373 (calbon)
[15:40:12] Machine-Learning-Team: Retrain fawiki articlequality model - https://phabricator.wikimedia.org/T317531 (calbon) Open→Resolved
[15:40:14] Machine-Learning-Team: ORES doesn't support fawiki family template - https://phabricator.wikimedia.org/T314302 (calbon)
[15:40:19] Machine-Learning-Team: ORES doesn't support fawiki family template - https://phabricator.wikimedia.org/T314302 (calbon) Open→Resolved
[15:40:26] Machine-Learning-Team: Deploy new fawiki articlequality model to ORES and LiftWing - https://phabricator.wikimedia.org/T319373 (calbon) Open→Resolved
[15:40:28] Machine-Learning-Team: ORES doesn't support fawiki family template - https://phabricator.wikimedia.org/T314302 (calbon)
[16:13:11] all the servers in staging were deployed successfully!
[16:13:41] now I need to verify them one by one. a couple I have already tested work fine :D
[16:14:11] super!
[16:14:28] isaranto: if you want you can experiment with httpbb, I can help with a basic config
[16:15:20] Ok, I'll spend tomorrow doing that then. play with httpbb for staging models and then extend it for production
[16:15:56] isaranto: gimme a sec, I can create a quick config file now
[16:17:54] isaranto: I see this from a quick check though
[16:17:55] {"error":"AttributeError : module 'tornado' has no attribute 'httpclient'"}
[16:19:50] what did u test? I got something similar on drafttopic on staging
[16:20:13] sorry, draftquality I mean `"error":"AttributeError : module 'tornado' has no attribute 'web'`
[16:20:32] goodfaith
[16:20:36] enwiki
[16:23:30] ah ok I think it was when triggering events
[16:23:30] except tornado.httpclient.HTTPError as e:
[16:23:31] AttributeError: module 'tornado' has no attribute 'httpclient'
[16:24:01] ok without the event it works
[16:25:34] so httpbb doesn't allow setting the Host header separately, sigh
[16:25:39] it grabs it from the main url
[16:26:34] I have to debug the draftquality model. It seems that there is an issue with nltk
[16:27:11] I'll check httpbb tomorrow, if I can't make it work with that a good old python script will do the job
[16:32:29] yes also POST + etc.. doesn't seem well supported
[17:32:46] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (kostajh) >>! In T308134#8553530, @kevinbazira wrote: > @kostajh, we published datasets for all 19/21 models that passed the evaluation in this round....
[17:36:35] going afk for the evening folks, have a good one!
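The two tornado AttributeErrors quoted in this log have the shape Python produces when code does `import tornado` and then touches `tornado.httpclient` or `tornado.web` without that submodule having been imported anywhere in the process. The log does not confirm this as the cause here; the sketch below only demonstrates the mechanism with a throwaway package (`demo_pkg` and its `client` module are made up for the demo).

```python
import importlib
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def submodule_visibility() -> Tuple[bool, bool]:
    """Return whether pkg.client is an attribute before/after importing it."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / "demo_pkg"
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text("")  # package imports nothing itself
        (pkg_dir / "client.py").write_text("class HTTPError(Exception):\n    pass\n")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            pkg = importlib.import_module("demo_pkg")
            before = hasattr(pkg, "client")  # submodule not loaded yet
            importlib.import_module("demo_pkg.client")
            after = hasattr(pkg, "client")  # import bound it on the package
        finally:
            # Clean up so the demo can be run again in the same interpreter.
            sys.path.remove(tmp)
            sys.modules.pop("demo_pkg.client", None)
            sys.modules.pop("demo_pkg", None)
    return before, after


print(submodule_visibility())
```

If that is what happened with the model servers, an explicit `import tornado.httpclient` (or `import tornado.web`) in the module that uses it would remove the reliance on some other package importing the submodule first.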