[08:02:18] (CR) Elukey: [C: +2] Removes old unused packages. [research/ores/wheels] - https://gerrit.wikimedia.org/r/762894 (https://phabricator.wikimedia.org/T300195) (owner: Halfak)
[08:03:40] (PS1) Elukey: Bump the wheels submodule to pick up the latest changes [services/ores/deploy] - https://gerrit.wikimedia.org/r/763178 (https://phabricator.wikimedia.org/T300195)
[08:04:00] (CR) Elukey: [V: +2 C: +2] Bump the wheels submodule to pick up the latest changes [services/ores/deploy] - https://gerrit.wikimedia.org/r/763178 (https://phabricator.wikimedia.org/T300195) (owner: Elukey)
[08:11:31] ORES, artificial-intelligence, articlequality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (elukey) @Halfak great work, deployment to Be...
[08:12:14] just deployed ores in beta, all good with the latest changes from Aaron
[08:16:08] --
[08:16:16] the apply() meetup is in https://www.youtube.com/watch?v=GXqK6HlYG6M&ab_channel=Tecton
[08:16:32] there are a lot of interesting talks, going to watch some of them
[08:31:42] kevinbazira: o/
[08:31:55] do you know why we have "bnwiki-reverted" in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/762533/2/helmfile.d/ml-services/revscoring-editquality/values.yaml ?
[08:32:08] I am confused, I thought it was either goodfaith or damaging
[08:38:07] elukey: o/
[08:38:27] nope, I also saw it yesterday and was wondering the same :)
[08:38:35] We'll ask Andy.
[08:39:04] ack :)
[08:39:38] kevinbazira: if you want we can proceed with your change, I left some comments about a simplification in the config that I made yesterday
[08:39:49] it should remove some repeated stuff
[08:39:53] (and simplify)
[08:40:11] thanks. let me fix it.
[08:40:53] super, you can skip the rebase part, I wrote it before seeing the reverted model name :D
[08:51:45] mmm so "reverted" seems to be a model type, at least according to what is being uploaded to swift
[08:54:28] I see from https://ores.wikimedia.org/v3/scores/
[08:54:31] okok
[08:56:33] (going afk for a bit to get a coffee, bbiab)
[09:10:19] elukey: when you're back: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/762777/
[09:10:41] ^ I have added a second patch that uses simplified settings.
[09:18:50] kevinbazira: thanks! Merged, currently running puppet on the deploy node
[09:18:58] do you want to deploy?
[09:19:24] yep, thank you for merging. let me deploy now.
[09:27:19] both eqiad and codfw deployments have been completed successfully.
[09:27:45] diffs looked fine on my end. is everything ok on your end elukey?
[09:29:43] kevinbazira: checking in a sec
[09:29:50] kevinbazira: did you check the pods?
[09:29:58] if they are up etc..
[09:30:10] hadn't yet
[09:30:20] doing so right now
[09:33:33] On the ml-sandbox we use "kubectl get po -A" to check. so i've run it and got:
[09:33:41] "The connection to the server localhost:8080 was refused - did you specify the right host or port?"
[09:34:01] do you use a different command to check?
[09:34:13] kevinbazira: nono I mean the production pods
[09:34:22] ahh sorry
[09:34:37] added some info in https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/Deploy#Test_your_model_after_deployment about how to do it
[09:37:35] thanks for the docs. the pods are up and running :)
[09:37:37] NAME READY STATUS RESTARTS AGE
[09:37:38] bswiki-damaging-predictor-default-6jfpt-deployment-757d6c927k7m 2/2 Running 0 12m
[09:37:38] bswiki-goodfaith-predictor-default-f9t76-deployment-649968r8gq4 2/2 Running 0 12m
[09:37:38] cawiki-damaging-predictor-default-chdgz-deployment-79cf786dhtdb 2/2 Running 0 12m
[09:37:38] cawiki-goodfaith-predictor-default-6tcv9-deployment-8cfb7db7bjc 2/2 Running 0 12m
[09:38:10] nice!
[09:41:02] so let's also check the inference.discovery.wmnet endpoint
[09:41:09] I usually do it from ml-serve-ctrl1001
[09:41:53] ah ok so the calls return a 500
[09:42:04] but this will be fixed by Andy's patch
[09:42:23] https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/762937
[09:42:40] (at least this is my understanding)
[09:54:42] yep, that will most likely fix the 500 error
[10:01:51] I created https://wikitech.wikimedia.org/wiki/Machine_Learning/Onboarding (shamelessly copied from the Data Engineering team :D)
[10:01:54] aiko: --^ :)
[10:08:10] I'd also like to move the Deploy page to something more official
[10:08:32] maybe MachineLearning/LiftWing/Deploy ?
[11:08:39] elukey: now I can login to horizon! but there is a msg on the top right shows -- Error: Unable to retrieve limits information. Details 'totalVolumesUsed'
[11:09:07] mmm weird, it is probably a horizon issue (it is not always super stable :D)
[11:09:20] in the top left corner you should have a drop down menu with projects
[11:09:29] I added you to machine-learning and deployment-prep
[11:09:33] can you see them?
[11:11:18] the next thing is to see if you can ssh to ml-sandbox.machine-learning.eqiad1.wikimedia.cloud :)
[11:11:33] this is a VM that we use to test models with a basic kubernetes set up
[11:11:48] (it is expensive to run it locally so we have a common place where devs can go)
[11:14:25] Oh I found them! machine-learning and deployment-prep. they are in the bastion drop down menu
[11:14:31] nice!
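The pod check from 09:37 (every pod should show READY n/n and STATUS Running) could be automated roughly as follows. This is only a sketch: it assumes the whitespace-separated column layout of the listing pasted above, and `unready_pods` is a hypothetical helper name, not something from the deploy docs.

```python
def unready_pods(kubectl_output: str) -> list[str]:
    """Return names of pods that are not fully ready or not in Running state.

    Assumes whitespace-separated columns: NAME READY STATUS RESTARTS AGE.
    """
    bad = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to be a pod row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            bad.append(name)
    return bad


sample = """\
NAME READY STATUS RESTARTS AGE
bswiki-damaging-predictor-default-6jfpt-deployment-757d6c927k7m 2/2 Running 0 12m
cawiki-goodfaith-predictor-default-6tcv9-deployment-8cfb7db7bjc 2/2 Running 0 12m
"""
print(unready_pods(sample))  # prints []
```

An empty list means all listed pods look healthy; it says nothing about whether the inference endpoint itself answers correctly.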
[11:18:01] aiko: can you ssh to the ml-sandbox as well?
[11:20:29] elukey: cool yes I can access now
[11:20:54] elukey: aikochou@ml-sandbox
[11:21:23] super :)
[11:22:10] updated the guide as well
[11:22:38] going afk for lunch!
[11:22:50] aiko: we can meet later on if you want, lemme know!
[11:23:10] elukey: ok! :)
[14:24:38] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (elukey)
[14:25:37] chrisalbon, accraze - created https://phabricator.wikimedia.org/T301878 with what we discussed yesterday, lemme know if it is clear
[15:20:39] * elukey bbiab
[15:53:48] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (Ottomata) Sort of related: [[ https://docs.google.com/document/u/1/d/1uGDTExRXFw-w51F8WkmQNZwP_pCZxiu6jZv3ZPKlvdM/edit | (WIP) MediaWiki Event Carried State Tran...
[15:55:11] o/
[15:57:12] o/
[15:58:56] elukey: just saw you asking about the reverted models, these are suuuuper old editquality models, there's only a handful of them but they should run w/ our isvc model-servers
[16:00:00] morning all!
[16:00:21] mornin!
[16:08:42] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (elukey) >>! In T301878#7715052, @Ottomata wrote: > Sort of related: [[ https://docs.google.com/document/u/1/d/1uGDTExRXFw-w51F8WkmQNZwP_pCZxiu6jZv3ZPKlvdM/edit |...
[16:10:43] accraze: yep no problem! I found them later on, this morning I was confused :D Let's update the patch and deploy, looks good!
[16:10:59] I thought it would have broken the assumption for the model names, but it is generic enough
[16:12:47] nice!
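The model-name assumption mentioned at 16:10 can be illustrated with a small sketch. It assumes names follow a `<wiki>-<model type>` pattern, as in bswiki-damaging, cawiki-goodfaith and bnwiki-reverted from this log; `split_model_name` and `KNOWN_MODEL_TYPES` are hypothetical names, not taken from the inference-services code.

```python
# Model types seen in this log; the real set may be larger (assumption).
KNOWN_MODEL_TYPES = frozenset({"damaging", "goodfaith", "reverted"})


def split_model_name(name: str) -> tuple[str, str]:
    """Split e.g. 'bnwiki-reverted' into ('bnwiki', 'reverted')."""
    # Split on the last dash so a wiki name containing dashes stays intact.
    wiki, sep, model_type = name.rpartition("-")
    if not sep or not wiki:
        raise ValueError(f"expected '<wiki>-<model type>', got {name!r}")
    if model_type not in KNOWN_MODEL_TYPES:
        raise ValueError(f"unknown model type {model_type!r} in {name!r}")
    return wiki, model_type


print(split_model_name("bnwiki-reverted"))  # prints ('bnwiki', 'reverted')
```

Keeping the allowed types in one set means a new type like "reverted" only needs a one-line change rather than a new code path.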
[16:13:06] also will add a constant to the editquality 500 fix real quick and then it should be ready
[16:13:28] accraze: it is something that we can do later, it was just cosmetics
[16:13:46] ah gotcha
[16:23:05] (PS2) Accraze: editquality: fix double preprocess call [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/762937 (https://phabricator.wikimedia.org/T301412)
[16:23:29] elukey: meh just fixed it while i was thinking about it
[16:24:52] accraze: ah ok.. FEATURE_VAL_KEY maybe ?
[16:24:55] what do you think?
[16:25:31] (CR) Accraze: editquality: fix double preprocess call (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/762937 (https://phabricator.wikimedia.org/T301412) (owner: Accraze)
[16:25:49] (CR) Kevin Bazira: [C: +2] "LGTM" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/762937 (https://phabricator.wikimedia.org/T301412) (owner: Accraze)
[16:26:48] nevermind :)
[16:26:57] lol :D
[16:27:14] too fast!
[16:27:51] honestly it would be great to make a base revscoring model server
[16:28:10] 3/4 isvcs work very very similarly now
[16:28:17] yep I agree
[16:29:00] (with the exception being articlequality)
[16:35:07] (Merged) jenkins-bot: editquality: fix double preprocess call [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/762937 (https://phabricator.wikimedia.org/T301412) (owner: Accraze)
[16:35:44] sorry elukey, I hadn't yet seen your suggestion: FEATURE_VAL_KEY 🙏
[16:39:47] kevinbazira: I am totally mad at you now! :D :D Joking, no problem it was just a thought :)
[16:40:09] 🙏 🙏 🙏
[16:40:33] xD
[16:41:42] accraze: do you want to deploy the new models?
[16:42:01] I mean the ones containing reverted etc..
[16:42:22] ahh yeah lemme update CR real quick
[16:45:20] im gonna update the image version in the CR too if that's cool?
[16:47:39] ugh hold up... merge conflict while rebasing locally
[16:50:21] yep Kevin added models this morning
[17:04:33] elukey: new patch is up https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/762533
[17:10:25] accraze: ready to deploy :)
[17:13:42] deploying now
[17:16:51] all pods seem to be up and running, going to test prediction now
[17:19:01] awesome! all new isvcs work great, bnwiki-reverted too!
[17:19:56] NICE!
[17:20:07] super :)
[17:20:28] ok so full-steam ahead on editquality migration?
[17:20:34] accraze: we could simplify the new config format a little but it seems a good compromise for the moment, what do you think?
[17:20:51] yeah i think the new abstraction works well for now
[17:21:17] +1 for loading all the models
[17:21:34] yeah, load them all
[17:21:41] there is still the question mark about the load testing, but once we find a good compromise for one we'll roll it out in every place
[17:26:23] just counted we have 9 models on lift wing now, only 101 to go ;)
[17:28:21] but we're getting good processes in place now, which is great
[17:40:23] (PS2) Accraze: editquality: handle http bad request [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/762933 (https://phabricator.wikimedia.org/T300270)
[17:42:27] (CR) jerkins-bot: [V: -1] editquality: handle http bad request [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/762933 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[17:57:07] (PS3) Accraze: editquality: handle http bad request [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/762933 (https://phabricator.wikimedia.org/T300270)
[17:58:53] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (ACraze)
[17:58:55] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Return meaningful HTTP responses in Lift Wing's revscoring backends - https://phabricator.wikimedia.org/T300270 (ACraze) Open→In progress
[18:02:18] (CR) Elukey: editquality: handle http bad request (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/762933 (https://phabricator.wikimedia.org/T300270) (owner: Accraze)
[21:41:15] (PS4) Accraze: editquality: handle http bad request [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/762933 (https://phabricator.wikimedia.org/T300270)
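The idea behind "handle http bad request" (answer malformed input with a 400 instead of letting it surface as a 500) might look roughly like the sketch below. It is not the code from the Gerrit change: the `rev_id` field and the `check_request` helper are assumptions made for illustration, and only the standard library is used.

```python
from http import HTTPStatus


def check_request(payload):
    """Validate an incoming request body before any model code runs.

    Returns (status, error_message); the message is empty when the input is OK.
    The 'rev_id' field name is an assumption, not confirmed by the log.
    """
    if not isinstance(payload, dict):
        return HTTPStatus.BAD_REQUEST, "request body must be a JSON object"
    rev_id = payload.get("rev_id")
    if rev_id is None:
        return HTTPStatus.BAD_REQUEST, "missing required field 'rev_id'"
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rev_id, bool) or not isinstance(rev_id, int) or rev_id <= 0:
        return HTTPStatus.BAD_REQUEST, "'rev_id' must be a positive integer"
    return HTTPStatus.OK, ""


status, message = check_request({"rev_id": "abc"})
print(int(status), message)  # prints 400 'rev_id' must be a positive integer
```

Validating up front keeps client mistakes distinguishable from genuine server-side failures in logs and dashboards.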