[00:09:34] Machine-Learning-Team, Data-Engineering-Planning, Event-Platform Value Stream, Observability-Logging, observability: Evaluate Benthos as stream processor - https://phabricator.wikimedia.org/T319214 (Volans) @elukey thanks a lot for this live data! That's awesome! I went to the Data Engineerin...
[08:03:30] Lift-Wing, Machine-Learning-Team: Match model-server dockerfiles with blubber files - https://phabricator.wikimedia.org/T289127 (kevinbazira) With the advent of a new syntax directive that enables users to run blubber files using the `docker build` command, we will not need to update dockerfiles to match...
[08:04:20] good morning :)
[08:17:29] (CR) Kevin Bazira: [C: +2] articlequality: add support for sfn templates to fawiki [services/ores/deploy] - https://gerrit.wikimedia.org/r/856500 (https://phabricator.wikimedia.org/T319373) (owner: Kevin Bazira)
[08:17:53] (CR) Kevin Bazira: [V: +2 C: +2] articlequality: add support for sfn templates to fawiki [services/ores/deploy] - https://gerrit.wikimedia.org/r/856500 (https://phabricator.wikimedia.org/T319373) (owner: Kevin Bazira)
[08:29:43] kevinbazira: o/
[08:29:53] elukey: o/
[08:30:00] I checked https://ores-beta.wmflabs.org to see if beta was online and it seems down
[08:30:03] sigh
[08:30:28] oh ok ... I was planning to deploy on it
[08:30:31] :/
[08:30:54] yeah I guessed that
[08:30:59] lemme see if it is a transient issue
[08:31:05] (the host is deployment-ores02.deployment-prep.eqiad1.wikimedia.cloud)
[08:31:07] IIRC
[08:31:16] Ok, thanks!
[08:32:50] mmm weird the host seems working
[08:33:09] maybe it is only the endpoint
[08:36:04] works now :) https://ores-beta.wmflabs.org/
[08:36:15] the web-proxy in horizon for deployment-prep wasn't updated
[08:36:22] kevinbazira: green light to test
[08:36:41] Thanks for the green light Luca.
[08:36:47] Deploying on beta now ...
[08:38:00] kevinbazira: remember the submodule update --init thing
[08:38:08] yep
[08:38:11] that is important in this case (after updating the repo etc..)
[08:38:12] super
[08:38:16] happy deployment :)
[08:38:36] isaranto: o/ if you want to know more about --^ we can add details
[08:38:54] Kevin is going to deploy https://gerrit.wikimedia.org/r/c/mediawiki/services/ores/deploy/+/856500/ to our "beta" environment
[08:39:09] that is called "deployment-prep", it is in Horizon (shared with other teams)
[08:39:47] the change is related to a git submodule change: a new model version has been pushed to the submodule's repo, and now we want to update it
[08:40:00] (update the main repo I mean)
[08:40:09] the ORES set up is.. not great
[08:40:20] basically a big deploy repo with a ton of submodules
[08:40:36] each of them lives in a separate repo, and hosts model binaries via git-lfs
[08:41:36] * elukey coffee
[09:08:21] kevinbazira: o/
[09:08:29] so the deployment steps are all in https://wikitech.wikimedia.org/wiki/ORES/Deployment#Deploy_to_beta
[09:08:41] in theory they should be up to date, lemme know if anything doesn't work
[09:12:51] Yep, those are the steps I am following.
[09:13:05] Ack!
[09:13:30] When I got to "git pull && git submodule update --init" apparently there is a commit required. I've sent the screenshot on slack.
[09:14:28] kevinbazira: are you on deployment-deploy03 ?
[09:14:42] yep
[09:14:45] as far as I remember indeed u have to make a commit once u update a submodule
[09:15:29] isaranto: yep yep basically https://gerrit.wikimedia.org/r/c/mediawiki/services/ores/deploy/+/856500/
[09:15:43] kevinbazira: now I get what you mean, you are saying that git status shows some diffs
[09:15:58] yep
[09:16:16] okok maybe the repo wasn't clean
[09:16:19] lemme check
[09:18:11] kevinbazira: ok so I did the following
[09:18:27] 1) Removed the extra DEADJOE file, not sure who placed it there. I suspect some releng test
[09:18:34] 2) git reset --hard origin to clean up
[09:18:43] 3) git submodule update --init
[09:18:46] and now it looks clean
[09:18:49] lemme know
[09:20:08] great. thanks a lot. let me proceed.
[09:24:34] kevinbazira: let's also run https://wikitech.wikimedia.org/wiki/ORES/Deployment#Running_tests afterwards to sanity check the whole status of ORES
[09:25:13] Ok... I'll do that. In the docs, do these messages mean the same thing?
[09:25:13] 4. Record the NEWHASH at the top of git log -1
[09:25:14] 5. Record the new revision (NEWHASH)
[09:26:24] I think so yes
[09:27:08] we can probably skip the !log to wikimedia-cloud to be honest
[09:27:22] they don't really own deployment-prep
[09:27:23] hihi the docs are confusing. Ok, skipped :)
[09:28:09] I removed the steps from the dos
[09:28:12] *docs
[09:30:19] Thanks, deployment is running ...
[09:32:09] deployment completed, running sanity checks now ...
[09:33:32] All assertions passed.
[09:35:37] nice! I'd test also a fawiki prediction just to be sure that it works as expected
[09:54:47] The fawiki prediction on beta: https://ores-beta.wmflabs.org/v3/scores/fawiki?models=articlequality&revids=35130784
[09:54:47] matches the results we got when evaluating the new model: https://phabricator.wikimedia.org/T317531#8362584
[09:54:47] The fawiki prediction in prod is still showing results of the old model: https://ores.wikimedia.org/v3/scores/fawiki?models=articlequality&revids=35130784
[09:54:47] now going to deploy to prod ...
[10:06:05] deployment running ...
[10:10:00] canary smoke tests passed ... proceeding to deploy to all groups.
[10:18:55] Woohoo! Prod deployment completed and fawiki prediction https://ores.wikimedia.org/v3/scores/fawiki?models=articlequality&revids=35130784 now matches the results we got when evaluating the new model: https://phabricator.wikimedia.org/T317531#8362584
[10:23:16] kevinbazira: nice work!
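Editor's sketch of the 09:18 cleanup, run against a throwaway repository rather than the real deploy checkout. The repo path, file names and commit message are hypothetical; `git reset --hard HEAD` stands in for the `git reset --hard origin` used in the chat, since this toy repo has no remote, and the submodule step is only noted in a comment.

```shell
#!/bin/sh
# Hypothetical throwaway repo standing in for the ORES deploy checkout.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "dev@example.org"
git config user.name "Dev"
echo "v1" > config.yaml
git add config.yaml
git commit -q -m "initial state"

# Dirty the checkout: a stray editor file plus a local edit to a tracked file.
touch DEADJOE
echo "local edit" >> config.yaml

# Cleanup: remove the untracked file, then discard local changes.
rm -f DEADJOE
git reset -q --hard HEAD   # the chat used "origin" because that checkout tracks a remote
# In a checkout with submodules, `git submodule update --init` would follow here.

status=$(git status --porcelain)
echo "status after cleanup: '${status}'"
```

An empty `git status --porcelain` is the "and now it looks clean" state mentioned at 09:18:46.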
[10:23:57] Thanks to you for your help, always 🙏🙏🙏
[10:27:08] Morning!
[10:27:40] Got a bit sidetracked watching the Artemis launch :)
[10:30:31] kevinbazira: nice!!! \o/
[10:43:23] Machine-Learning-Team: Deploy new fawiki articlequality model to ORES and LiftWing - https://phabricator.wikimedia.org/T319373 (kevinbazira)
[10:47:23] let's remember to check https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&refresh=1m after a deployment as well
[10:48:02] also, kevinbazira - if you have time let's run the same tests that we ran in beta on deploy1002
[10:48:08] just to verify that all works etc..
[10:53:29] basically
[10:53:29] elukey@deploy1002:~$ httpbb /srv/deployment/httpbb-tests/ores/test_ores.yaml --host=ores1001.eqiad.wmnet --http_port 8081
[10:53:32] Sending to ores1001.eqiad.wmnet...
[10:53:33] all good :)
[10:53:36] PASS: 124 requests sent to ores1001.eqiad.wmnet. All assertions passed.
[11:13:17] \o/
[11:32:55] * elukey lunch
[11:33:54] ditto
[12:41:10] (CR) AikoChou: Refactor revscoring model servers (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[13:10:23] /o anybody use the Gerrit review tool in pycharm or vscode? I’m trying to set it up but it won't connect (I get a 404)
[13:40:43] nope, I don't use it in an editor ... the terminal usually suffices
[13:55:38] Same here, I do my reviews on the web UI unless it's super involved
[13:57:00] Good morning all!
[13:59:54] Heyo :)
[14:04:03] FYI, I'm switching ml-etcd1003 to DRBD temporarily, latencies will go up a bit
[14:05:04] ack
[14:35:52] isaranto: I use vscode to do code changes/submit patches and use web UI to write comments. If you want to download someone else's patch, you can use git review -d changeNumber in vscode and then you can view the patch locally and test it.
[14:38:36] ml-etcd1003 is back to "plain" disks
[14:39:05] thanks!
[14:55:07] aiko: thanks! I was trying to set up the gerrit plugin but couldn't, so I’ll go good old terminal for checking out patches :)
[14:55:26] (PS9) Elukey: Refactor revscoring model servers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374)
[15:04:04] (PS10) Elukey: Refactor revscoring model servers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374)
[15:04:34] (CR) Elukey: Refactor revscoring model servers (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[15:06:18] (PS11) Elukey: Refactor revscoring model servers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374)
[15:07:50] aiko: thanks for the review :)
[15:36:01] I see we only have 1 commit per patch in gerrit. How does the team work with this? do we do `git amend` on our initial commit or do we squash the rest of the commits into 1?
[15:46:41] isaranto: we use `git commit --amend` when we amend the patch
[15:46:55] https://www.mediawiki.org/wiki/Gerrit/Tutorial#Submit_a_patch
[15:46:59] https://www.mediawiki.org/wiki/Gerrit/Tutorial#Amending_a_change_(your_own_or_someone_else's)
[15:48:38] ack
[16:46:04] * elukey afk for a bit
[16:56:18] where would be the best way to ask if others are facing issues building docker images on M1 macs with blubber?
[17:00:19] * best place not way :D
[17:02:24] isaranto: I found someone has reported the issue on Phab https://phabricator.wikimedia.org/T318866
[17:02:53] is it the same issue?
[17:04:46] Guys.
[17:04:55] aiko: exactly same issue (on a different step). will follow the gitlab MR from releng. Thanks a lot!
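Editor's sketch of the `git commit --amend` workflow mentioned at 15:46, in a throwaway repository. The repo, file name, commit message and Change-Id value are made up for illustration; the point is only that amending rewrites the single commit (new hash, same Change-Id trailer) instead of adding a second one.

```shell
#!/bin/sh
# Hypothetical local repo; nothing here talks to a real Gerrit server.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "dev@example.org"
git config user.name "Dev"

echo "first draft" > patch.txt
git add patch.txt
# Gerrit identifies a change by its Change-Id trailer; this value is invented.
git commit -q -m "Refactor model servers" \
  -m "Change-Id: I0000000000000000000000000000000000000000"
first=$(git rev-parse HEAD)

# Address review feedback by rewriting the same commit, not stacking a new one.
echo "second draft" > patch.txt
git add patch.txt
git commit -q --amend --no-edit
second=$(git rev-parse HEAD)

count=$(git rev-list --count HEAD)
trailer=$(git log -1 --format=%B | grep '^Change-Id:')
echo "commits on branch: $count"
echo "$trailer"
```

With git-review installed and a Gerrit remote configured, running `git review` after the amend would presumably upload the rewritten commit as the next patch set, which matches the PS9, PS10, PS11 sequence in the log.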
[17:05:01] I just successfully queried the NLLB200 model on AWS
[17:05:36] There is still a ton of stuff to do (request routing, ACLs etc), but at least I can say that the model deployment works in principle
[17:06:36] $ aws sagemaker-runtime invoke-endpoint --region us-east-1 --endpoint-name nllb200-staging --content-type application/json --accept application/json --body '{ "uid": 1, "sourceText": "In their natural, unprocessed, whole grain form, cer eals are a rich source of vitamins, minerals, carbohydrates, fats, oils, and pro tein. When processed by the removal of the bran, and germ, the remaining
[17:06:38] endospe rm is mostly carbohydrate.", "sourceLanguage": "eng", "targetLanguage": "ibo"}' result > /dev/null;cat result;echo
[17:06:40] {"id": 1, "translatedText": "N'ụdị ha, n'ozuzu, cer eals bụ isi iyi bara ọgaranya nke na-agụnye, carb, fats, na pro tien. Mgbe nke na-ekwu, na-ekwu na ọ bụ carbīrate."}
[17:06:46] Sorry for the mangled paste
[17:07:09] Anyone here speak Igbo?
[17:09:24] chrisalbon: ^^^
[17:17:31] nice!
[17:17:51] Now I "just" need to get external requests routed there
[17:18:22] Machine-Learning-Team, Data-Engineering-Planning, Event-Platform Value Stream, Observability-Logging, and 2 others: Evaluate Benthos as stream processor - https://phabricator.wikimedia.org/T319214 (Volans) @fgiunchedi @elukey I seeing some strange behaviour of the data in the dashboard, not sure...
[17:54:17] (CR) AikoChou: [C: +1] "one more comment :) other than that, looks good to me!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/856520 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[18:25:09] * elukey afk!
[18:32:29] same
[20:34:02] Machine-Learning-Team, Data-Engineering-Planning, Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (Isaac) @Ottomata recognizing that this might be long past the time when you'...