[05:37:32] Machine-Learning-Team, Goal: Q4: Lift Wing Python Package - https://phabricator.wikimedia.org/T359140#9648142 (isarantopoulos) This is the repository where the project will be hosted: [[ https://github.com/wikimedia/liftwing-python | https://github.com/wikimedia/liftwing-python ]]
[05:37:50] Good morning!
[05:46:54] Machine-Learning-Team: Create an examples directory in the repository and add a basic README.md - https://phabricator.wikimedia.org/T360593 (isarantopoulos) NEW
[06:09:36] I missed adding the CI pipeline for test in a previous commit for the huggingface image https://gerrit.wikimedia.org/r/c/integration/config/+/1013161
[06:11:04] (PS3) Ilias Sarantopoulos: fix: install pyopencl in llm and article-desc [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1013036 (https://phabricator.wikimedia.org/T360212)
[07:05:31] (CR) Ilias Sarantopoulos: [C:+2] fix: install pyopencl in llm and article-desc [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1013036 (https://phabricator.wikimedia.org/T360212) (owner: Ilias Sarantopoulos)
[07:19:51] (Merged) jenkins-bot: fix: install pyopencl in llm and article-desc [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1013036 (https://phabricator.wikimedia.org/T360212) (owner: Ilias Sarantopoulos)
[08:00:29] (PS1) Kevin Bazira: Makefile: add support for articletopic-outlink [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012663 (https://phabricator.wikimedia.org/T360177)
[08:12:00] (PS2) Kevin Bazira: Makefile: add support for articletopic-outlink [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012663 (https://phabricator.wikimedia.org/T360177)
[08:16:57] (CR) Kevin Bazira: "To make reviewing easier, here are the commands I used to test the articletopic-outlink model-server build:"
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012663 (https://phabricator.wikimedia.org/T360177) (owner: Kevin Bazira)
[10:43:07] * isaranto lunch!
[11:29:30] o/
[11:30:47] hello everyone
[11:31:06] hi!
[11:35:11] just an image update patch -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013269
[11:55:37] +1'd
[12:00:23] Danke!
[13:19:25] hello folks!
[13:19:40] o/ Luca!
[13:23:47] Buon giorno :)
[13:28:27] (CR) Ilias Sarantopoulos: "Nice work! I ran this and it works like a charm with one addition: One need to activate the virtual env and also append the cwd to the pyt" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012663 (https://phabricator.wikimedia.org/T360177) (owner: Kevin Bazira)
[13:29:07] SRE gave us the green light to increase the tmpfs size on the registry \o/
[13:29:39] yeah I just saw it!
[13:29:57] kevinbazira: nice work with the makefile, really easy to run!
[13:30:23] isaranto: thanks for the reviews :)
[13:32:49] elukey: nice work with the image, your description in the task is 🔝 which helped the discussion a lot
[13:34:29] <3
[13:39:15] Machine-Learning-Team: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images - https://phabricator.wikimedia.org/T359067#9649226 (akosiaris) >>! In T359067#9627299, @elukey wrote: > @akosiaris thanks a lot for all the details, really appreciated, now I have a better understanding o...
[13:40:03] I'm having an issue deploying stuff on ml-staging as it seems we are out of cpus https://grafana-rw.wikimedia.org/d/pz5A-vASz/kubernetes-resources?orgId=1&var-ds=thanos&var-site=codfw&var-prometheus=k8s-mlstaging
[13:40:38] I deployed article-descriptions namespace and got ` 0/4 nodes are available: 2 Insufficient cpu, 2 node(s) had taint` and a pod in pending
[13:41:50] also keep in mind that we have duplicate deployments for article-descriptions both in the namespace with the same name but also in the experimental namespace. I want to keep the latter still around for a while as I can change resources at will and run load testing
[13:42:59] klausman: --^ Do you have time?
[13:43:11] Yep
[13:44:20] Ah, yes, out of CPU resources in staging. We've run into it before, prompting my suggestion that we should look at utilization (actual CPU usage vs. alloc in the charts), because this will hit us in prod, too, eventually
[13:45:01] ack
[13:45:40] (PS3) Kevin Bazira: Makefile: add support for articletopic-outlink [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012663 (https://phabricator.wikimedia.org/T360177)
[13:46:39] isaranto: I see the nllb200 pod running in staging, is that in use atm?
[13:46:51] actually, twice. two different deployments
[13:47:03] another thing I was thinking to do is to remove some deployments from staging. However the end result is a merge of both prod and staging yaml files. If we change the chart a bit we could make it work
[13:47:15] that one we can remove!
[13:47:42] Ok, I will remove the older deployment of nllb200
[13:48:53] (CR) Ilias Sarantopoulos: "I tested the new one and it worked like a charm! really easy!"
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012663 (https://phabricator.wikimedia.org/T360177) (owner: Kevin Bazira)
[13:49:06] klausman: let's remove both I would say since no1 is using it at the moment
[13:49:09] (CR) Ilias Sarantopoulos: [C:+1] Makefile: add support for articletopic-outlink [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012663 (https://phabricator.wikimedia.org/T360177) (owner: Kevin Bazira)
[13:49:17] (CR) Kevin Bazira: "Thank you for suggesting this great idea of adding both the predictor and transformer to the Makefile. I have implemented it." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012663 (https://phabricator.wikimedia.org/T360177) (owner: Kevin Bazira)
[13:49:21] alright
[13:49:34] shall I submit a patch for it?
[13:49:40] lol wikibugs has a nice lag
[13:49:42] (CR) Kevin Bazira: [V:+2 C:+2] "Done" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1012663 (https://phabricator.wikimedia.org/T360177) (owner: Kevin Bazira)
[13:50:12] aiko: o/
[13:50:35] how big is rr-multilingual? I see that CI takes a ton of time to push a new image, and the last time it failed
[13:50:54] ah ok 10GB :D
[13:51:05] elukey: yes 10G
[13:51:14] is it because of pytorch/rocm?
[13:51:15] it has pytorch rocm, that's why
[13:51:19] ahhh okok
[13:51:28] yes so I need to work on the base image asap
[13:51:53] elukey: is there anything I can help with?
[13:52:17] isaranto: I am half done with the patch already, I'll send it once done
[13:52:22] aiko: nono all good, if CI fails it may be due to hitting docker registry's limits :(
[13:52:22] ack!
[13:54:30] elukey: ok I want to help build the base image
[13:55:27] aiko: If you are ok I can make a draft and we can discuss/review it together
[13:55:58] elukey: yes!
coool thank you
[13:56:17] isaranto: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013317
[14:01:01] isaranto: I don't quite understand the comment? You intend to remove the nllb200-cpu entirely in the future?
[14:02:22] sry. yes exactly. your patch is great (maybe comment was not needed at the moment)
[14:02:28] ack!
[14:02:45] but since it is not used we can leave it for now and revisit in the future
[14:02:56] merging and pushing in a Thessaloniki minute
[14:03:38] lol
[14:06:09] still no luck with https://github.com/pytorch/pytorch/issues/121506 however links in https://download.pytorch.org/whl/rocm5.5/ now resolve fine but not the same case with other versions (e.g. 5.6, 5.7)
[14:06:37] isaranto: pushed to staging, but found issue with the -serve bits, new patch in a sec
[14:07:29] aiko: created https://phabricator.wikimedia.org/T360638, lemme know if what I wrote makes sense!
[14:07:33] (when you have time)
[14:08:36] Machine-Learning-Team, serviceops: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637 (elukey) NEW
[14:10:58] elukey: where is the production-images repo located?
[14:12:15] Machine-Learning-Team: Create a Pytorch base image - https://phabricator.wikimedia.org/T360638 (elukey) NEW
[14:12:20] isaranto: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013321 is ready for your scrutiny
[14:12:55] Machine-Learning-Team: Create a Pytorch base image - https://phabricator.wikimedia.org/T360638#9649639 (elukey) For reference, let's keep in mind https://github.com/pytorch/pytorch/issues/121506
[14:13:18] Oh dear, now the Grafana dash shows us even more CPUs in the red (30 instead of 5...)
[14:14:00] ah, that was because two pods for art-desc (old and new) were up
[14:15:07] I think using rocm 5.x is fine
[14:16:24] ah I found it on gerrit
[14:21:48] super thanks for checking :)
[14:28:54] Machine-Learning-Team, serviceops: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9649735 (JMeybohm) Sounds good to me. I'd say you can just depool one of the active registry nodes and restart that VM for the RAM increase. No need for extra steps
[14:34:03] Good morning all
[14:34:46] morning :)
[14:35:59] Machine-Learning-Team: Create an examples directory in the repository and add a basic README.md - https://phabricator.wikimedia.org/T360593#9649769 (isarantopoulos) @Mercelisvaughan As a follow up to the first Pull request you can create another one where you'll add data validation using [[ https://docs.pyda...
[14:36:10] Morning Chris!
[14:47:17] Machine-Learning-Team, serviceops: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9649789 (MoritzMuehlenhoff) >>! In T360637#9649735, @JMeybohm wrote: > Sounds good to me. I'd say you can just depool one of the active registry nodes and restart that VM for the RAM incr...
[14:49:21] Machine-Learning-Team, serviceops: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9649813 (akosiaris)
[14:52:34] morning o/
[15:05:03] going afk folks, will be back to check later in case you need anything from me, otherwise have a nice evening o/
[15:06:36] isaranto: about the two env vars (CT2...) I removed them because when I diffed prod, it showed them not having been deployed.
[15:25:21] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446#9649946 (Jhancock.wm) Found the drive as absent in iDRAC. Physically, the drive is there but is not blinking like the other drives....
[15:35:55] (CR) AikoChou: [C:+1] "Great work! I tested the patch and it worked flawlessly :D I just have some minor comments in the README.md" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986) (owner: Ilias Sarantopoulos)
[15:53:18] elukey: I think the istio change is ready-ish, but review can wait
[15:53:44] I am heading out now (since I skipped lunch), so I'll look at it again tomorrow.
[15:54:27] ack!
[16:13:39] aiko: I created https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1013335 as first draft
[16:13:58] very straightforward approach, not sure if it is the best
[16:16:25] Machine-Learning-Team, serviceops: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9650370 (elukey)
[16:28:16] Machine-Learning-Team, serviceops: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9650464 (ops-monitoring-bot) VM registry1003.eqiad.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM
[16:35:45] elukey: o/ I'll check it out tomorrow!
[16:37:27] sure!
[16:39:58] Machine-Learning-Team, serviceops: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9650535 (ops-monitoring-bot) VM registry1004.eqiad.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM
[16:40:55] klausman: don't know why they didn't appear but we need them. I added my +1
[17:03:42] Machine-Learning-Team, serviceops: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9650697 (elukey) ` elukey@ganeti1027:~$ sudo gnt-instance list | grep registry registry1003.eqiad.wmnet kvm debootstrap+default ganeti1026.eqiad.wmnet running 6.0G reg...
[17:03:53] going afk folks!
[17:03:59] talk with you tomorrow
[17:16:40] Machine-Learning-Team, Structured-Data-Backlog: Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9650781 (mfossati) >>! In T358676#9645058, @kevinbazira wrote: > To prevent potential DOS vulnerabilities, we need to establish a limit on the number of images that c...
[18:08:17] Machine-Learning-Team, MediaWiki-extensions-ORES, Automoderator, Moderator-Tools-Team: Enable Language-agnostic revert risk model in ORES for Indonesian Wikipedia - https://phabricator.wikimedia.org/T358344#9651202 (jsn.sherman)
[18:09:24] Machine-Learning-Team, MediaWiki-extensions-ORES, Automoderator, Moderator-Tools-Team: Enable Language-agnostic revert risk model in ORES for Indonesian Wikipedia - https://phabricator.wikimedia.org/T358344#9651217 (jsn.sherman) →Duplicate dup:T352769