[06:45:17] Good morning o/
[08:13:12] morning!!
[08:17:37] hi Aiko!
[08:27:53] Machine-Learning-Team: Investigate why article-descriptions LiftWing API returns 404 when encoded colon is used in request URL - https://phabricator.wikimedia.org/T365439 (kevinbazira) NEW
[08:29:16] Machine-Learning-Team, Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123#9815533 (kevinbazira) Thank you for reporting this issue @Dbrant. We have been able to reproduce it an...
[08:39:24] Morning!
[08:42:02] o/
[08:47:46] Machine-Learning-Team: Investigate why article-descriptions LiftWing API returns 404 when encoded colon is used in request URL - https://phabricator.wikimedia.org/T365439#9815614 (isarantopoulos) The internal urls also behave properly so it seems that the issue is not on the Lift Wing side but has to do with...
[09:20:23] Machine-Learning-Team: Investigate why article-descriptions LiftWing API returns 404 when encoded colon is used in request URL - https://phabricator.wikimedia.org/T365439#9815715 (hnowlan) I suspect the fix for this is a relatively small change on the API gateway, but the change is a global one so I will nee...
[09:24:27] Machine-Learning-Team: Investigate why article-descriptions LiftWing API returns 404 when encoded colon is used in request URL - https://phabricator.wikimedia.org/T365439#9815722 (kevinbazira) a:kevinbazira→hnowlan
[09:25:32] Machine-Learning-Team: Investigate why article-descriptions LiftWing API returns 404 when encoded colon is used in request URL - https://phabricator.wikimedia.org/T365439#9815725 (kevinbazira) Thank you for looking into this @hnowlan. I've assigned the task to you.
[09:58:04] Machine-Learning-Team, Language-Team, Epic: Migrate Content Translation Recommendation API to Lift Wing - https://phabricator.wikimedia.org/T308164#9815883 (Pginer-WMF) >>! In T308164#9812963, @kevinbazira wrote: > Hi @Pginer-WMF, do you have an estimate of the expected traffic the Content and Sectio...
[10:23:01] * klausman lunch
[10:51:12] * isaranto lunch!
[11:34:54] Machine-Learning-Team, Foundational Technology Requests: Content Translation Recommendations API - https://phabricator.wikimedia.org/T293648#9816202 (Pginer-WMF) [A recent report](https://kcvelaga.quarto.pub/cx-deletion-rate-variables-2024/) shows how translations based on longer source articles are more...
[12:29:11] Machine-Learning-Team, Patch-For-Review: Update Pytorch base image to 2.3.0 - https://phabricator.wikimedia.org/T365166#9816362 (isarantopoulos) Images seem to become more bloated so I am exploring the option to install pytorch-rocm with `--no-dependencies` option and handle dependencies manually either...
[13:44:35] 1431
[13:44:43] definitely
[13:44:44] :)
[13:45:11] oops XD
[13:53:32] Machine-Learning-Team, Patch-For-Review: Update Pytorch base image to 2.3.0 - https://phabricator.wikimedia.org/T365166#9816573 (isarantopoulos) As it turns out the above approach won't cut it. Even without the dependencies the compressed image with pytorch 2.3.0 and rocm 6.0 is 4.36GB. This is the list...
[13:53:40] 1234
[13:53:41] :)
[13:54:32] I think I'll need to move out of my house to make some space for pytorch/ROCm :(
[13:55:25] * isaranto makes bad jokes
[13:56:18] well one last try would be to build pytorch ourselves but I don't think that it would take us to a better place
[13:56:45] klausman: I remember you tried it but you managed to save only several hundred MBs, right?
[13:57:14] yeah, and that was a lot of time spent on upx. Plus, it was unclear if it even would have worked (upx can break binaries)
[13:58:58] yes found it! https://phabricator.wikimedia.org/T359569#9654014
[13:59:39] isaranto: when you compress the docker image you do all the layers at once, so the 4.36GB figure may not be related to a single layer
[13:59:55] we pushed torch 2.3 with the llm image, and the docker registry accepted it
[14:00:04] so I am reasonably sure that it should work
[14:00:30] aha you're right! it is just the closest we can get to understand layer size, right?
[14:00:31] long term I'd open a github issue to either ROCm or Pytorch upstream to figure out if they can trim down the libs
[14:00:37] exactly yes
[14:02:08] cool I'll proceed with the patch then for 6.0!
[14:08:01] Machine-Learning-Team, Goal: 2024 Q4 Goal: Revert Risk models are supported by caching in production - https://phabricator.wikimedia.org/T362672#9816632 (calbon) Update: - Trying to fix up a Calico networking issue in Kubernetes - After credentials, will send patched revert risk server to ml-staging
[14:09:18] Machine-Learning-Team, Patch-For-Review: Update kserve and knative-serving charts for new-style Calico network policies - https://phabricator.wikimedia.org/T365479#9816639 (klausman)
[14:16:38] Machine-Learning-Team, Goal: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU - https://phabricator.wikimedia.org/T362670#9816658 (calbon) Update: - Still can't use GPU with ROCm. But we figured out what the bug is - if the control version is upgraded to Bookworm i...
[14:18:07] Machine-Learning-Team, Goal: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services - https://phabricator.wikimedia.org/T362674#9816665 (calbon) - Calico improvements make the whole workflow more streamlined -
[14:22:29] Machine-Learning-Team: Investigate temporary high latency in revscoring service for wikidata - https://phabricator.wikimedia.org/T360894#9816709 (klausman) Open→Resolved Since this has not re-occurred, I am closing the task for now. If it happens again, we can always re-open.
[14:23:10] Machine-Learning-Team, Patch-For-Review: Update kserve and knative-serving charts for new-style Calico network policies - https://phabricator.wikimedia.org/T365479#9816718 (klausman)
[14:33:48] Machine-Learning-Team, serviceops, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9816857 (calbon)
[14:33:57] Machine-Learning-Team, serviceops, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9816852 (calbon) a:elukey
[14:34:11] Machine-Learning-Team, Patch-For-Review: Update Pytorch base image to 2.3.0 - https://phabricator.wikimedia.org/T365166#9816865 (calbon)
[14:34:15] Machine-Learning-Team, Patch-For-Review: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) - https://phabricator.wikimedia.org/T365246#9816866 (calbon)
[14:35:11] artificial-intelligence, Reconciliation, Technical-Tool-Request: Alternative, affordable, lower-barrier approach(es) to reconciliation - https://phabricator.wikimedia.org/T362149#9816881 (Sj) A family of solutions here, or even one flexible one, would be very impactful. But @Spinster perhaps this...
[14:37:04] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team (Kanban), MW-1.43-notes (1.43.0-wmf.5; 2024-05-14): Exclude first revision on page from scoring - https://phabricator.wikimedia.org/T356281#9816893 (Samwalton9-WMF) In progress→Resolved
[14:38:05] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#9816898 (klausman) Repooled the machine: ` $ sudo confctl select 'name=ml-serve2002.codfw.wmnet' set/pooled=yes codfw/ml_serve/kubesvc/ml-serve2002.codfw.wmnet: pooled...
[14:44:27] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#9816920 (Jhancock.wm) not seeing any alerts right now. I'll keep an eye on it. if it stays up until tomorrow I'll close the ticket. thanks!
[14:44:37] Machine-Learning-Team: Investigate a way to return other 2xx status code from predict in kserve - https://phabricator.wikimedia.org/T365226#9816922 (calbon) a:achou
[14:46:26] Machine-Learning-Team: Have problem with migrating to LiftWing from ores - https://phabricator.wikimedia.org/T364089#9816947 (calbon) a:isarantopoulos
[14:47:16] Machine-Learning-Team, Structured-Data-Backlog: Pass the maximum number of uploads to the logo detection service - https://phabricator.wikimedia.org/T363505#9816954 (calbon) a:kevinbazira
[14:49:31] Machine-Learning-Team, Goal: 2024 Q4: Users can "pip install liftwing" and access 20% of models - https://phabricator.wikimedia.org/T359140#9816978 (calbon) People can now pip install and use models. Right now we only have a few models - the number of models should increase over time.
[15:15:14] elukey: are we ok to proceed with building the image? https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1032725
[15:16:06] isaranto: I think so yes
[16:19:36] Machine-Learning-Team: Have problem with migrating to LiftWing from ores - https://phabricator.wikimedia.org/T364089#9817716 (isarantopoulos) @AgnesAbah have you managed to resolve the issue? As Kosta mentioned there isn't anything there related to Lift Wing but with the MediaWiki Action API.
[16:19:48] logging off folks, have a nice evening /rest of day!
[16:21:14] \o
[16:49:30] bye Ilias!
[19:14:38] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#9818646 (Dzahn) p:Triage→Low
[19:15:07] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#9818647 (Dzahn) a:Jhancock.wm