[05:34:32] Good morning! [05:38:21] natematias: happy that this works! the proper reference for the model cards is the link on meta wiki https://meta.wikimedia.org/wiki/Machine_learning_models [08:18:06] 06Machine-Learning-Team, 06serviceops, 13Patch-For-Review: Rename the envoy's uses_ingress option to sets_sni - https://phabricator.wikimedia.org/T346638#9807863 (10JMeybohm) [08:32:24] morning! [08:46:18] Hey Aiko o/ [08:56:38] friyayy [09:01:45] 06Machine-Learning-Team: Investigate a way to return other 2xx status code from predict in kserve - https://phabricator.wikimedia.org/T365226 (10achou) 03NEW [09:04:04] TGIF :D [09:20:04] I'm going to start counting the times I misspell pytorch and type pytroch [09:28:12] I'd like a review here whenever possible https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1032517 [09:28:43] not urgent. Luca already manually applied these changes yesterday so there shouldn't be any diff or deployment required. [09:29:59] +1 [09:34:32] Danke! [09:45:14] seems like the changes weren't there for revscoring-editquality-reverted so I just deployed them to eqiad and codfw [10:33:40] * isaranto lunch o'clock [10:49:38] * aiko lunch too! [12:14:15] (03PS1) 10Kevin Bazira: article-descriptions: refactor error messages to avoid repetition [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1032731 [12:19:06] kevinbazira: o/ shall we deploy the latest changes in logo detection and test them? [12:20:28] isaranto: o/ sure, let me prepare a patch for experimental ns [12:21:40] hello folks! [12:22:02] hello Luca! [12:23:18] sooo in order to fix ml-staging2001 I'd need to reimage it, installing Bookworm and basically taking it offline for a bit..
it may stay down if something goes wrong; this will be the first attempt to have a k8s worker on bookworm [12:23:49] now the main issue is that we'll probably not be able to deploy anything in staging since there won't be enough capacity left :( [12:24:28] we'll be fine with that. It is what we gotta do! [12:25:03] do you guys want to test logo detection first? [12:25:41] yes we can put a pause after testing logo detection [12:25:51] aiko: does that work for you as well? [12:27:35] yes, not a problem! I'm not testing anything in staging [12:28:04] I'm facing some issues with docker-pkg and production images. I'm doing a full reboot to see if it helps [12:34:25] 06Machine-Learning-Team, 13Patch-For-Review: Test if we can avoid ROCm debian packages on k8s nodes - https://phabricator.wikimedia.org/T363191#9808420 (10elukey) >>! In T363191#9805400, @elukey wrote: > In order to solve this task and T362984 we should upgrade to Bookworm, but we'd be the first ones to test i... [12:35:25] ah no sorry I will not be able to do it today, we are missing a lot of packages sigh [12:38:37] ack! [12:40:12] elukey: o/ [12:40:47] isaranto: patch is ready: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1021402 [12:40:47] please review whenever you get a minute. thanks! [12:41:08] done! [12:41:12] super fast. thanks :) [12:53:08] can someone try the `docker-pkg` command to tell me if everything works for you? `docker-pkg -c config.yaml build images/` [12:54:28] I'm getting errors like these https://phabricator.wikimedia.org/P62580 in every possible command I try. And unfortunately it seems to be some issue with the local docker daemon. Perhaps after a previous upgrade (?) [12:56:40] isaranto: the logo-detection model-server is live on LiftWing staging, here are the results of a test request I made: https://phabricator.wikimedia.org/P62581 [12:57:01] nice work! [13:00:32] isaranto: is docker running? 
It seems to be trying to connect to its unix socket [13:02:45] elukey: yes it is running "properly". I mean that everything works except for the connection from python/docker-pkg [13:03:46] I'll look into it a bit deeper. I've done a couple of updates to docker recently [13:04:46] ack [13:04:49] I also opened https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071269 [13:07:45] 06Machine-Learning-Team: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) - https://phabricator.wikimedia.org/T365246 (10isarantopoulos) 03NEW [13:07:55] 06Machine-Learning-Team: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) - https://phabricator.wikimedia.org/T365246#9808547 (10isarantopoulos) a:03isarantopoulos [13:09:32] awesome, thanks! [13:11:16] 10Lift-Wing, 06Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9808586 (10elukey) Upgrading to Bookworm is not straightforward since multiple packages need to be built etc., so I filed a bug report to Debian while we wait: https://bugs.debian.org/cgi-bi... [13:11:24] (03PS1) 10Ilias Sarantopoulos: huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246) [13:12:10] (03PS2) 10Ilias Sarantopoulos: huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246) [13:12:30] (03CR) 10CI reject: [V:04-1] huggingface: upgrade kserve to 0.13-rc0 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246) (owner: 10Ilias Sarantopoulos) [13:13:13] (03CR) 10Ilias Sarantopoulos: "The image has been tested with Bookworm and works as expected." 
[machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1032777 (https://phabricator.wikimedia.org/T365246) (owner: 10Ilias Sarantopoulos) [13:19:20] isaranto: o/ I remember I encountered the docker-pkg problem before [13:24:49] https://forums.docker.com/t/docker-errors-dockerexception-error-while-fetching-server-api-version-connection-aborted-filenotfounderror-2-no-such-file-or-directory-error-in-python/135637/7 [13:25:04] not sure if it is the same problem, but it was solved by 1) setting "Allow the default Docker socket to be used" under advanced settings in docker desktop and 2) setting DOCKER_HOST [13:26:00] aiko: ahhh yes I recall! [13:26:35] it is probably it, because IIRC the docker unix socket wasn't in the place docker-pkg expected it [13:26:38] isaranto: ---^ [13:26:50] thank you both! [13:28:05] 06Machine-Learning-Team, 13Patch-For-Review: Set automatically libomp's num threads when using Pytorch - https://phabricator.wikimedia.org/T360111#9808671 (10elukey) The new endpoint has been rolled out as part of the migration to the mw-int-ro endpoint, task done! [13:28:37] 06Machine-Learning-Team, 13Patch-For-Review: Test if we can avoid ROCm debian packages on k8s nodes - https://phabricator.wikimedia.org/T363191#9808672 (10elukey) a:03elukey [13:28:49] 10Lift-Wing, 06Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9808677 (10elukey) a:03elukey [13:32:00] because Luca your first reaction was the same haha I remember you also asked me if my docker is up [13:33:13] Good morning all [13:33:39] morning Chris! [13:33:44] o/ [13:34:43] hi Chris o/ [13:36:38] solved it! 
So after some docker updates the following setting was disabled -> Docker settings -> Advanced -> Allow the default Docker socket to be used [13:49:42] 06Machine-Learning-Team, 06serviceops, 07Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253 (10elukey) 03NEW [13:50:37] 06Machine-Learning-Team, 06serviceops, 07Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9808792 (10elukey) [13:52:16] the new version of the base torch image is 15.9GB vs 10.x GB for the previous one [13:54:19] * elukey sigh [13:54:36] compressed should be ok right? I recall that for the llm image we were good [13:58:45] 06Machine-Learning-Team, 06serviceops, 07Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9808809 (10JMeybohm) For {T362408} we're planning to backport containerd from bookworm to bullseye. Maybe it would be feasible to backport runc as well (althoug... [14:00:21] hopefully yes. testing now [14:03:22] 06Machine-Learning-Team, 06serviceops, 07Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9808825 (10elukey) ML would be very happy to test the 6.x kernel since the GPU drivers are shipped directly with it, so we'd get a nice bump to those as well. I... [14:10:24] ouch! compressed image is 4.86GB [14:10:56] I'm going to see what size the other base images were when compressed [14:17:29] I'm logging off for the weekend folks. Have a nice rest of the day + weekend [14:18:18] o/ [14:28:55] night Isaranto! [15:16:08] folks I found that https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1018993 was not deployed when I moved all our isvcs to mw-api-ro-int [15:16:20] it still works, but the ServiceOps team is reducing capacity for api-ro [15:16:52] mmm wait a min, how come it works? 
[15:18:19] yeah it doesn't [15:18:28] but at this point we don't have it in our httpbb tests [15:21:23] okok it does work, but it is weird why it does [15:21:58] I am taking responsibility for deploying on a Friday; it's not clear why it works, but it is brittle [15:23:14] done! [15:23:41] sorry for the issue, the isvcs were working but with a weird istio config, so now all is good [16:01:52] np! but we have it in our httpbb tests, right? [16:03:51] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/httpbb/liftwing/test_liftwing_production.yaml#176 [16:04:51] aiko: we do yes! api-ro.wikimedia.org was configured and somehow it was still working [16:05:04] so I changed the istio config and now it uses mw-api-ro-int properly [16:09:02] I see [16:43:49] going afk for the weekend folks! [16:44:01] have a nice rest of the day and weekend o/ [16:50:29] o/ bye Luca [17:01:16] 06Machine-Learning-Team, 06Language-Team, 07Epic: Migrate Content Translation Recommendation API to Lift Wing - https://phabricator.wikimedia.org/T308164#9809590 (10Isaac) @ngkountas I think you're right. Let me know if you have a task for making the switch because once you all have completed the transfer an... [17:24:12] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 07User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144#9809677 (10Trizek-WMF) [17:24:17] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 07User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144#9809678 (10Trizek-WMF) [17:24:23] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 07User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144#9809674 (10Trizek-WMF) We aren't ready to run the script to populate the suggestions. Given our backlog, we are moving this task to our... [18:49:22] logging off! 
have a nice weekend :) [19:40:16] 06Machine-Learning-Team, 06DC-Ops, 10ops-codfw: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291 (10ssingh) 03NEW [19:44:42] 06Machine-Learning-Team, 06DC-Ops, 10ops-codfw: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#9810066 (10ssingh) Host is depooled: ` 19:38:37 <+logmsgbot> !log dzahn@cumin1002 conftool action : set/pooled=no; selector: name=ml-serve2002.codfw.wmnet ` [23:47:04] * natematias will be leaving shortly, since all my questions have been answered thanks to the generosity of folks in this room. Thank you so much, and thanks for your contributions to machine learning across things that the Wikimedia Foundation manages and maintains. I expect it’s the kind of thing that is largely thankless unless something is broken or someone has an unformed demand, perhaps especially at this moment of AI hype. So I want you to [23:47:05] * natematias know that I see and value what you do and how important it is across the movement <3