[07:58:13] good morning!
[08:11:56] good morning folks
[09:37:20] Lift-Wing, Machine-Learning-Team, OKR-Work, Patch-For-Review: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295#10495938 (dcausse) >>! In T382295#10489855, @isarantopoulos wrote: > - According to the [[ https://gerrit.wikimedia.org...
[09:56:47] I'm having some permission issues with a directory in the hf_cache in ml-lab
[09:58:42] kevinbazira: could you delete the dir `/srv/hf-cache/hub/datasets--allenai--c4`. I see you are the owner and I'd like to re-download
[10:00:42] isaranto: o/ I've deleted `/srv/hf-cache/hub/datasets--allenai--c4`.
[10:00:50] thank you let me try again
[10:01:12] 👍
[10:08:43] Lift-Wing, Machine-Learning-Team, OKR-Work, Patch-For-Review: Create event stream for article-country model-server hosted on LiftWing - https://phabricator.wikimedia.org/T382295#10496113 (dcausse) >>! In T382295#10490807, @Isaac wrote: > Thanks all for working this out! I know a lot of moving par...
[10:27:17] Morning!
[10:40:28] hi Tobias!
[11:46:17] (PS1) Kevin Bazira: events: add support for the weighted tags event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1114355 (https://phabricator.wikimedia.org/T382295)
[12:10:54] (CR) Kevin Bazira: "This patch has been tested, and its inputs/outputs can be seen here: https://phabricator.wikimedia.org/P72464" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1114355 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[12:44:58] my patch (fix) for GPTQModel already made its way to a new release :D
[12:45:00] https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.4
[12:47:19] for once, "upstream makes releases all the time" works in our favor :)
[13:09:20] klausman: o/ shall we merge and test https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1114012 in staging?
[13:09:50] and afterwards knative, but we can keep things separated
[13:21:01] seems like my work on ml-lab ate all the disk space for now
[13:38:52] klausman: when you have some time could u please delete these dirs from /srv/hf_cache/hub?
[13:39:01] ```
[13:39:01] models--google--flan-t5-xl
[13:39:01] models--Orion-zhen--aya-expanse-32b-AWQ
[13:39:01] models--TheBloke--zephyr-7B-beta-AWQ
[13:39:01] models--TheBloke--Mistral-7B-v0.1-AWQ
[13:39:02] models--TheBloke--Mistral-7B-Instruct-v0.1-AWQ
[13:39:02] models--CohereForAI--aya-23-8B
[13:39:03] ```
[13:59:40] (CR) DCausse: "thanks!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1114355 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[14:32:45] * isaranto going afk for an errand - bbl
[15:14:18] Machine-Learning-Team, Patch-For-Review: Issues with Reference Need and Reference Risk models - https://phabricator.wikimedia.org/T384172#10497361 (achou) @MunizaA thanks for taking care of this. :) To download the feature.db from Research's swift, I used this url: `bash curl '
Machine-Learning-Team, ORES, Edit-Review-Improvements-RC-Page, Growth-Team, Regression: [regression-wmf.20] Recent changes filters disappear from the menu - https://phabricator.wikimedia.org/T290113#10497418 (Aklapper) @Etonkovidova Does that mean this task should be closed?
[15:49:25] isaranto: done! (sorry for late response)
[15:49:51] thank you!
[15:50:26] elukey: yes, we can do that now (or tomorrow, if you prefer)
[15:50:32] phew , that cleared 60G
[15:51:23] I wish there was an easy way to delete unused models from the shared cache, but I don't see a way of doing so
[15:52:54] klausman: yes I have time, you can go ahead merging and deploying kserve to staging if you are not busy (or I'll do it no issue)
[15:53:13] will do in ~5m
[16:01:19] Machine-Learning-Team, ORES, Edit-Review-Improvements-RC-Page, Growth-Team, Regression: [regression-wmf.20] Recent changes filters disappear from the menu - https://phabricator.wikimedia.org/T290113#10497525 (matmarex) This has always been intermittent… So I'm not sure if we can confidently s...
[16:04:20] Lift-Wing, Machine-Learning-Team, Patch-For-Review: [onboarding] Update revertrisk to kserve 0.14.1 - https://phabricator.wikimedia.org/T383119#10497549 (gkyziridis) I think there are some issues with the memory allocation in some of the pods that handling the CI/CD of the [[ https://gitlab.wikimedia...
[16:04:57] Lift-Wing, Machine-Learning-Team: Quantize aya-expanse-32B with GPTQ - https://phabricator.wikimedia.org/T384734#10497550 (isarantopoulos) The above fix has been included in the latest release for [[ https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.4 | GTPQModel (v1.7.4) ]] I quantized aya-expan...
[16:08:43] elukey: I'm getting a templating error (pastebinned in a sec)
[16:09:57] https://phabricator.wikimedia.org/P72500
[16:11:19] klausman: checking
[16:11:32] I suspect it's independent of this particular change, but somewhere a template changed and our values.yaml was never updated
[16:12:06] yeah I have the same impression
[16:12:32] but CI didn't catch it, that is weird
[16:12:43] the ml_serv values.yaml has that field, so it's just missing in staging
[16:12:54] poor staging
[16:13:02] I'll make a patch
[16:18:51] elukey: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1114414
[16:22:09] checking
[16:23:31] I have a doubt though, on ml-staging
[16:23:48] I don't recall where this param is used
[16:23:57] the docstring says "that share some config (values) in "admin_ng/values/.yaml""
[16:24:07] but ml-staging is not there, we have ml-staging-codfw
[16:24:20] that is a dir and not a yaml file though
[16:24:29] I checked what we do for the wikikube's staging clusters
[16:24:33] but I don't find the values
[16:28:30] checking something
[16:29:15] ahhhh no wait!
[16:29:28] klausman: "cd /srv/deployment-charts/helmfile.d/admin_ng/kserve/" vs "cd /srv/deployment-charts/helmfile.d/admin_ng"
[16:29:44] aaah
[16:29:51] if you want to isolate kserve with helmfile you need to use -l name=kserve
[16:31:19] ok, that works (checking the diff over for a sec)
[16:32:06] Ok, applying
[16:32:30] and done, no errors
[16:33:03] Shall I bounce one or three services to see if they can restart correctly?
[16:33:33] let's do it just to be sure
[16:33:43] ack
[16:34:18] bounced revertrisk, no trouble at all.
[16:35:01] also bouncing a revscoring one, for completeness
[16:35:53] also fine. running httpbb against them as a final check
[16:39:03] all good
[16:39:17] all right let's do knative
[16:40:02] +2'd, waiting for merge
[16:46:32] hmmm. I see no diff
[16:49:47] is git up to date with the last commit?
[16:50:37] I see the diffs
[16:51:45] ok, I must be doing sth wrong, let me check
[16:54:18] weird. I checked whether git was up to date (and it was) and ran helmfile -e ml-staging-codfw -l name=knative-serving diff --context=3, but got no diff. Now I do.
[16:54:37] Alright, applying
[16:57:14] and done. doing the bounce-and-httpbb dance again (with different services)
[17:00:51] all good!
[17:03:23] elukey: thanks for the help. I'll push this to eqiad tomorrow, let it soak for a day, then do codfw on wednesday
[17:05:19] klausman: ack, I just expanded the filtering in https://logstash.wikimedia.org/app/discover#/view/7f276c90-f8a0-11ee-be54-8fd74c07934f?_g=h@b8b0449&_a=h@c3b7e3a to include ml-staging-codfw
[17:05:28] ml-serve prod is also missing, lemme add them
[17:06:32] report in https://phabricator.wikimedia.org/T369493#10497950 of the first verification step
[17:11:50] is the intent of the first command to trigger a violation so it shows up in the log?
[17:12:08] or is the second command supposed to do that?
[17:12:38] ah no, the second one is only a list
[17:13:22] I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1114423 as segway
[17:13:56] (not even a word, anyway, follow up :D)
[17:14:13] segway is a word. trademark, even :)
[17:15:00] all right I thought it was something misspelled (I knew only the trademark :D)
[17:15:11] oook so let's restart tomorrow with staging
[17:15:11] the word it's alluding to is segue
[17:15:25] and yes if you could rollout prod etc.. super
[17:15:30] will do
[17:15:35] I don't see any violation so far
[17:16:04] https://en.wiktionary.org/wiki/segue it's a word English borrowed from Italian, so I can see how you arrived at that :)
[17:16:28] TIL
[17:16:37] going afk, have a nice rest of the day folks!
[17:18:38] \o
[21:58:23] Machine-Learning-Team, MediaWiki-extensions-ORES, Edit-Review-Improvements-RC-Page, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban): [SPIKE] How could we add topic filtering to Recent Changes? [8H] - https://phabricator.wikimedia.org/T381569#10499024 (Kgraessle) #### Should we use OR...