[08:10:23] Good morning! [09:32:00] morning! [09:39:22] o/ [10:03:05] 06Machine-Learning-Team, 06Structured-Data-Backlog: Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9664492 (10kevinbazira) Thank you for versioning the liftwing_prototype and making changes @mfossati! I tested the changes locally and got results that I shared in P589... [10:05:28] Morning! [10:06:34] \o [10:08:00] I'll merge a part of the Cassie-Istio policies in a bit. It shouldn't affect anything in prod (it's only for staging and the experimental NSes), so I expect no disruption. [10:08:16] ack [10:13:45] I have the following issue: whenever I rebuild a docker image with a pytorch dependency, docker layer caching doesn't work for the pip install layer, even if I don't change anything in the requirements. This results in redownloading and installing pytorch-rocm, which is 1.4GB. [10:14:44] luckily ofc the download part is covered by the pip cache, but still it takes a lot of time. My bet is that something in the pytorch index changes (the extra-index-url), so the layer is not cached [10:14:58] is anyone experiencing the same thing or is it just me? [10:18:27] I haven't tried a build yet, but I might after the Istio thing. [10:20:03] no need to waste your time. I was asking for aiko or Luca, who have built the pytorch images already [10:20:26] thanks though! It is not blocking me or anything, I was just curious since it is a bit annoying [10:26:25] It should definitely be either fixed or at least clear why it is happening [10:26:52] 06Machine-Learning-Team, 06Structured-Data-Backlog: Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9664768 (10kevinbazira) >>! In T358676#9662292, @mfossati wrote: > @kevinbazira: I'm hitting this ignored exception when running the code: > ` > Exception ignored in: <...
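As a side note on the caching problem above: a minimal Dockerfile sketch of the usual way to keep the pip layer stable is to copy only the requirements file before installing, and pin the exact torch build plus a fixed extra index so the layer's inputs don't change between rebuilds. The base image, paths, index URL, and versions here are illustrative assumptions, not the actual Lift-Wing/blubber setup:

```dockerfile
# Illustrative only -- not the production blubber config.
FROM docker-registry.wikimedia.org/bookworm

WORKDIR /srv/app

# Copy only the requirements first: source edits then no longer
# invalidate the expensive pip install layer.
COPY requirements.txt .

# Pin the exact ROCm build in requirements.txt; a BuildKit cache mount
# keeps downloaded wheels across rebuilds even when this layer is
# invalidated.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install \
        --extra-index-url https://download.pytorch.org/whl/rocm5.7 \
        -r requirements.txt

# Everything below changes freely without re-triggering the install.
COPY . .
```

If the layer still misses the cache with unchanged inputs, building with BuildKit's `--progress=plain` shows per-step CACHED decisions, which helps pinpoint the instruction whose checksum changed.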
[10:27:30] Welp, the istio change doesn't work, time to revert it [10:31:25] btw here's a new Greek LLM https://huggingface.co/ilsp/Meltemi-7B-v1 [10:32:27] isaranto: o/ I didn't experience the same thing I think, but the pytorch for rr-ml is installed via KI [10:32:45] ok, thanks! [10:34:50] (03PS20) 10Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986) [11:00:43] isaranto: I tried rebuilding a docker image with torch installed via requirements.txt and didn't have the same issue. the second time I built it, the cache was used [11:06:31] 06Machine-Learning-Team, 13Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9664867 (10isarantopoulos) We'll be using the pytorch rocm image based on debian bookworm for this image (see [[ https://phabricator.wikimedia.org/T360638 | #T360638 ]]) Also... [11:06:42] aiko: ok, thanks a lot for testing it! [11:07:11] hello folks! [11:07:21] I am in a bit earlier since I need to work on the docker registry nodes [11:08:19] isaranto: o/ I didn't get the part about the layer not being cached [11:08:39] are you saying that the hf pip install step doesn't find pytorch-rocm in the base image and has to re-download it? [11:11:18] 06Machine-Learning-Team, 06serviceops, 13Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9664877 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=110ad5f3-e41f-4f7d-a5d0-3343dc9fca15) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and... [11:11:48] 06Machine-Learning-Team, 06serviceops, 13Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9664878 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1d34de6a-7fb2-4477-984a-7dcc642d43b2) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and...
[11:11:59] elukey: fyi, working with Balthazar on doing the Cassie thing The New And Improved Way [11:12:10] elukey: yep, exactly that. It doesn't happen every time, just at really random times. today it happened on the first build, and from then on it found the cache [11:14:41] isaranto: did you see my comments about where we deploy torch in the base image vs in blubber? I was chatting about it yesterday, wondering if it is the case [11:15:03] klausman: yep I saw it, I asked Balthazar to wait a sec [11:16:10] yes I did, but I'm still not using the torch base image [11:17:05] 06Machine-Learning-Team, 06serviceops, 13Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9664883 (10ops-monitoring-bot) VM registry2003.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM [11:20:15] isaranto: ahhhhh okok [11:20:17] elukey: we're both in lunch-mode [11:20:20] sorry totally misunderstood [11:20:22] klausman: <3 [11:20:24] elukey: why the wait? [11:20:41] precaution, I am working on the docker registry [11:20:48] ah, righto. [11:21:53] ack! [11:25:37] hi luca! [11:25:55] * aiko lunch :D [11:33:38] 06Machine-Learning-Team, 06serviceops, 13Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9664919 (10ops-monitoring-bot) VM registry2004.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM [11:45:54] docker registry updated!!! \o/ [11:46:02] hi aiko :) [11:46:03] nice! [11:50:53] Niiice [11:50:57] * isaranto lunch! [11:59:39] 06Machine-Learning-Team, 06Structured-Data-Backlog: Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9664973 (10mfossati) >>! In T358676#9664492, @kevinbazira wrote: > Thank you for versioning the liftwing_prototype and making changes @mfossati! I tested the changes lo...
[12:05:12] 06Machine-Learning-Team, 06serviceops, 13Patch-For-Review: 14Bump memory for registry[12]00[34] VMs - 14https://phabricator.wikimedia.org/T360637#9664981 (10elukey) 05Open→03Resolved 14Everything done! [12:15:02] * elukey lunch! [12:20:31] elukey: are Balthazar and I ok to continue with the Istio stuff? [12:37:55] yep! [12:37:57] all done now [12:41:03] excellent [12:41:15] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1015029 is my first stab at using what Balthazar and I have built so far. [12:41:30] and CI bombed of course [13:54:39] nice! [13:55:00] I left a comment since one thing may not work, namely the label selectors used [13:55:29] afaics we may need to change Balthazar's template to allow setting app-wmf instead of app [13:55:33] should be easy enough [13:55:49] yeah, we have our own instance of it anyway [13:56:22] in theory no, the template is managed by sextant [13:57:28] I was thinking of the meta template at line 23 [13:57:28] we inject it in our network policy template, and that is ok, but Balthazar's template should be the same one that is present in "modules" [13:58:26] the issue should be at line 23 of external-services-networkpolicy_1.0.0.tpl [13:58:45] selector: "app == '{{ template "base.name.chart" $ }}' && release == '{{ $.Release.Name }}'" [13:58:52] Mh, I see [13:59:36] what I'd do is to file a patch to add version 1.1.0 or 1.0.1 of the template, and then update your patch [13:59:46] ack [14:00:04] lemme know if it makes sense, it wasn't a "this is the way", I am thinking out loud :) [14:00:19] with the current version I am almost sure that we wouldn't target the right pods [14:00:45] I am a bit ignorant about how sextant and the templates fit together (and why we have to have a copy of the modules/ stuff) [14:41:36] my understanding of the whole thing is the following: [14:42:06] - deployment-chart's modules is the canonical version of the templates, basically a sort-of repository [14:42:33] - in every chart we specify
(via package.json) what version of those templates the chart will use [14:44:27] - sextant is a handy tool that copies the right version of every template from "modules" to a chart, based on what package.json states [14:45:23] so every chart has its own copy of the templates, but those shouldn't be changed, otherwise when "modules" gets updated it may become a problem [14:45:29] lemme know if it makes sense [14:46:13] Sort of. Why is there still a vendor/... copy if Sextant does this? [14:46:44] (also, unrelatedly, Balthazar has suggested pinging Janis for input, which I will do) [14:46:51] s/un// [14:47:41] the vendor copy is what sextant manages [14:48:19] But we still have to copy it there ourselves? [14:48:56] so sextant copies the files, you just need to file the patch after sextant runs [14:49:18] I see. Is this documented anywhere? I found nothing on Wikitech [14:49:47] the use case of a single template may not show how many things sextant does, because it is easy... but sometimes you have use cases like "for this chart, I want the latest version of all modules that I use under vendor" [14:50:08] and a template version of module X can have a dependency on module Y [14:50:36] so instead of reading a ton of yaml, you can just use sextant [14:51:01] https://gitlab.wikimedia.org/repos/sre/sextant has a very nice README.md [14:51:08] So what is the usual workflow? Make a patch with package.lock/json updated and then? [14:51:28] it depends on the use case [14:51:59] but in general if you need to update a chart's dependency, or multiple ones, you can just check the README for the use case [14:52:24] see https://gitlab.wikimedia.org/repos/sre/sextant#update-to-a-new-minormajor-version [14:52:46] package.lock/json is managed by sextant as well [14:52:51] Oh there is an entirely separate repo!
[14:52:59] basically you run the command, add the files to git and file a patch [14:53:06] yeah sextant is a tool [14:53:45] that I knew, but I had no idea it was a WMF one [14:54:43] So I pinged Janis about the change and what the most idiomatic/WMF-style way forward would be. [14:55:13] Balthazar thinks my proposed change might work, but wasn't sure about said idiomaticness [14:58:03] not sure if I follow, we just want to have a way to specify the selectors, right? [14:59:14] basically, do we just add app-wmf to the selector line? or is there a better way? [14:59:50] app-wmf is something that we had to add since app was automatically set by kserve IIRC [15:00:16] I think you'd have to invent something to override the selector for ml use-cases [15:00:26] so we could add an OR operator, but we could also set an option to specify what label to use, by default "app" [15:00:36] jayme: yeah +1 [15:01:46] or just fix https://phabricator.wikimedia.org/T253395 fleet wide :) [15:02:27] "just" [15:02:31] only opened it 4 years ago, still a young task :D [15:02:38] or as we Germans call it: "mal eben schnell" ("real quick") [15:03:11] klausman: so the template should have a tunable to specify something other than "app", and we default to "app" in the template [15:03:14] that should do the trick [15:03:33] and also help desperate souls in the future with the same problem [15:04:00] I have no idea how to make a tunable for the template [15:04:21] I bet there will be no other souls as desperate as the ml ones :) [15:04:34] jayme: how dare you :D [15:04:44] ML leads the way! Wait. [15:05:06] klausman: there are a lot of other templates using something similar, you can check in deployment-charts [15:05:44] 06Machine-Learning-Team, 13Patch-For-Review: Create a Pytorch base image - https://phabricator.wikimedia.org/T360638#9665792 (10elukey) To keep archives happy: * Aiko and I tested the Revert Risk ML Docker image using Pytorch's base image and ran it locally, it worked fine!
* The new image was pushed to the... [15:07:39] elukey: I just have no idea about the supposed syntax [15:08:40] 06Machine-Learning-Team, 13Patch-For-Review: Create a Pytorch base image - https://phabricator.wikimedia.org/T360638#9665799 (10elukey) Use case to test: * Blubber model server using the Pytorch base image * torch stated in one of the model server's requirements.txt files (same version and a different one).... [15:09:42] klausman: yeah I don't recall the syntax either 99% of the time (probably only Janis is able to remember it and get it right on the first try) [15:09:56] this is why I suggested checking other examples in deployment-charts templates [15:10:08] but I don't even know what I am looking for. [15:12:01] so the idea is to be able to customize the hardcoded "app" [15:12:03] selector: "app == '{{ template "base.name.chart" $ }}' && release == '{{ $.Release.Name }}'" [15:12:08] yes [15:12:35] and I suspect this is already a parameterized call: [15:12:36] {{ template "base.networkpolicy.egress.external-services" $ }} [15:12:41] $ being the parameter [15:13:25] there are a ton of templates rendered, you just need to find the one most suited for the job [15:13:41] hint: in the template that Balthazar created there is $.Values.external_services mentioned [15:14:00] I would start from there [15:15:48] (for example, {{ $serviceType }} is rendered from there etc..) [15:19:12] So adding a new line to charts/kserve-inference/values.yaml (e.g. appname: "wmf-app") and then use $.Values.appname instead of "app"?
modulo better names instead of "appname", but [15:21:39] what I don't know is whether labels with dashes are allowed at all, so even if the resultant policy has `selector: "app-wmf == 'kserve-inference' && release == 'main'"`, it might be a syntax error (the same way you can't have a var named `app-wmf` in Python) [15:25:01] 10Lift-Wing, 06Machine-Learning-Team: Remove redundant deployments from ml-staging - https://phabricator.wikimedia.org/T361117 (10isarantopoulos) 03NEW [15:26:54] klausman: what I'd do is to have a module-specific appname label, instead of using something generic in $.Values [15:28:01] we don't have fixtures for modules sadly, but one thing that you can do to test (at least I did it in the past) is to change the module locally, update the chart and render it via helm template or local CI [15:28:10] (using .fixture) [15:28:14] Yeah, that's what I am doing. [15:28:23] perfect [15:28:32] rake run_locally['default'] specifically [15:29:25] I got it working with the top-level name in values, but I am a bit unsure what you mean by module-specific name. Something like extservices_netpolicy_appname?
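The tunable discussed here could look roughly like the sketch below. This is a hypothetical 1.0.1 variant of the module's selector line, not a merged patch; the value name `external_services_appname` and the `app-wmf` label follow the names floated in this conversation, and Helm's `default` function keeps the old behaviour when the value is unset:

```yaml
# Hypothetical sketch of external-services-networkpolicy_1.0.1.tpl;
# version 1.0.0 hardcodes "app" as the label name. If
# .Values.external_services_appname is unset, "default" falls back
# to "app", so existing charts render exactly as before.
selector: "{{ $.Values.external_services_appname | default "app" }} == '{{ template "base.name.chart" $ }}' && release == '{{ $.Release.Name }}'"
```

A deployment that needs the override would then set, e.g. in its helmfile values, `external_services_appname: app-wmf`; charts that don't set it keep targeting pods labeled `app`.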
(03PS1) 10Elukey: python: upgrade aiohttp's version to avoid issues with py3.11 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1015070 [15:32:34] I can't really use a member of external_services, since the template uses all of them [15:33:21] yes something like that [15:33:44] external_services_appname: wmf-app [15:33:46] external_services: [15:33:48] cassandra: [15:33:50] - ml-cassandra [15:33:55] (in helmfile.d/ml-services/experimental/values-ml-staging-codfw.yaml) [15:34:07] And then using that in the template, or if unset, default to "app" [15:34:08] external_services could be modified to have all services under a more nested structure, and add "appname:" to it [15:34:22] but external_services_appname may also be fine [15:34:36] The downside to the nested approach is that I would have to touch every chart using it [15:34:48] true yes [15:35:01] Future Work™ [15:35:47] Checking whether the template output is still right... [15:49:00] Ok, I think the change is ready for rerereview :) [15:50:15] checking [15:57:10] what reviewers do you suggest for the split-out change? [15:58:16] Janis and Balthazar for the module change, the upgrade of the chart could be ours + Balthazar in CC if you want [15:59:09] klausman: from previous chats with Janis, it is best to proceed in this way for modules changes [15:59:27] 1) first patch to just add the new file as a copy of the last version [15:59:38] 2) second patch that modifies the bit that you want to change [15:59:46] so it is easy to review [16:00:15] then the third patch is the one that you are working on [16:00:36] With review turnaround that will take a while [16:00:55] what do you mean? [16:02:44] every new patch is at least 5-10m of turnaround until it is reviewed. If I am unlucky, maybe a day [16:04:09] isaranto, kevinbazira: about locust, I know why it's not working. we didn't add a host header for the revertrisk model.
probably because we used api-gw before and forgot to add it when we changed to staging [16:04:19] I can let you send patches to serviceops without me reviewing them; I can assure you that they will ask for the same stuff. I am not suggesting anything that I haven't already experienced myself in the past. If you feel that I am delaying you, please go ahead and ask other SREs :) [16:04:23] klausman: --^ [16:04:27] I'll file a patch to fix it [16:05:50] ok! nice work aiko. At least this makes sense! [16:05:58] I just feel it's a bit of a waste of others' time to have both Janis and Balthazar review the "I copied a file to 1.0.1" change [16:06:48] isaranto: thanks for mentioning the problem might be the host header :D [16:07:19] klausman: if you don't believe me, please check https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/912850 [16:07:29] I believe you. [16:08:20] and it is not a waste of time, since the more people who know how serviceops (and hence, the de facto k8s reviewers) prefers to see patches filed, the more widespread this knowledge becomes [16:19:31] I created the patch to remove some deployments from ml-staging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1015080 [16:41:51] klausman: I think you're missing a git add https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1015029/15 [16:43:59] Hm. I thought I'd rebased correctly but I must have messed something up [16:45:28] Aaah, I didn't redo the copy of 1.0.1 [16:59:46] elukey: final change in the 3-split is ready now [17:38:52] I found why I ended up with a 6GB image in pytorch + rocm.
It was because there was another dependency from huggingface upstream for torch, and that one installed the CPU version after I installed the rocm one [17:39:36] so perhaps that kind of answers what we were wondering about having a torch requirement in the pytorch base image (although it will be good to test it separately) [17:48:53] 06Machine-Learning-Team, 13Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9666520 (10isarantopoulos) After thinking about this and trying various things out (copying code or using a specific commit) I found the following 2 issues we need to resolve:... [17:50:17] isaranto: yeah I have done some tests with RR-ML, if I set the same torch dep as the base image in requirements.txt, pip tries to re-install it [17:50:45] so I think that for the pip purposes, we'll need to have stuff installed/linked under /opt to make blubber work [17:51:10] I need to check more in depth what blubber does, there may be an env var to set to force it to use other dirs [17:51:16] I'll work on it tomorrow :) [17:51:23] have a nice rest of the day folks! [17:52:19] logging off as well o/ [17:53:20] I will un-deploy the revscoring models from staging tomorrow [17:53:35] I'm logging off too o/ [18:57:39] 06Machine-Learning-Team, 10Structured-Data-Backlog (Current Work): Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9666775 (10mfossati) [18:59:05] 06Machine-Learning-Team, 10Structured-Data-Backlog (Current Work): Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9666776 (10mfossati) 05Open→03In progress [18:59:24] 06Machine-Learning-Team, 10Structured-Data-Backlog (Current Work): Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9666778 (10mfossati)
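One way to guard against a transitive dependency swapping in the CPU wheel like this is a pip constraints file. The `-c`/`--constraint` flag is standard pip; the version and ROCm tag below are illustrative guesses, not the actual pin used here:

```
# constraints.txt -- pin torch to the ROCm build already in the image
# (version and local tag are illustrative)
torch==2.2.1+rocm5.7
```

Installing with `pip install -c constraints.txt -r requirements.txt` (plus the matching `--extra-index-url`) then fails loudly if an upstream requirement demands an incompatible torch, instead of silently installing a second copy on top of the ROCm one.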