[08:10:23] Good morning! [09:32:00] morning! [09:39:22] o/ [10:03:05] 06Machine-Learning-Team, 06Structured-Data-Backlog: Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9664492 (10kevinbazira) Thank you for versioning the liftwing_prototype and making changes @mfossati! I tested the changes locally and got results that I shared in P589... [10:05:28] Morning! [10:06:34] \o [10:08:00] I'll merge a part of the Cassie-Istio policies in a bit. It shouldn't affect anything in prod (it's only for staging and the experimental NSes), so I expect no disruption. [10:08:16] ack [10:13:45] I have the following issue: whenever I rebuild a docker image with a pytorch dependency, docker layer caching doesn't work for the pip install layer, even if I don't change anything in the requirements. This results in redownloading and installing pytorch-rocm, which is 1.4GB. [10:14:44] luckily ofc the download part is covered by the pip cache, but still it takes a lot of time. My bet is that something in the pytorch index changes (the extra-index-url), so the layer is not cached [10:14:58] is anyone experiencing the same thing or is it just me? [10:18:27] I haven't tried a build yet, but I might after the Istio thing. [10:20:03] no need to waste your time. I was asking for aiko or Luca, who have built the pytorch images already [10:20:26] thanks though! It is not blocking me or anything, I was just curious since it is a bit annoying [10:26:25] It should definitely be either fixed or at least clear why it is happening [10:26:52] 06Machine-Learning-Team, 06Structured-Data-Backlog: Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9664768 (10kevinbazira) >>! In T358676#9662292, @mfossati wrote: > @kevinbazira: I'm hitting this ignored exception when running the code: > ` > Exception ignored in: <...
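As a side note on the caching problem above: a minimal Dockerfile sketch of the usual way to keep the pip layer stable is to copy only the requirements file before installing, and pin the exact torch build plus a fixed extra index so the layer's inputs don't change between rebuilds. The base image, paths, index URL, and versions here are illustrative assumptions, not the actual Lift-Wing/blubber setup:

```dockerfile
# Illustrative only -- not the production blubber config.
FROM docker-registry.wikimedia.org/bookworm

WORKDIR /srv/app

# Copy only the requirements first: source edits then no longer
# invalidate the expensive pip install layer.
COPY requirements.txt .

# Pin the exact ROCm build in requirements.txt; a BuildKit cache mount
# keeps downloaded wheels across rebuilds even when this layer is
# invalidated.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install \
        --extra-index-url https://download.pytorch.org/whl/rocm5.7 \
        -r requirements.txt

# Everything below changes freely without re-triggering the install.
COPY . .
```

If the layer still misses the cache with unchanged inputs, building with BuildKit's `--progress=plain` shows per-step CACHED decisions, which helps pinpoint the instruction whose checksum changed.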
[10:27:30] Welp, the istio change doesn't work, time to revert it [10:31:25] btw here's a new Greek LLM https://huggingface.co/ilsp/Meltemi-7B-v1 [10:32:27] isaranto: o/ I didn't experience the same thing I think, but the pytorch for rr-ml is installed via KI [10:32:45] ok, thanks! [10:34:50] (03PS20) 10Ilias Sarantopoulos: huggingface: add huggingface image [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1009783 (https://phabricator.wikimedia.org/T357986) [11:00:43] isaranto: I tried rebuilding a docker image with torch installed via requirements.txt and didn't have the same issue. the second time I built it, the cache was used [11:06:31] 06Machine-Learning-Team, 13Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9664867 (10isarantopoulos) We'll be using the pytorch rocm image based on debian bookworm for this image (see [[ https://phabricator.wikimedia.org/T360638 | #T360638 ]]) Also... [11:06:42] aiko: ok, thanks a lot for testing it! [11:07:11] hello folks! [11:07:21] I am in a bit earlier since I need to work on the docker registry nodes [11:08:19] isaranto: o/ I didn't get the part about the layer not being cached [11:08:39] are you saying that the hf pip install step doesn't find pytorch-rocm in the base image and has to re-download it? [11:11:18] 06Machine-Learning-Team, 06serviceops, 13Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9664877 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=110ad5f3-e41f-4f7d-a5d0-3343dc9fca15) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and... [11:11:48] 06Machine-Learning-Team, 06serviceops, 13Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9664878 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1d34de6a-7fb2-4477-984a-7dcc642d43b2) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and...
[11:11:59] elukey: fyi, working with Balthazar on doing the Cassie thing The New And Improved Way [11:12:10] elukey: yep, exactly that. It doesn't happen every time, just at really random times. today it happened on the first build, and from then on it found the cache [11:14:41] isaranto: did you see my comments about where we deploy torch in the base image vs in blubber? I was chatting about it yesterday, wondering if it is the case [11:15:03] klausman: yep I saw it, I asked Balthazar to wait a sec [11:16:10] yes I did, but I'm still not using the torch base image [11:17:05] 06Machine-Learning-Team, 06serviceops, 13Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9664883 (10ops-monitoring-bot) VM registry2003.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM [11:20:15] isaranto: ahhhhh okok [11:20:17] elukey: we're both in lunch-mode [11:20:20] sorry totally misunderstood [11:20:22] klausman: <3 [11:20:24] elukey: why the wait? [11:20:41] precaution, I am working on the docker registry [11:20:48] ah, righto. [11:21:53] ack! [11:25:37] hi luca! [11:25:55] * aiko lunch :D [11:33:38] 06Machine-Learning-Team, 06serviceops, 13Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9664919 (10ops-monitoring-bot) VM registry2004.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase VRAM [11:45:54] docker registry updated!!! \o/ [11:46:02] hi aiko :) [11:46:03] nice! [11:50:53] Niiice [11:50:57] * isaranto lunch! [11:59:39] 06Machine-Learning-Team, 06Structured-Data-Backlog: Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9664973 (10mfossati) >>! In T358676#9664492, @kevinbazira wrote: > Thank you for versioning the liftwing_prototype and making changes @mfossati! I tested the changes lo...
[12:05:12] 06Machine-Learning-Team, 06serviceops, 13Patch-For-Review: 14Bump memory for registry[12]00[34] VMs - 14https://phabricator.wikimedia.org/T360637#9664981 (10elukey) 05Open→03Resolved 14Everything done! [12:15:02] * elukey lunch! [12:20:31] elukey: are Balthazar and I ok to continue with the Istio stuff? [12:37:55] yep! [12:37:57] all done now [12:41:03] excellent [12:41:15] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1015029 is my first stab at using what Balthazar and I have built so far. [12:41:30] and CI bombed of course [13:54:39] nice! [13:55:00] I left a comment since one thing may not work, namely the label selectors used [13:55:29] afaics we may need to change Balthazar's template to allow setting app-wmf instead of app [13:55:33] should be easy enough [13:55:49] yeah, we have our own instance of it anyway [13:56:22] in theory no, the template is managed by sextant [13:57:28] I was thinking of the meta template at line 23 [13:57:28] we inject it in our network policy template, and that is ok, but Balthazar's template should be the same one that is present in "modules" [13:58:26] the issue should be at line 23 of external-services-networkpolicy_1.0.0.tpl [13:58:45] selector: "app == '{{ template "base.name.chart" $ }}' && release == '{{ $.Release.Name }}'" [13:58:52] Mh, I see [13:59:36] what I'd do is to file a patch to add version 1.1.0 or 1.0.1 of the template, and then update your patch [13:59:46] ack [14:00:04] lemme know if it makes sense, it wasn't a "this is the way", I am thinking out loud :) [14:00:19] with the current version I am almost sure that we wouldn't target the right pods [14:00:45] I am a bit ignorant about how sextant and the templates fit together (and why we have to have a copy of the modules/ stuff) [14:41:36] my understanding of the whole thing is the following: [14:42:06] - deployment-chart's modules is the canonical version of the templates, basically a sort-of repository [14:42:33] - in every chart we specify
(via package.json) what version of those templates the chart will use [14:44:27] - sextant is a handy tool that copies the right version of every template from "modules" to a chart, based on what package.json states [14:45:23] so every chart has its own copy of the templates, but those shouldn't be changed, otherwise when "modules" gets updated it may become a problem [14:45:29] lemme know if it makes sense [14:46:13] Sort of. Why is there still a vendor/... copy if Sextant does this? [14:46:44] (also, unrelatedly, Balthazar has suggested pinging Janis for input, which I will do) [14:46:51] s/un// [14:47:41] the vendor copy is what sextant manages [14:48:19] But we still have to copy it there ourselves? [14:48:56] so sextant copies the files, you just need to file the patch after sextant runs [14:49:18] I see. Is this documented anywhere? I found nothing on Wikitech [14:49:47] the use case of a single template may not show how many things sextant does, because it is easy... but sometimes you have use cases like "for this chart, I want the latest version of all modules that I use under vendor" [14:50:08] and a template version of module X can have a dependency on module Y [14:50:36] so instead of reading a ton of yaml, you can just use sextant [14:51:01] https://gitlab.wikimedia.org/repos/sre/sextant has a very nice README.md [14:51:08] So what is the usual workflow? Make a patch with package.lock/json updated and then? [14:51:28] it depends on the use case [14:51:59] but in general if you need to update a chart's dependency, or multiple ones, you can just check the README for the use case [14:52:24] see https://gitlab.wikimedia.org/repos/sre/sextant#update-to-a-new-minormajor-version [14:52:46] package.lock/json is managed by sextant as well [14:52:51] Oh there is an entirely separate repo!
[14:52:59] basically you run the command, add the files to git and file a patch [14:53:06] yeah sextant is a tool [14:53:45] that I knew, but I had no idea it was a WMF one [14:54:43] So I pinged Janis about the change and what the most idiomatic/WMF-style way forward would be. [14:55:13] Balthazar thinks my proposed change might work, but wasn't sure about said idiomaticness [14:58:03] not sure if I follow, we just want to have a way to specify the selectors, right? [14:59:14] basically, do we just add app-wmf to the selector line? or is there a better way? [14:59:50] app-wmf is something that we had to add since app was automatically set by kserve IIRC [15:00:16] I think you'd have to invent something to override the selector for ml use-cases [15:00:26] so we could add an OR operator, but we could also set an option to specify what label to use, by default "app" [15:00:36] jayme: yeah +1 [15:01:46] or just fix https://phabricator.wikimedia.org/T253395 fleet wide :) [15:02:27] "just" [15:02:31] only opened it 4 years ago, still a young task :D [15:02:38] or as we Germans call it: "mal eben schnell" ("real quick") [15:03:11] klausman: so the template should have a tunable to specify something other than "app", and we default to "app" in the template [15:03:14] that should do the trick [15:03:33] and also help desperate souls in the future with the same problem [15:04:00] I have no idea how to make a tunable for the template [15:04:21] I bet there will be no other souls as desperate as the ml ones :) [15:04:34] jayme: how dare you :D [15:04:44] ML leads the way! Wait. [15:05:06] klausman: there are a lot of other templates using something similar, you can check in deployment-charts [15:05:44] 06Machine-Learning-Team, 13Patch-For-Review: Create a Pytorch base image - https://phabricator.wikimedia.org/T360638#9665792 (10elukey) To keep archives happy: * Aiko and I tested the Revert Risk ML Docker image using Pytorch's base image and ran it locally, it worked fine!
* The new image was pushed to the... [15:07:39] elukey: I just have no idea about the supposed syntax [15:08:40] 06Machine-Learning-Team, 13Patch-For-Review: Create a Pytorch base image - https://phabricator.wikimedia.org/T360638#9665799 (10elukey) Use case to test: * Blubber model server using the Pytorch base image * torch stated in one of the model server's requirements.txt files (same version and a different one).... [15:09:42] klausman: yeah I don't recall the syntax either 99% of the time (probably only Janis is able to remember it and get it right on the first try) [15:09:56] this is why I suggested checking other examples in deployment-charts templates [15:10:08] but I don't even know what I am looking for. [15:12:01] so the idea is to be able to customize the hardcoded "app" [15:12:03] selector: "app == '{{ template "base.name.chart" $ }}' && release == '{{ $.Release.Name }}'" [15:12:08] yes [15:12:35] and I suspect this is already a parameterized call: [15:12:36] {{ template "base.networkpolicy.egress.external-services" $ }} [15:12:41] $ being the parameter [15:13:25] there are a ton of templates rendered, you just need to find the one most suited for the job [15:13:41] hint: in the template that Balthazar created there is $.Values.external_services mentioned [15:14:00] I would start from there [15:15:48] (for example, {{ $serviceType }} is rendered from there etc..) [15:19:12] So adding a new line to charts/kserve-inference/values.yaml (e.g. appname: "wmf-app") and then use $.Values.appname instead of "app"?
modulo better names instead of "appname", but [15:21:39] what I don't know is whether labels with dashes are allowed at all, so even if the resultant policy has `selector: "app-wmf == 'kserve-inference' && release == 'main'"`, it might be a syntax error (the same way you can't have a var named `app-wmf` in Python) [15:25:01] 10Lift-Wing, 06Machine-Learning-Team: Remove redundant deployments from ml-staging - https://phabricator.wikimedia.org/T361117 (10isarantopoulos) 03NEW [15:26:54] klausman: what I'd do is to have a module-specific appname label, instead of using something generic in $.Values [15:28:01] we don't have fixtures for modules sadly, but one thing that you can do to test (at least I did it in the past) is to change the module locally, update the chart and render it via helm template or local CI [15:28:10] (using .fixture) [15:28:14] Yeah, that's what I am doing. [15:28:23] perfect [15:28:32] rake run_locally['default'] specifically [15:29:25] I got it working with the top-level name in values, but I am a bit unsure what you mean by module-specific name. Something like extservices_netpolicy_appname?
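The tunable discussed here could look roughly like the sketch below. This is a hypothetical 1.0.1 variant of the module's selector line, not a merged patch; the value name `external_services_appname` and the `app-wmf` label follow the names floated in this conversation, and Helm's `default` function keeps the old behaviour when the value is unset:

```yaml
# Hypothetical sketch of external-services-networkpolicy_1.0.1.tpl;
# version 1.0.0 hardcodes "app" as the label name. If
# .Values.external_services_appname is unset, "default" falls back
# to "app", so existing charts render exactly as before.
selector: "{{ $.Values.external_services_appname | default "app" }} == '{{ template "base.name.chart" $ }}' && release == '{{ $.Release.Name }}'"
```

A deployment that needs the override would then set, e.g. in its helmfile values, `external_services_appname: app-wmf`; charts that don't set it keep targeting pods labeled `app`.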
(03PS1) 10Elukey: python: upgrade aiohttp's version to avoid issues with py3.11 [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1015070 [15:32:34] I can't really use a member of external_services, since the template uses all of them [15:33:21] yes something like that [15:33:44] external_services_appname: wmf-app [15:33:46] external_services: [15:33:48] cassandra: [15:33:50] - ml-cassandra [15:33:55] (in helmfile.d/ml-services/experimental/values-ml-staging-codfw.yaml) [15:34:07] And then using that in the template, or if unset, default to "app" [15:34:08] external_services could be modified to have all services under a more nested structure, and add "appname:" to it [15:34:22] but external_services_appname may also be fine [15:34:36] The downside to the nested approach is that I would have to touch every chart using it [15:34:48] true yes [15:35:01] Future Work™ [15:35:47] Checking whether the template output is still right... [15:49:00] Ok, I think the change is ready for rerereview :) [15:50:15] checking [15:57:10] what reviewers do you suggest for the split-out change? [15:58:16] Janis and Balthazar for the module change, the upgrade of the chart could be ours + Balthazar in CC if you want [15:59:09] klausman: from previous chats with Janis, it is best to proceed in this way for modules changes [15:59:27] 1) first patch to just add the new file as a copy of the last version [15:59:38] 2) second patch that modifies the bit that you want to change [15:59:46] so it is easy to review [16:00:15] then the third patch is the one that you are working on [16:00:36] With review turnaround that will take a while [16:00:55] what do you mean? [16:02:44] every new patch is at least 5-10m of turnaround until it is reviewed. If I am unlucky, maybe a day [16:04:09] isaranto, kevinbazira: about locust, I know why it's not working. we didn't add a host header for the revertrisk model.
probably because we used api-gw before and forgot to add it when we changed to staging [16:04:19] I can let you send patches to serviceops without me reviewing them; I can assure you that they will ask for the same stuff. I am not suggesting anything that I haven't already experienced myself in the past. If you feel that I am delaying you, please go ahead and ask other SREs :) [16:04:23] klausman: --^ [16:04:27] I'll file a patch to fix it [16:05:50] ok! nice work aiko. At least this makes sense! [16:05:58] I just feel it's a bit of a waste of others' time to have both Janis and Balthazar review the "I copied a file to 1.0.1" change [16:06:48] isaranto: thanks for mentioning the problem might be the host header :D [16:07:19] klausman: if you don't believe me, please check https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/912850 [16:07:29] I believe you. [16:08:20] and it is not a waste of time, since the more people who know how serviceops (and hence, the de facto k8s reviewers) prefers to see patches filed, the more widespread this knowledge becomes [16:19:31] I created the patch to remove some deployments from ml-staging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1015080 [16:41:51] klausman: I think you're missing a git add https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1015029/15 [16:43:59] Hm. I thought I'd rebased correctly but I must have messed something up [16:45:28] Aaah, I didn't redo the copy of 1.0.1 [16:59:46] elukey: final change in the 3-split is ready now [17:38:52] I found why I ended up with a 6GB image in pytorch + rocm.
It was because there was another dependency from huggingface upstream for torch, and that one installed the CPU version after I installed the rocm one [17:39:36] so perhaps that kind of answers what we were wondering about having a torch requirement in the pytorch base image (although it will be good to test it separately) [17:48:53] 06Machine-Learning-Team, 13Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9666520 (10isarantopoulos) After thinking about this and trying various things out (copying code or using a specific commit) I found the following 2 issues we need to resolve:... [17:50:17] isaranto: yeah I have done some tests with RR-ML, if I set the same torch dep as the base image in requirements.txt, pip tries to re-install it [17:50:45] so I think that for the pip purposes, we'll need to have stuff installed/linked under /opt to make blubber work [17:51:10] I need to check more in depth what blubber does, there may be an env var to set to force it to use other dirs [17:51:16] I'll work on it tomorrow :) [17:51:23] have a nice rest of the day folks! [17:52:19] logging off as well o/ [17:53:20] I will un-deploy the revscoring models from staging tomorrow [17:53:35] I'm logging off too o/ [18:57:39] 06Machine-Learning-Team, 10Structured-Data-Backlog (Current Work): Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9666775 (10mfossati) [18:59:05] 06Machine-Learning-Team, 10Structured-Data-Backlog (Current Work): Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9666776 (10mfossati) 05Open→03In progress [18:59:24] 06Machine-Learning-Team, 10Structured-Data-Backlog (Current Work): Host a logo detection model for Commons images - https://phabricator.wikimedia.org/T358676#9666778 (10mfossati)
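One way to guard against a transitive dependency swapping in the CPU wheel like this is a pip constraints file. The `-c`/`--constraint` flag is standard pip; the version and ROCm tag below are illustrative guesses, not the actual pin used here:

```
# constraints.txt -- pin torch to the ROCm build already in the image
# (version and local tag are illustrative)
torch==2.2.1+rocm5.7
```

Installing with `pip install -c constraints.txt -r requirements.txt` (plus the matching `--extra-index-url`) then fails loudly if an upstream requirement demands an incompatible torch, instead of silently installing a second copy on top of the ROCm one.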