[08:16:05] hello folks
[08:22:42] Mornin'!
[08:24:24] Hey!
[08:58:42] (CR) Elukey: "This change is ready for review." (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583) (owner: Ilias Sarantopoulos)
[09:07:28] elukey: I'm game for the Istio work we talked about yesterday.
[09:12:21] klausman_: sure, maybe in 15 mins?
[09:12:29] sgtm
[09:25:01] klausman_: I am checking the status of the revert risk pods in ml-serve-eqiad
[09:25:41] the storage initializer seems to fail to get the model from swift
[09:26:18] did we recently update RR?
[09:26:45] ah yes, Monday
[09:27:16] but only some of them have problems
[09:28:03] ah wait, they belong to different knative revisions
[09:28:12] the ones that are not coming up are revision 003
[09:28:46] the 004 (newer) ones seem to be ok. I wonder why the 003 ones weren't just terminated and removed
[09:29:03] so you can see them via kubectl get revertrisk -n revertrisk
[09:29:14] yes
[09:29:21] the... err sorry, get revision
[09:29:38] anyway, one possible way is to explicitly delete the revision via kubectl
[09:30:07] I checked with # kubectl get pods -n revertrisk -o wide
[09:30:38] done, the weird pods are gone
[09:30:40] and the breaking ones are also not on unique machines, i.e. there are working 004 ones on the same machines as breaking 003 ones, so it's also not a machine-specific problem
[09:30:57] the one in terminating state is my fault, I tried to delete a 004 pod as a test
[09:31:18] The terminating one I see is a 003 one
[09:31:33] mmm yes interesting
[09:31:45] ok then it was not mine
[09:31:57] it is just taking more time
[09:32:04] ok so in theory we should be good
[09:32:13] It's down to 1 subcontainer now
[09:32:23] codfw is fine afaics
[09:32:36] can confirm
[09:33:05] (as is staging, still running 003, but those indices are not always comparable)
[09:34:26] helmfile ... diff shows a difference for rr in staging, is that known/expected/wanted?
[09:37:05] generally no, what kind of differences?
[09:37:15] because Aiko was trying to deploy new images iirc
[09:37:40] - chart: kserve-inference-0.4.0
[09:37:42] + chart: kserve-inference-0.4.1
[09:37:51] and a new docker image
[09:38:02] basically the same 003 vs 004 difference as in prod
[09:38:30] yeah I think it is fine to deploy
[09:38:32] docker-registry.discovery.wmnet/wikimedia/machinelearning-liftwing-inference-services-revertrisk-multilingual:2023-03-20-162345-publish -> ...:2023-06-14-102454-publish
[09:38:40] will do
[09:52:30] klausman: ok ready!
[09:53:07] all done. I've been trying to find the bit of istio config that controls what resources can be reached, but no luck so far
[09:53:38] https://meet.google.com/tnn-ymyj-dho
[09:58:49] klausman: so I don't have more headphones to test, let's do it in here
[09:58:58] Alrighty
[09:59:01] sorry I'll try to fix what doesn't work, very werid
[09:59:04] *weird
[09:59:09] new setup/computer?
[09:59:20] I have bookworm but yesterday all worked
[09:59:37] Odd.
[09:59:46] I recently switched to trixe :)
[09:59:49] trixie*
[10:00:33] no idea what it is :)
[10:01:04] Ok, so where are the limits for what can be reached configured for istio?
[10:01:23] so they are in admin-ng, under ml-serve.yaml
[10:01:34] basically in istio there are
[10:01:38] - virtual services
[10:01:42] - destination rules
[10:01:49] - service entries
[10:02:29] the first one is basically a proxy rule, the second adds more details to the target of the proxy rule, and the last is to form a list of things that can be called outside the mesh
[10:02:36] you can check the resources with
[10:02:51] kubectl get {vs, dr, se} -n etc..
[10:03:36] we have essentially two configs
[10:03:39] 1) api-ro.discovery.wmnet
[10:03:49] 2) eventgate-main.discovery.wmnet
[10:04:50] they are all under the "mesh" entry in ml-serve.yaml
[10:05:33] so, let's say that in model.py we make a call for api-ro.discovery.wmnet with en.wikipedia.org as host header
[10:05:51] istio intercepts it (via iptables) and checks its virtual service list
[10:06:19] if it finds a match, it tries to see the target of the proxy, if it is allowed and how to call it
[10:06:34] Ah, so meta doesn't work because it's wikimedia.org, and not any of the existing host patterns (pedia, data, quote etc)
[10:07:06] (under mediawiki-api-vs)
[10:07:06] exactly
[10:07:26] all the above settings are under the `knative-serving` namespace because they are common
[10:07:44] there are also virtual services in specific model server namespaces, these are added automatically by kserve
[10:08:01] for example, check `kubectl describe vs -n knative-serving`
[10:08:03] Question is: do we only want to allow meta.wm.org for now, or add another *-ed entry. I feel the former is preferable, since I am unsure if there's anything under wm.org we don't want to allow
[10:09:12] so in general I think that our streams are not really for meta, so we can skip it. But let's say that we wanted to have it
[10:09:15] I presume ml-staging also inherits from this file?
[10:09:39] yes yes it is all defined in the helmfile config of admin_ng
[10:10:01] also verified by using describe on st-codfw
[10:10:03] so, first thing is to understand what endpoint serves meta.wikimedia.org
[10:10:28] IIRC it should be api-ro.discovery.wmnet, so in this case the change should be only related to the virtual service
[10:10:42] otherwise we'd need to add another endpoint etc..
[10:11:05] another important bit - if you check the api-ro's virtual service, you'll see that it listens for port 80
[10:11:06] yeah, another s_e stanza
[10:11:20] And seems to rewrite to 443?
[10:11:25] correct
[10:11:31] it is a trick to have http metrics
[10:11:40] because otherwise we'd have only up to tls ones
[10:11:56] ah, we wouldn't know any Host: headers etc
[10:11:59] envoy sees all the http outgoing conns and we have visibility on them
[10:12:31] so we would know how to proxy requests to api-ro.discovery.wmnet:443 because of things like SNI
[10:12:41] but model.py's code would (and used to) create tls conns
[10:13:16] so envoy in istio-proxy would only be able to see the SNI, no idea about HTTP details
[10:13:24] ah, ack
[10:13:45] this is how we made it work so far, that doesn't mean it can't be changed etc..
[10:14:00] I just wanted to highlight how it works :)
[10:14:14] Nah, it seems a good approach. Intercepting traffic between a pod and its istio sidecar requires permissions that make that kinda attack mood
[10:14:17] moot*
[10:14:43] this is not transparent anymore for our code, since we need to use http in model.py's code
[10:14:46] and not https etc..
[10:14:59] but it is the only way that upstream suggested to have metrics
[10:15:39] The only other approach I can think of is istio faking/splicing https entirely, and that seems a lot more brittle and less secure overall
[10:15:56] see for example https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=articletopic-outlink&var-backend=All&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99
[10:16:06] i.e. make the model believe whatever istio says, and istio terminating and then re-originating the TLS conn
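To make the flow above concrete, here is a minimal Python sketch of the calling pattern the mesh expects: plain HTTP towards api-ro.discovery.wmnet, with the target wiki selected via the Host header, so that the istio-proxy sidecar can match the header against the VirtualService and re-originate the request over TLS on port 443. The use of the requests library and the exact API parameters are illustrative assumptions; the real model-server code may use a different HTTP client.

    import requests

    # Sketch only: plain HTTP (not HTTPS) so that envoy in the istio-proxy
    # sidecar sees the request and can expose per-host HTTP metrics.
    MWAPI_ENDPOINT = "http://api-ro.discovery.wmnet"

    def get_page(title: str, lang: str = "en") -> dict:
        # The Host header selects the wiki; only host headers listed in the
        # mediawiki-api VirtualService are proxied, everything else fails.
        headers = {"Host": f"{lang}.wikipedia.org", "User-Agent": "liftwing-sketch"}
        params = {"action": "query", "titles": title, "format": "json"}
        resp = requests.get(
            f"{MWAPI_ENDPOINT}/w/api.php", headers=headers, params=params, timeout=10
        )
        resp.raise_for_status()
        return resp.json()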
[10:18:09] So do we want a blanket rule (*.wm.org) or a meta-specific one (meta.wm.org)?
[10:19:15] I would avoid the *.wm.org since it seems broad (it would be limited to api-ro calls though), but in theory we don't need any meta rule atm
[10:19:41] ack, just wondering in general. wm.org seems a bit more risky for *.
[10:19:53] or better, I am not sure if meta.w.o is a valid target for rr
[10:20:05] if so we can come up with a patch
[10:20:20] maybe aiko_ has thoughts on this? (if we need to support meta or not for revert risk)
[10:21:08] klausman: in case *.wm.org is probably ok, we add it only for api-ro's conns, so should be fine
[10:21:17] maybe we can quickly test that it works?
[10:21:35] on staging you can edit the vs and add the new target among the allowed host headers
[10:21:44] then make a call that fails and see if anything changes
[10:21:50] wdyt?
[10:26:37] Sounds like an idea
[10:27:13] when you say "edit the vs", do you mean editing the -staging.yml and pushing it? Or go through Gerrit?
[10:27:24] (CR) Elukey: llm: 8bit quantization (4 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583) (owner: Ilias Sarantopoulos)
[10:28:03] klausman: no no I mean a brutal kubectl edit vs blabla
[10:28:27] oh, scary
[10:28:34] I use it for small things in staging to quickly test things, of course it can be dangerous
[10:28:49] just wanted to let you try it
[10:29:05] it may take me some time to get the cmdline right, but I'll give it a go
[10:29:19] sometimes without it we'd need a million code changes to reach a result
[10:29:50] it is also a good ally if things are really broken and you can't deploy etc..
[10:31:19] Huh, this was easier than expected :)
[10:31:38] Just `kubectl describe vs -n knative-serving` and then add a line to the Hosts: list
[10:31:44] er s/describe/edit/
[10:31:51] exactly yes
[10:31:52] (describe then to verify)
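Complementing the kubectl edit/describe steps above, the currently allowed hosts can also be read back programmatically. A rough sketch with the kubernetes Python client, assuming a kubeconfig pointing at the target cluster and that the relevant VirtualServices live in knative-serving as described earlier; the API version may be v1alpha3 on older Istio installs:

    from kubernetes import client, config

    # Sketch: list the Istio VirtualServices in knative-serving and print
    # the host headers each one allows, e.g. to confirm a `kubectl edit vs`.
    config.load_kube_config()
    api = client.CustomObjectsApi()

    vs_list = api.list_namespaced_custom_object(
        group="networking.istio.io",
        version="v1beta1",
        namespace="knative-serving",
        plural="virtualservices",
    )

    for vs in vs_list["items"]:
        name = vs["metadata"]["name"]
        hosts = vs.get("spec", {}).get("hosts", [])
        print(f"{name}: {hosts}")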
[10:32:23] (PS4) Ilias Sarantopoulos: llm: 8bit quantization [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583)
[10:33:25] elukey: btw:
[10:33:33] # ~klausman/kubetail revertrisk* -n revertrisk -c kserve-container
[10:33:34] Will tail 2 logs...
[10:33:36] revertrisk-language-agnostic-predictor-default-00004-deplog22bg
[10:33:38] revertrisk-multilingual-predictor-default-00004-deploymentp2zpd
[10:35:10] ah nice
[10:35:21] (CR) Ilias Sarantopoulos: "Sorry for the leftovers!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583) (owner: Ilias Sarantopoulos)
[10:35:58] (CR) Elukey: [C: +1] "thanks for the patience, LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583) (owner: Ilias Sarantopoulos)
[10:35:58] How does one select the metawiki? with the lang: field of the query?
[10:36:41] (CR) Ilias Sarantopoulos: [C: +2] llm: 8bit quantization [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583) (owner: Ilias Sarantopoulos)
[10:37:17] Something to try later, I am being dragged away for lunch. bbiab!
[10:37:29] lunch as well
[10:37:41] (Merged) jenkins-bot: llm: 8bit quantization [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583) (owner: Ilias Sarantopoulos)
[11:37:08] Going for lunch
[12:37:47] elukey: o/ we don't need to support metawiki for any models as far as I know
[12:38:43] super
[12:38:47] thanks for confirming :)
[12:39:07] the main thing that I'd like to understand is where meta.wikimedia.org came from
[12:41:34] because the wiki_id in the source event is "metawiki"
[12:42:07] then get_lang() returns lang=meta https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/python/preprocess_utils.py#9
[12:42:47] then the host is set in https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/outlink-topic-model/transformer/transformer.py#53
[12:42:52] "{lang}.wikipedia.org"
[12:43:05] ah it was .wikipedia.org?
[12:43:12] I thought I'd read .wikimedia.org
[12:43:39] ah
[12:43:56] if it was .wikimedia.org, I have no idea
[12:45:41] https://phabricator.wikimedia.org/P49457
[12:45:48] yeah it is .wikimedia.org..
[12:48:51] weird
[12:48:59] it must be something behind the scenes
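A rough sketch of the mapping being reasoned about here (illustrative only, not the actual preprocess_utils.get_lang or transformer code): stripping the "wiki" suffix from the event's wiki_id and plugging the result into the "{lang}.wikipedia.org" template would turn "metawiki" into meta.wikipedia.org, which is why the meta.wikimedia.org host in P49457 is surprising.

    def get_lang(wiki_id: str) -> str:
        # Illustrative: "enwiki" -> "en", "metawiki" -> "meta".
        return wiki_id[: -len("wiki")] if wiki_id.endswith("wiki") else wiki_id

    def get_mediawiki_host(wiki_id: str) -> str:
        # Mirrors the "{lang}.wikipedia.org" template referenced above.
        return f"{get_lang(wiki_id)}.wikipedia.org"

    assert get_mediawiki_host("enwiki") == "en.wikipedia.org"
    # "metawiki" comes out as meta.wikipedia.org in this sketch, so the
    # meta.wikimedia.org seen in the paste must come from somewhere else.
    assert get_mediawiki_host("metawiki") == "meta.wikipedia.org"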
[12:53:12] (PS14) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/926420 (https://phabricator.wikimedia.org/T319170)
[12:57:08] (CR) Ladsgroup: [C: +2] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/926420 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[12:57:58] wooooooowwwwwwww
[12:58:09] Amir1, isaranto - \o/
[12:59:31] we are at the meeting, going through everything
[13:05:39] really nice milestone
[13:05:59] Machine-Learning-Team: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (klausman) Luca and I had a longer discussion via mail and IRC, about whether the backend-induced latency of an Inference Service should count towards the SLO budget or not. For now, the consensus is that we shou...
[13:13:27] we merged this! it is not enabled yet, I'll create a patch to enable it on beta cluster and we'll start from there!
[13:13:31] \o/
[13:13:43] thanks Amir1 for being so awesome! :)
[13:14:15] elukey: btw, I think I have figured out the threshold thing in Grafana I talked about.
[13:14:55] (Merged) jenkins-bot: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/926420 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[13:15:12] <3 Thank you for doing all of the work
[13:15:25] klausman: ah nice!
[13:15:36] really nice work folks!
[13:15:55] thanks for the support Amir1 <3
[13:16:34] we are not 100% on track for the September deadline but it is looking good
[13:23:52] anyway, we have ores-legacy, it should be more than enough
[13:27:09] Machine-Learning-Team, CirrusSearch, Discovery-Search (Current work): Add outlink topic model predictions to CirrusSearch indices - https://phabricator.wikimedia.org/T328276 (elukey) >>! In T328276#8924827, @EBernhardson wrote: > Looked into this, it looks like progress is being made but it's not qui...
[13:27:21] I followed up in https://phabricator.wikimedia.org/T328276 to alert Search that the new stream is ready for their review
[13:27:38] IIUC if Search migrates to outlink we'll be able to deprecate revision-score
[13:28:01] and with all the work done on the extension, we'll basically move 90% of ORES' traffic to Lift Wing
[13:28:27] the remaining clients will be WME and bots, the former already trying to migrate to Lift Wing
[13:28:38] soooo ores-legacy should be more than capable to support the old bots
[13:28:48] * elukey fingers crossed
[13:28:59] \o/
[13:29:29] are we going to be able to decommission a system? That would be amazing!
[13:30:06] gehel: all ores nodes, plus its redis cache, custom ChangeProp code, etc..
[13:31:14] that's going to deserve a few beers when done!
[13:58:04] isaranto: o/
[13:58:13] o/
[13:58:25] self.quantized = os.environ.get("QUANTIZED", False) - I just realized that this will probably lead to a string variable with 'True' no?
[13:58:31] not a bool
[13:58:42] 🤦
[13:58:56] yeah I totally missed it
[13:58:59] i'll fix it
[13:59:00] my bad
[13:59:02] <3
[13:59:07] nono I didn't think about it either
[15:03:23] (PS1) Ilias Sarantopoulos: llm: fix boolean param [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951
[15:04:45] (PS2) Ilias Sarantopoulos: llm: fix boolean param [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951
[15:06:06] (CR) Elukey: [C: +1] "nice it seems to work with "true" "True" etc.." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951 (owner: Ilias Sarantopoulos)
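The pitfall discussed at 13:58 is that os.environ values are always strings, and any non-empty string (even "False") is truthy. A minimal sketch of an explicit parse along the lines of the fix that got +1'd above; the accepted spellings ("1"/"true"/"yes") are an assumption, not necessarily the exact list in the 931951 patch:

    import os

    def env_flag(name: str, default: bool = False) -> bool:
        # Environment variables are strings; "False" or "0" would be truthy
        # if used directly, so normalise the value explicitly.
        value = os.environ.get(name)
        if value is None:
            return default
        return value.strip().lower() in ("1", "true", "yes")

    # QUANTIZED=True, QUANTIZED=true and QUANTIZED=1 all map to True,
    # while an unset variable falls back to the default.
    QUANTIZED = env_flag("QUANTIZED", default=False)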
[15:09:54] I am going to log off earlier today folks!
[15:09:55] o/
[15:10:07] have a nice rest of the day
[15:11:52] (CR) CI reject: [V: -1] llm: fix boolean param [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951 (owner: Ilias Sarantopoulos)
[15:12:34] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951 (owner: Ilias Sarantopoulos)
[15:14:11] elukey: threshold thing definitely figured out. But the grafana-side support is still buggy. Will follow up with Observability on it
[15:14:26] (I can also show you tomorrow how it is implemented)
[15:29:22] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951 (owner: Ilias Sarantopoulos)
[15:30:52] (CR) Ilias Sarantopoulos: [C: +2] llm: fix boolean param [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951 (owner: Ilias Sarantopoulos)
[15:31:54] (Merged) jenkins-bot: llm: fix boolean param [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951 (owner: Ilias Sarantopoulos)
[15:44:31] I'll have to go afk in a bit so I won't attend the kserve community meeting
[15:44:56] waiting to deploy the latest patches for the LLMs and then will go offline for the day!
[16:15:27] I totally forgot to set the device_map to "auto" as I had done while testing
[16:15:31] * isaranto sighs
[16:17:14] ah no, I didn't use device_map, this is weird. will look into it tomorrow
[16:17:26] good afternoon every1
[16:25:45] o/ bye Ilias :)
[17:30:41] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (Trizek-WMF) Open→Resolved >>! In T308134#8899799, @Etonkovidova wrote: > @Trizek-WMF , @Sgs - no follow-ups for ** jbo.wp** a...
[18:33:59] Machine-Learning-Team, Research (FY2022-23-Research-April-June): (stretch) Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (achou)
[18:39:43] (PS1) AikoChou: readability: add readability model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931987 (https://phabricator.wikimedia.org/T334182)
[18:46:54] (CR) CI reject: [V: -1] readability: add readability model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931987 (https://phabricator.wikimedia.org/T334182) (owner: AikoChou)
[21:43:26] Machine-Learning-Team, Continuous-Integration-Infrastructure, Release-Engineering-Team: Python torch fills disk of CI Jenkins instances - https://phabricator.wikimedia.org/T338317 (hashar) As a follow up, I have filed {T340070}