[08:16:05] hello folks
[08:22:42] Mornin'!
[08:24:24] Hey!
[08:58:42] (CR) Elukey: "This change is ready for review." (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583) (owner: Ilias Sarantopoulos)
[09:07:28] elukey: I'm game for the Istio work we talked about yesterday.
[09:12:21] klausman_: sure, maybe in 15 mins?
[09:12:29] sgtm
[09:25:01] klausman_: I am checking the status of the revert risk pods in ml-serve-eqiad
[09:25:41] the storage initializer seems to fail to get the model from swift
[09:26:18] did we recently update RR?
[09:26:45] ah yes, Monday
[09:27:16] but only some of them have problems
[09:28:03] ah wait, they belong to different knative revisions
[09:28:12] the ones that are not coming up are revision 003
[09:28:46] the 004 (newer) ones seem to be ok. I wonder why the 003 ones weren't just terminated and removed
[09:29:03] so you can see them via kubectl get revertrisk -n revertrisk
[09:29:14] yes
[09:29:21] the... err sorry, get revision
[09:29:38] anyway, one possible way is to explicitly delete the revision via kubectl
[09:30:07] I checked with # kubectl get pods -n revertrisk -o wide
[09:30:38] done, the weird pods are gone
[09:30:40] and the breaking ones are also not on unique machines, i.e. there are working 004 ones on the same machines as breaking 003 ones, so it's also not a machine-specific problem
[09:30:57] the one in terminating state is my fault, I tried to delete a 004 pod as a test
[09:31:18] The terminating one I see is a 003 one
[09:31:33] mmm yes interesting
[09:31:45] ok then it was not mine
[09:31:57] it is just taking more time
[09:32:04] ok so in theory we should be good
[09:32:13] It's down to 1 subcontainer now
[09:32:23] codfw is fine afaics
[09:32:36] can confirm
[09:33:05] (as is staging, still running 003, but those indices are not always comparable)
[09:34:26] helmfile ... diff shows a difference for rr in staging, is that known/expected/wanted?
[09:37:05] generally no, what kind of differences?
[09:37:15] because Aiko was trying to deploy new images iirc
[09:37:40] - chart: kserve-inference-0.4.0
[09:37:42] + chart: kserve-inference-0.4.1
[09:37:51] and a new docker image
[09:38:02] basically the same 003 vs 004 difference as in prod
[09:38:30] yeah I think it is fine to deploy
[09:38:32] docker-registry.discovery.wmnet/wikimedia/machinelearning-liftwing-inference-services-revertrisk-multilingual:2023-03-20-162345-publish -> ...:2023-06-14-102454-publish
[09:38:40] will do
[09:52:30] klausman: ok ready!
[09:53:07] all done. I've been trying to find the bit of istio config that controls what resources can be reached, but no luck so far
[09:53:38] https://meet.google.com/tnn-ymyj-dho
[09:58:49] klausman: so I don't have more headphones to test, let's do it in here
[09:58:58] Alrighty
[09:59:01] sorry I'll try to fix what doesn't work, very werid
[09:59:04] *weird
[09:59:09] new setup/computer?
[09:59:20] I have bookworm but yesterday all worked
[09:59:37] Odd.
[09:59:46] I recently switched to trixe :)
[09:59:49] trixie*
[10:00:33] no idea what it is :)
[10:01:04] Ok, so where are the limits for what can be reached configured for istio?
[10:01:23] so they are in admin-ng, under ml-serve.yaml
[10:01:34] basically in istio there are
[10:01:38] - virtual services
[10:01:42] - destination rules
[10:01:49] - service entries
[10:02:29] the first one is basically a proxy rule, the second adds more details to the target of the proxy rule, and the last is to form a list of things that can be called outside the mesh
[10:02:36] you can check the resources with
[10:02:51] kubectl get {vs, dr, se} -n etc..
[10:03:36] we have essentially two configs
[10:03:39] 1) api-ro.discovery.wmnet
[10:03:49] 2) eventgate-main.discovery.wmnet
[10:04:50] they are all under the "mesh" entry in ml-serve.yaml
[10:05:33] so, let's say that in model.py we make a call for api-ro.discovery.wmnet with en.wikipedia.org as host header
[10:05:51] istio intercepts it (via iptables) and checks its virtual service list
[10:06:19] if it finds a match, it tries to see the target of the proxy, if it is allowed and how to call it
[10:06:34] Ah, so meta doesn't work because it's wikimedia.org, and not any of the existing host patterns (pedia, data, quote etc)
[10:07:06] (under mediawiki-api-vs)
[10:07:06] exactly
[10:07:26] all the above settings are under the `knative-serving` namespace because they are common
[10:07:44] there are also virtual services in specific model server namespaces, these are added automatically by kserve
[10:08:01] for example, check `kubectl describe vs -n knative-serving`
[10:08:03] Question is: do we only want to allow meta.wm.org for now, or add another *-ed entry. I feel the former is preferable, since I am unsure if there's anything under wm.org we don't want to allow
[10:09:12] so in general I think that our streams are not really for meta, so we can skip it. But let's say that we wanted to have it
[10:09:15] I presume ml-staging also inherits from this file?
[10:09:39] yes yes it is all defined in the helmfile config of admin_ng
[10:10:01] also verified by using describe on st-codfw
[10:10:03] so, first thing is to understand what endpoint serves meta.wikimedia.org
[10:10:28] IIRC it should be api-ro.discovery.wmnet, so in this case the change should be only related to the virtual service
[10:10:42] otherwise we'd need to add another endpoint etc..
[10:11:05] another important bit - if you check the api-ro's virtual service, you'll see that it listens for port 80
[10:11:06] yeah, another s_e stanza
[10:11:20] And seems to rewrite to 443?
[10:11:25] correct
[10:11:31] it is a trick to have http metrics
[10:11:40] because otherwise we'd have only up to tls ones
[10:11:56] ah, we wouldn't know any Host: headers etc
[10:11:59] envoy sees all the http outgoing conns and we have visibility on them
[10:12:31] so we would know how to proxy requests to api-ro.discovery.wmnet:443 because of things like SNI
[10:12:41] but model.py's code would (and used to) create tls conns
[10:13:16] so envoy in istio-proxy would only be able to see the SNI, no idea about HTTP details
[10:13:24] ah, ack
[10:13:45] this is how we made it work so far, that doesn't mean it can't be changed etc..
[10:14:00] I just wanted to highlight how it works :)
[10:14:14] Nah, it seems a good approach. Intercepting traffic between a pod and its istio sidecar requires permissions that make that kinda attack mood
[10:14:17] moot*
[10:14:43] this is not transparent anymore for our code, since we need to use http in model.py's code
[10:14:46] and not https etc..
[10:14:59] but it is the only way that upstream suggested to have metrics
[10:15:39] The only other approach I can think of is istio faking/splicing https entirely, and that seems a lot more brittle and less secure overall
[10:15:56] see for example https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=articletopic-outlink&var-backend=All&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99
[10:16:06] i.e. make the model believe whatever istio says, and istio terminating and then re-originating the TLS conn
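To make the flow above concrete, here is a minimal Python sketch of the calling pattern the mesh expects: plain HTTP towards api-ro.discovery.wmnet, with the target wiki selected via the Host header, so that the istio-proxy sidecar can match the header against the VirtualService and re-originate the request over TLS on port 443. The use of the requests library and the exact API parameters are illustrative assumptions; the real model-server code may use a different HTTP client.

    import requests

    # Sketch only: plain HTTP (not HTTPS) so that envoy in the istio-proxy
    # sidecar sees the request and can expose per-host HTTP metrics.
    MWAPI_ENDPOINT = "http://api-ro.discovery.wmnet"

    def get_page(title: str, lang: str = "en") -> dict:
        # The Host header selects the wiki; only host headers listed in the
        # mediawiki-api VirtualService are proxied, everything else fails.
        headers = {"Host": f"{lang}.wikipedia.org", "User-Agent": "liftwing-sketch"}
        params = {"action": "query", "titles": title, "format": "json"}
        resp = requests.get(
            f"{MWAPI_ENDPOINT}/w/api.php", headers=headers, params=params, timeout=10
        )
        resp.raise_for_status()
        return resp.json()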
[10:18:09] So do we want a blanket rule (*.wm.org) or a meta-specific one (meta.wm.org)?
[10:19:15] I would avoid the *.wm.org since it seems broad (it would be limited to api-ro calls though), but in theory we don't need any meta rule atm
[10:19:41] ack, just wondering in general. wm.org seems a bit more risky for *.
[10:19:53] or better, I am not sure if meta.w.o is a valid target for rr
[10:20:05] if so we can come up with a patch
[10:20:20] maybe aiko_ has thoughts on this? (if we need to support meta or not for revert risk)
[10:21:08] klausman: in case *.wm.org is probably ok, we add it only for api-ro's conns, so should be fine
[10:21:17] maybe we can quickly test that it works?
[10:21:35] on staging you can edit the vs and add the new target among the allowed host headers
[10:21:44] then make a call that fails and see if anything changes
[10:21:50] wdyt?
[10:26:37] Sounds like an idea
[10:27:13] when you say "edit the vs", do you mean editing the -staging.yml and pushing it? Or go through Gerrit?
[10:27:24] (CR) Elukey: llm: 8bit quantization (4 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583) (owner: Ilias Sarantopoulos)
[10:28:03] klausman: no no I mean a brutal kubectl edit vs blabla
[10:28:27] oh, scary
[10:28:34] I use it for small things in staging to quickly test things, of course it can be dangerous
[10:28:49] just wanted to let you try it
[10:29:05] it may take me some time to get the cmdline right, but I'll give it a go
[10:29:19] sometimes without it we'd need a million code changes to reach a result
[10:29:50] it is also a good ally if things are really broken and you can't deploy etc..
[10:31:19] Huh, this was easier than expected :)
[10:31:38] Just `kubectl describe vs -n knative-serving` and then add a line to the Hosts: list
[10:31:44] er s/describe/edit/
[10:31:51] exactly yes
[10:31:52] (describe then to verify)
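Complementing the kubectl edit/describe steps above, the currently allowed hosts can also be read back programmatically. A rough sketch with the kubernetes Python client, assuming a kubeconfig pointing at the target cluster and that the relevant VirtualServices live in knative-serving as described earlier; the API version may be v1alpha3 on older Istio installs:

    from kubernetes import client, config

    # Sketch: list the Istio VirtualServices in knative-serving and print
    # the host headers each one allows, e.g. to confirm a `kubectl edit vs`.
    config.load_kube_config()
    api = client.CustomObjectsApi()

    vs_list = api.list_namespaced_custom_object(
        group="networking.istio.io",
        version="v1beta1",
        namespace="knative-serving",
        plural="virtualservices",
    )

    for vs in vs_list["items"]:
        name = vs["metadata"]["name"]
        hosts = vs.get("spec", {}).get("hosts", [])
        print(f"{name}: {hosts}")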
[10:32:23] (PS4) Ilias Sarantopoulos: llm: 8bit quantization [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583)
[10:33:25] elukey: btw:
[10:33:33] # ~klausman/kubetail revertrisk* -n revertrisk -c kserve-container
[10:33:34] Will tail 2 logs...
[10:33:36] revertrisk-language-agnostic-predictor-default-00004-deplog22bg
[10:33:38] revertrisk-multilingual-predictor-default-00004-deploymentp2zpd
[10:35:10] ah nice
[10:35:21] (CR) Ilias Sarantopoulos: "Sorry for the leftovers!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583) (owner: Ilias Sarantopoulos)
[10:35:58] (CR) Elukey: [C: +1] "thanks for the patience, LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583) (owner: Ilias Sarantopoulos)
[10:35:58] How does one select the metawiki? with the lang: field of the query?
[10:36:41] (CR) Ilias Sarantopoulos: [C: +2] llm: 8bit quantization [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583) (owner: Ilias Sarantopoulos)
[10:37:17] Something to try later, I am being dragged away for lunch. bbiab!
[10:37:29] lunch as well
[10:37:41] (Merged) jenkins-bot: llm: 8bit quantization [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583) (owner: Ilias Sarantopoulos)
[11:37:08] Going for lunch
[12:37:47] elukey: o/ we don't need to support metawiki for any models as far as I know
[12:38:43] super
[12:38:47] thanks for confirming :)
[12:39:07] the main thing that I'd like to understand is where meta.wikimedia.org came from
[12:41:34] because the wiki_id in the source event is "metawiki"
[12:42:07] then get_lang() returns lang=meta https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/python/preprocess_utils.py#9
[12:42:47] then the host is set in https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/outlink-topic-model/transformer/transformer.py#53
[12:42:52] "{lang}.wikipedia.org"
[12:43:05] ah it was .wikipedia.org?
[12:43:12] I thought I'd read .wikimedia.org
[12:43:39] ah
[12:43:56] if it was .wikimedia.org, I have no idea
[12:45:41] https://phabricator.wikimedia.org/P49457
[12:45:48] yeah it is .wikimedia.org..
[12:48:51] weird
[12:48:59] it must be something behind the scenes
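A rough sketch of the mapping being reasoned about here (illustrative only, not the actual preprocess_utils.get_lang or transformer code): stripping the "wiki" suffix from the event's wiki_id and plugging the result into the "{lang}.wikipedia.org" template would turn "metawiki" into meta.wikipedia.org, which is why the meta.wikimedia.org host in P49457 is surprising.

    def get_lang(wiki_id: str) -> str:
        # Illustrative: "enwiki" -> "en", "metawiki" -> "meta".
        return wiki_id[: -len("wiki")] if wiki_id.endswith("wiki") else wiki_id

    def get_mediawiki_host(wiki_id: str) -> str:
        # Mirrors the "{lang}.wikipedia.org" template referenced above.
        return f"{get_lang(wiki_id)}.wikipedia.org"

    assert get_mediawiki_host("enwiki") == "en.wikipedia.org"
    # "metawiki" comes out as meta.wikipedia.org in this sketch, so the
    # meta.wikimedia.org seen in the paste must come from somewhere else.
    assert get_mediawiki_host("metawiki") == "meta.wikipedia.org"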
[12:53:12] (PS14) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/926420 (https://phabricator.wikimedia.org/T319170)
[12:57:08] (CR) Ladsgroup: [C: +2] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/926420 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[12:57:58] wooooooowwwwwwww
[12:58:09] Amir1, isaranto - \o/
[12:59:31] we are at the meeting, going through everything
[13:05:39] really nice milestone
[13:05:59] Machine-Learning-Team: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (klausman) Luca and I had a longer discussion via mail and IRC, about whether the backend-induced latency of an Inference Service should count towards the SLO budget or not. For now, the consensus is that we shou...
[13:13:27] we merged this! it is not enabled yet, I'll create a patch to enable it on beta cluster and we'll start from there!
[13:13:31] \o/
[13:13:43] thanks Amir1 for being so awesome! :)
[13:14:15] elukey: btw, I think I have figured out the threshold thing in Grafana I talked about.
[13:14:55] (Merged) jenkins-bot: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/926420 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[13:15:12] <3 Thank you for doing all of the work
[13:15:25] klausman: ah nice!
[13:15:36] really nice work folks!
[13:15:55] thanks for the support Amir1 <3
[13:16:34] we are not 100% on track for the September deadline but it is looking good
[13:23:52] anyway, we have ores-legacy, it should be more than enough
[13:27:09] Machine-Learning-Team, CirrusSearch, Discovery-Search (Current work): Add outlink topic model predictions to CirrusSearch indices - https://phabricator.wikimedia.org/T328276 (elukey) >>! In T328276#8924827, @EBernhardson wrote: > Looked into this, it looks like progress is being made but it's not qui...
[13:27:21] I followed up in https://phabricator.wikimedia.org/T328276 to alert Search that the new stream is ready for their review
[13:27:38] IIUC if Search migrates to outlink we'll be able to deprecate revision-score
[13:28:01] and with all the work done on the extension, we'll basically move 90% of ORES' traffic to Lift Wing
[13:28:27] the remaining clients will be WME and bots, the former already trying to migrate to Lift Wing
[13:28:38] soooo ores-legacy should be more than capable to support the old bots
[13:28:48] * elukey fingers crossed
[13:28:59] \o/
[13:29:29] are we going to be able to decommission a system? That would be amazing!
[13:30:06] gehel: all ores nodes, plus its redis cache, custom ChangeProp code, etc..
[13:31:14] that's going to deserve a few beers when done!
[13:58:04] isaranto: o/
[13:58:13] o/
[13:58:25] self.quantized = os.environ.get("QUANTIZED", False) - I just realized that this will probably lead to a string variable with 'True' no?
[13:58:31] not a bool
[13:58:42] 🤦
[13:58:56] yeah I totally missed it
[13:58:59] i'll fix it
[13:59:00] my bad
[13:59:02] <3
[13:59:07] nono I didn't think about it either
[15:03:23] (PS1) Ilias Sarantopoulos: llm: fix boolean param [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951
[15:04:45] (PS2) Ilias Sarantopoulos: llm: fix boolean param [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951
[15:06:06] (CR) Elukey: [C: +1] "nice it seems to work with "true" "True" etc.." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951 (owner: Ilias Sarantopoulos)
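The pitfall discussed at 13:58 is that os.environ values are always strings, and any non-empty string (even "False") is truthy. A minimal sketch of an explicit parse along the lines of the fix that got +1'd above; the accepted spellings ("1"/"true"/"yes") are an assumption, not necessarily the exact list in the 931951 patch:

    import os

    def env_flag(name: str, default: bool = False) -> bool:
        # Environment variables are strings; "False" or "0" would be truthy
        # if used directly, so normalise the value explicitly.
        value = os.environ.get(name)
        if value is None:
            return default
        return value.strip().lower() in ("1", "true", "yes")

    # QUANTIZED=True, QUANTIZED=true and QUANTIZED=1 all map to True,
    # while an unset variable falls back to the default.
    QUANTIZED = env_flag("QUANTIZED", default=False)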
[15:09:54] I am going to log off earlier today folks!
[15:09:55] o/
[15:10:07] have a nice rest of the day
[15:11:52] (CR) CI reject: [V: -1] llm: fix boolean param [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951 (owner: Ilias Sarantopoulos)
[15:12:34] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951 (owner: Ilias Sarantopoulos)
[15:14:11] elukey: threshold thing definitely figured out. But the grafana-side support is still buggy. Will follow up with Observability on it
[15:14:26] (I can also show you tomorrow how it is implemented)
[15:29:22] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951 (owner: Ilias Sarantopoulos)
[15:30:52] (CR) Ilias Sarantopoulos: [C: +2] llm: fix boolean param [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951 (owner: Ilias Sarantopoulos)
[15:31:54] (Merged) jenkins-bot: llm: fix boolean param [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931951 (owner: Ilias Sarantopoulos)
[15:44:31] I'll have to go afk in a bit so I won't attend the kserve community meeting
[15:44:56] waiting to deploy the latest patches for the LLMs and then will go offline for the day!
[16:15:27] I totally forgot to set the device_map to "auto" as I had done while testing
[16:15:31] * isaranto sighs
[16:17:14] ah no, I didn't use device_map, this is weird. will look into it tomorrow
[16:17:26] good afternoon every1
[16:25:45] o/ bye Ilias :)
[17:30:41] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (Trizek-WMF) Open→Resolved >>! In T308134#8899799, @Etonkovidova wrote: > @Trizek-WMF , @Sgs - no follow-ups for ** jbo.wp** a...
[18:33:59] Machine-Learning-Team, Research (FY2022-23-Research-April-June): (stretch) Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (achou)
[18:39:43] (PS1) AikoChou: readability: add readability model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931987 (https://phabricator.wikimedia.org/T334182)
[18:46:54] (CR) CI reject: [V: -1] readability: add readability model server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931987 (https://phabricator.wikimedia.org/T334182) (owner: AikoChou)
[21:43:26] Machine-Learning-Team, Continuous-Integration-Infrastructure, Release-Engineering-Team: Python torch fills disk of CI Jenkins instances - https://phabricator.wikimedia.org/T338317 (hashar) As a follow up, I have filed {T340070}