[06:47:19] good morning :)
[06:48:44] the DNS queries are around 600 rps for both ml-serve clusters now, that seems a reasonably good result. There is also the "zipkin" fix that needs to be applied - it needs a roll restart of all the pods, so it will likely be picked up during the next pod deployment
[06:49:37] there is also a general improvement of kube api latencies:
[06:49:38] https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27&var-datasource=thanos&var-site=eqiad&var-cluster=k8s-mlserve&from=now-7d&to=now
[07:15:32] kevinbazira: o/
[07:15:35] welcome back :)
[07:15:53] thanks elukey o/
[07:16:02] hope I didn't miss much :)
[07:16:40] not sure if you read it but Chris posted a Google doc link a few days ago on Slack, we are using it to collect current/future tasks. The idea is to groom/plan for them on the 5th, so that we can decide what to prioritize (Lift Wing MVP, ORES deprecation, etc..)
[07:16:53] if you have tasks and work planned please add it to the doc!
[07:17:03] yep, I've seen this doc. Thanks!
[07:20:58] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey)
[07:21:26] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Reduce DNS queries from istio-proxies to coredns on ML clusters - https://phabricator.wikimedia.org/T318814 (elukey) Open→Resolved a:elukey Queries in both ml-serve clusters are now at around 600 rps, and when we started i...
[07:50:33] created an initial dashboard for kserve's logs
[07:50:33] https://logstash.wikimedia.org/app/dashboards#/view/fa21f5e0-42ef-11ed-ae81-bb78ac0690d3?_g=h@94e3f88&_a=h@957255d
[07:50:46] then we can decide as we go what is needed
[07:52:01] next one is knative, and then we should be ok
[08:26:09] aaand created also https://logstash.wikimedia.org/app/dashboards#/view/fedf64a0-42f4-11ed-ae81-bb78ac0690d3
[08:26:16] very basic but should do it for now
[08:31:17] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey)
[08:32:09] closed also the task related to dashboards
[08:34:06] with 50 clients connected (and 50 conns), I see ~93 rps without errors for enwiki-goodfaith
[08:34:25] p99 up to 600ms, that is acceptable with a single pod and so many clients
[08:36:06] it looks really good for the moment
[08:37:47] klausman: o/ when you have a moment can you update the api-gw task with the last status and next steps?
[08:46:55] will do
[08:55:47] Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (elukey) >>! In T312518#8239403, @Isaac wrote: > More or less copying over a comment from another task that's more pertinent here though likely beyond scope: the ORES Extension has the [[https://www.mediawiki....
[09:03:21] Machine-Learning-Team, Data Engineering Planning, Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (elukey) >>! In T317768#8250242, @Ottomata wrote: >> do you think that this t...
[09:10:39] Machine-Learning-Team: Move ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (elukey)
[09:34:16] Man, the URL/Host scheme we're using is really breaking my brain
[09:44:11] elukey: when you have a moment, can we chat about URL schemes and such?
[09:53:44] sure
[09:53:52] what is the main pain point?
[09:54:07] It's way too complicated and repetitive
[09:54:28] https://phabricator.wikimedia.org/P35301 This is the scheme as far as I understand ig
[09:54:31] it*
[09:55:16] yep
[09:56:03] there are two aspects to keep in mind:
[09:56:08] There are some aspects I find puzzling: why are parts repeated in the Host: header? And why is revscoring absent from the URL itself? Plus, while I can come up with URL from the Host, the opposite is not true. And non-revscoring models are a complete mystery.
[09:56:43] 1) Istio uses the Host header to figure out what is the destination service/pod to pick up
[09:57:05] 2) the URI is specific to Kserve, what the model server needs to know to run a score
[09:57:40] the Host header is composed of isvc-service-name.k8s-namespace.wikimedia.org
[09:58:03] "revscoring" pops up since it is part of the namespace
[09:58:06] ah, so the namespace and the isvc both use revscoring (as a substring)
[09:58:26] er no
[09:58:38] nono it is only something in the k8s namespaces that we chose
[09:58:59] So what would the Host header be for the last line of the paste?
[09:59:30] isvc-name.namespace.wikimedia.org
[10:00:56] to get the isvc-name: kubectl get isvc -n articletopic-outlink
[10:01:03] So outlink-topic-model.articletopic-outlink.wikimedia.org
[10:01:11] yeah
[10:01:47] Hrm. I'm gonna have to have a hard think on how we can package this into something the API GW can route. E.g. it knows nothing about k8s namespaces
[10:02:03] And I'd prefer to not have one routing entry per model we serve
[10:03:31] we could use a regex based on the isvc name, it should suffice
[10:04:17] like "enwiki-goodfaith" should use "revscoring-editquality-goodfaith", matching on "-goodfaith" etc..
[10:04:37] it is not handy I know but so far I didn't see a way to change the split view mentioned in 1) 2)
[10:04:38] But what about non-revscoring models?
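A minimal sketch of the split described in 1) and 2) above, assuming the Host header is always isvc-name.namespace.wikimedia.org and the KServe URI is always /v1/models/isvc-name:predict. The function names are hypothetical, and the default base URL is the internal endpoint quoted elsewhere in this log; nothing here is the team's actual client code.

```python
# Hypothetical helpers: the Host header routes the request through Istio,
# while the URI path tells the KServe model server which model to score with.

def host_header(isvc_name: str, namespace: str) -> str:
    """Build the Host header as isvc-name.namespace.wikimedia.org."""
    return f"{isvc_name}.{namespace}.wikimedia.org"


def predict_url(isvc_name: str,
                base_url: str = "https://inference.discovery.wmnet:30443") -> str:
    """Build the KServe URI; note that the namespace never appears in it."""
    return f"{base_url}/v1/models/{isvc_name}:predict"


if __name__ == "__main__":
    # The non-revscoring example from the discussion.
    print(host_header("outlink-topic-model", "articletopic-outlink"))
    print(predict_url("outlink-topic-model"))
```

The sketch makes the asymmetry visible: the URL can be derived from the Host header (the isvc name is its first label), but the Host header cannot be derived from the URL without knowing the namespace.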
[10:05:20] it is the same, they all have something unique that points to a namespace
[10:05:38] there is also the use case of the experimental namespace though
[10:06:11] And what do we tell users to use? The URL I can see as something like https://api.wikimedia.org/service/lw/inference/v1/models/enwiki-articlequality:predict
[10:06:29] Are we going to tell them to set a Host: header too? Or should the API GW fill that in?
[10:06:43] the latter for sure
[10:07:10] this is why we have the pathing_map in the chart IIRC
[10:07:19] So it does need to know what isvc lives in what namespace
[10:08:00] And we can't ever have two isvcs with the same name, but living in different namespaces.
[10:08:50] Also, looking at e.g. line 5 of the paste, "editquality" in the namespace inside the Host: header comes out of nowhere
[10:09:37] Which implies a hard rule that enwiki-damaging can only ever live in that namespace
[10:10:24] (all of this assuming that the API-GW side of the URL scheme doesn't have anything that our URL scheme doesn't)
[10:10:48] The current URLs look like https://api.wikimedia.org/service/lw/inference/v1/models/enwiki-articlequality:predict
[10:11:15] in my opinion the above URL should not be used
[10:11:33] we'd need something that shields a little more kserve's URI scheme
[10:11:34] like
[10:12:08] https://api.wikimedia.org/service/lw/inference/v1/models/articlequality/enwiki
[10:13:02] and then let the API-GW build the backend URI
[10:13:05] But that's just a re-arrangement of the same tokens (and dropping :predict). It still would not contain the full namespace name for line 5 as I mentioned.
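A rough sketch of what "let the API-GW build the backend URI" could look like, assuming an external path of the form .../models/model/wiki and a small lookup table from model name to k8s namespace. The table, the regex and the route function are hypothetical; this is not the pathing_map format of the API gateway chart, only an illustration of the translation step.

```python
import re
from typing import Optional, Tuple

# Hypothetical external path scheme: /service/lw/inference/v1/models/<model>/<wiki>
EXTERNAL_PATH = re.compile(
    r"^/service/lw/inference/v1/models/(?P<model>[a-z-]+)/(?P<wiki>[a-z_]+)$"
)

# Hypothetical table: one entry per model family, not one per served model.
NAMESPACES = {
    "articlequality": "revscoring-articlequality",
    "goodfaith": "revscoring-editquality-goodfaith",
}


def route(path: str) -> Optional[Tuple[str, str]]:
    """Translate an external path into (backend URI, Host header), or None."""
    match = EXTERNAL_PATH.match(path)
    if match is None:
        return None
    namespace = NAMESPACES.get(match["model"])
    if namespace is None:
        return None
    isvc = f"{match['wiki']}-{match['model']}"
    return f"/v1/models/{isvc}:predict", f"{isvc}.{namespace}.wikimedia.org"


if __name__ == "__main__":
    print(route("/service/lw/inference/v1/models/articlequality/enwiki"))
```

With this shape the namespace never reaches external users, but the gateway configuration still has to know which namespace each model family lives in, which is the hardcoding concern raised in the discussion.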
[10:13:31] sure, but this is something that could be stored in pathing_map or similar
[10:13:42] we don't really need to let it bubble up for external users
[10:13:44] So the config of the GW would "know" that damaging is always in editquality
[10:14:32] we should probably call it "editquality-damaging" or similar, not only "damaging", but the GW would know the target namespace yes
[10:15:13] So e.g. https://api.wikimedia.org/service/lw/inference/v1/models/editquality-damaging/enwiki
[10:15:17] IIRC in the pathing map we can add regexes on the URI, and build a host header accordingly
[10:15:30] something like that yes
[10:15:33] Yes, I just want to avoid hardcoding too much there.
[10:16:45] but the URI format should be decided by the team, IIRC Andy started a doc a while ago related to the URI scheme, don't really recall if we decided anything though
[10:17:03] How do you feel about only dropping the revscoring- prefix from the isvc internally? I.e. change https://inference.discovery.wmnet:30443/v1/models/enwiki-damaging:predict to https://inference.discovery.wmnet:30443/v1/models/enwiki-editquality-damaging:predict
[10:18:16] Because atm, the mapping between isvcs to namespace (or vice versa) is not very consistent, I think
[10:18:45] we cannot drop the revscoring- bit, it is not an isvc setting but a k8s namespace one
[10:18:53] why is it not consistent?
[10:19:33] Well, if you just look at the namespaces, they sometimes have two tokens (e.g. revscoring-articlequality) and sometimes three (revscoring-editquality-damaging)
[10:20:07] yes the three ones are due to the big "editquality" split that we did a while ago
[10:21:22] I don't want to say that what we have is perfect, we can change it, but we all agreed on naming at the time and now if we mark those choices as "inconsistent" it seems a bit counter-productive
[10:21:58] Well, back then I wasn't aware how messy it would get :-/
[10:22:56] I'm just worried that the GW side of the config will become a huge mess of hard-to-understand REs, and that we will eventually run into collisions between model names and the like
[10:23:39] it can happen for sure, but this seems to me a limitation of the API-GW rather than an issue on our side :)
[10:23:59] It will be on us to maintain it, though
[10:24:05] anyway, it seems that there is not enough clarity on naming, we should talk about it tomorrow
[10:24:16] Yes.
[10:24:17] so everybody is on the same page, and we decide what and how to change things
[10:24:23] Sorry for being a grump about it.
[10:25:33] no no it is fine to discuss things, and naming is hard (moreover istio+kserve schemes play another role).
[10:26:07] and if we have to change something, we can still easily do it
[10:26:32] but let's agree on the format that we want first, and work backwards
[10:26:38] Ack.
[10:27:01] for example, having the :predict suffix for me is confusing for users
[10:27:09] The istio/isvc world is what it is, but if we can hide the gnarly bits of that from the users without setting ourselves on fire, we should do that
[10:27:12] it kinda implies that there are other actions etc..
[10:27:24] I mean there are
[10:27:28] like :explain
[10:27:35] Except, we don't do that
[10:28:20] yep
[10:28:35] maybe a verb in the URI Path is better, not sure, but let's discuss/decide it
[10:28:47] can you open a task with the team subscribed?
[10:28:56] with what we have now etc.. so we can decide
[10:29:06] ack
[10:29:10] super thanks
[10:37:32] * elukey lunch!
[10:46:57] Lift-Wing, Machine-Learning-Team: Decide external URL scheme (on API GW) for models on Lift Wing - https://phabricator.wikimedia.org/T319178 (klausman)
[10:47:19] elukey: ^^^ I tried summarizing our discussion and the state of matters a bit
[10:47:28] Lunch indeed :)
[10:48:01] Lift-Wing, Machine-Learning-Team: Decide external URL scheme (on API GW) for models on Lift Wing - https://phabricator.wikimedia.org/T319178 (klausman) p:Triage→High
[11:14:47] (PS1) AikoChou: outlink: add WP code list and increase gpllimit for MW API call [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/837642
[13:23:33] klausman: ack
[13:37:32] Lift-Wing, Machine-Learning-Team: Decide external URL scheme (on API GW) for models on Lift Wing - https://phabricator.wikimedia.org/T319178 (elukey) There are two things to keep in mind when querying Lift Wing. Let's pick and example: ` https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith...
[13:55:36] Morning all!
[13:55:54] o/
[14:19:14] \o
[15:50:28] aiko what does RRR stand for again. I forgot
[15:51:05] chrisalbon: revision revert risk
[15:51:10] thank you!
[15:58:54] just tried benthos from stat1004, I was able to
[15:59:05] 1) filter only enwiki traffic in mediawiki.revision-create
[15:59:14] 2) create a new json for liftwing with the event
[15:59:23] 3) contact the enwiki-goodfaith endpoint
[15:59:44] I got only issues due to high traffic, at some point there is too much backlog and I get errors
[15:59:54] but I see messages flowing in the revision-score-test topic
[16:01:02] really great
[16:02:01] this is the config https://phabricator.wikimedia.org/P35321
[16:05:19] wow.. that's nice!!
[16:08:20] yeah also added a rate limit now for max messages sent to lift wing, works like a charm
[16:33:45] Machine-Learning-Team, Data-Engineering, observability: Evaluate Benthos as stream processor - https://phabricator.wikimedia.org/T319214 (elukey)
[16:34:03] aiko: opened --^
[16:37:21] going afk for the evening, have a nice rest of the day folks!
[16:38:10] bye Luca! :D
[16:42:57] Machine-Learning-Team, Data-Engineering, observability: Evaluate Benthos as stream processor - https://phabricator.wikimedia.org/T319214 (JAllemandou) Ping @gmodena, as we talked about this exact topic this morning :)
[16:55:24] Benthos looks really simple to use. Nice
[17:54:06] Machine-Learning-Team, Data-Engineering, observability: Evaluate Benthos as stream processor - https://phabricator.wikimedia.org/T319214 (gmodena) thanks for the ping @JAllemandou . This looks really interesting, especially for ease of deployment. @elukey do you know if `http_client` calls are async...
[23:56:57] Machine-Learning-Team, Data Engineering Planning, Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (Ottomata) We put it in our current sprint to get a WIP 'test topic' version...