[06:44:48] Machine-Learning-Team, serviceops, Patch-For-Review: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (elukey) After a chat with @daniel on slack we realized that the endpoint is indeed used: https://w.wiki/6rKU @akosiaris I don't think...
[06:46:34] hello folks
[06:50:45] Buongiorno!
[07:10:54] (PS12) Ilias Sarantopoulos: feat: add Response Models in ores-legacy API [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929743 (https://phabricator.wikimedia.org/T330414)
[07:11:18] Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (elukey) @kevinbazira do we also need to host the GapFinder frontend? I would personally only keep the API, and let others maintain/build UIs (so we don't need bower etc.. in our code). I...
[07:12:24] so it seems that the current nodejs recommendation-api is being used :D
[07:12:40] so we'll need to find a new name for it, try to think about alternative names!
[07:21:41] commuting to the office, back in a bit
[07:35:04] Machine-Learning-Team: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (kevinbazira)
[07:41:57] (CR) Ilias Sarantopoulos: [C: +2] feat: add Response Models in ores-legacy API [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929743 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[07:43:03] (Merged) jenkins-bot: feat: add Response Models in ores-legacy API [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/929743 (https://phabricator.wikimedia.org/T330414) (owner: Ilias Sarantopoulos)
[07:43:53] Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (kevinbazira) The ML team agreed to migrate the [[ https://gerrit.wikimedia.org/g/research/recommendation-api | recommendation-api ]], this application has both a backend (API) and a fronten...
[07:45:48] Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (elukey) I am strongly against migrating the front-end, we should not do it in my opinion. It is not the idea of a service, since the UI can be easily built elsewhere (for example, reuse the...
[08:06:05] kevinbazira: o/ shall we discuss here the hosting of GapFinder?
[08:06:31] I would prefer not to have it on our infra to be honest, and concentrate only on the API
[08:06:50] elukey: it's best to discuss this with Chris around.
[08:07:12] kevinbazira: we can discuss it as a team, then Chris will read, I don't see problems :)
[08:07:40] my point is that we'll need to expose the UI to the internet, and so far we only expose APIs internally and via API-Gateway
[08:07:52] this is why I was a bit puzzled when I read about Gap Finder
[08:07:59] I have shared my take here: https://phabricator.wikimedia.org/T338805#8947913
[08:08:43] kevinbazira: sure but we can also discuss pros and cons, did you read what I wrote above?
[08:13:47] my intention is only to have open discussions in the team, if this is not well received I am sorry, but we should do it more often
[08:14:12] anyway, I'll raise my point at the team meeting, we can discuss it in there
[08:18:35] imo if the UI is quite simple it isn't much of a problem to host it from the perspective of the code but what Luca raises above worries me a bit.
[08:19:26] my question is: isn't there any other place we can host the UI? (a place that hosts UIs like wmcloud or sth)
[08:21:51] +1 exactly, we shouldn't be in the game of designing and maintaining UIs
[08:22:14] we offer apis and people build tools around it, way better protocol imho
[08:22:54] and I think that the goal of Research is to offer an API to Content Translation for their stuff, GapFinder was probably something built to show the API's power
[08:23:34] we can always expose the new recommendation-api via the API-Gateway (like link-recommendation) and anything in wmfcloud will be able to use it
[08:24:27] Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (isarantopoulos) To me it would seem better to host the UI in some other place (e.g. wmcloud) since Lift Wing is a platform for backend services. That way we could have rate limiting through...
[08:46:24] isaranto: I wanted to tweak the readinessProbe settings for falcon a little, I think that NLLB may need them, in case lemme know
[08:46:32] do you think that 6G are enough?
[08:46:47] ack
[08:47:12] 6G should be enough. I'm going to test it and see
[08:47:48] super, let us know how it goes
[09:49:17] Morning!
[09:50:06] aiko_: just for clarification, for the changeprop filter RE, we want matches on the typical language wikis (enwiki, frwiki etc), also wikiquoteswiki, but not commons, wikidata and wiktionary?
[09:54:52] Also, I am unsure which Regex engine the pattern would run in, and that is important, since e.g. the Go base library regex engine does not have negative lookahead. I'm currently trying to figure out which RE engine is used in the config there (since Envoy is written in Go, the current RE might not work).
[09:56:31] (nvm, this is not APIGW, but the question about the RE engine remains)
[10:00:51] I think changeprop uses the native Javascript engine, which supports NLA: https://regex101.com/r/sfExl8/1
[10:06:33] (PS1) Ilias Sarantopoulos: fix: nllb-200 src lang [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931569 (https://phabricator.wikimedia.org/T333861)
[10:07:19] (CR) Elukey: [C: +1] "lovely bug" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931569 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[10:08:17] going afk for lunch (a little earlier today)
[10:08:25] \o
[10:12:03] (CR) Ilias Sarantopoulos: fix: nllb-200 src lang (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931569 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[10:16:18] (CR) Ilias Sarantopoulos: [C: +2] fix: nllb-200 src lang [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931569 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
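A note on the regex-engine question raised at 09:54-10:00: the sketch below illustrates the difference between an engine with negative lookahead and one without. It uses Python's `re` module, which supports lookahead much like the JavaScript engine mentioned above. The pattern and the domain lists are hypothetical and are not the real changeprop filter; the second function shows one way to get the same result on an engine without lookahead, using an allow pattern plus a separate deny pattern.

```python
import re

# Hypothetical pattern, NOT the real changeprop filter: accept wiki domains
# but reject commons, wikidata and any wiktionary, via negative lookahead.
WITH_LOOKAHEAD = re.compile(
    r"^(?!commons\.wikimedia\.org$)"
    r"(?!www\.wikidata\.org$)"
    r"(?![a-z\-]+\.wiktionary\.org$)"
    r"[a-z\-]+\.(?:wikipedia|wikiquote|wikimedia|wikidata|wiktionary)\.org$"
)

# The same filter without lookahead: one allow pattern, one deny pattern.
# This form also works on engines that have no lookahead support.
ALLOW = re.compile(
    r"^[a-z\-]+\.(?:wikipedia|wikiquote|wikimedia|wikidata|wiktionary)\.org$"
)
DENY = re.compile(
    r"^(?:commons\.wikimedia\.org|www\.wikidata\.org|[a-z\-]+\.wiktionary\.org)$"
)


def wanted_lookahead(domain: str) -> bool:
    """Single-pattern check that relies on negative lookahead."""
    return WITH_LOOKAHEAD.match(domain) is not None


def wanted_two_step(domain: str) -> bool:
    """Two-pattern check that needs no lookahead support."""
    return ALLOW.match(domain) is not None and DENY.match(domain) is None


if __name__ == "__main__":
    for d in ["en.wikipedia.org", "en.wikiquote.org", "commons.wikimedia.org",
              "www.wikidata.org", "en.wiktionary.org"]:
        print(d, wanted_lookahead(d), wanted_two_step(d))
```

Both functions should agree on every input; the two-step form is only useful if the filter config allows more than one pattern, which is an assumption here.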
[10:17:24] (Merged) jenkins-bot: fix: nllb-200 src lang [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931569 (https://phabricator.wikimedia.org/T333861) (owner: Ilias Sarantopoulos)
[10:28:08] Machine-Learning-Team, serviceops, Patch-For-Review: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (akosiaris) >>! In T338471#8947818, @elukey wrote: > After a chat with @daniel on slack we realized that the endpoint is indeed used: ht...
[11:01:00] Machine-Learning-Team, serviceops, Patch-For-Review: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (daniel) Would it be possible to implement a compatibility layer, so the app can use the new service without any changes needed? Updating...
[11:16:59] I’m back!
[11:17:32] Hey!
[11:17:40] We missed you :) How was the trip?
[11:24:40] Welcome back!
[11:26:06] https://www.youtube.com/watch?v=QeRzLitmSog
[11:28:51] ouch! figured out a bug when deploying to staging. Our recent change from lists to dicts in the value files results in merged results (expected behavior) which ends up requesting gpus for staging cc: elukey
[11:30:06] ah, that might make scheduling difficult :D
[11:30:10] which also means that we deploy all services declared in prod and staging.
[11:30:20] is there a YAML way to "whiteout" the GPU request?
[11:32:15] I sent a patch that manually overrides it and sets it to amd.com/gpu: 0 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/931585
[11:32:37] not great, more of a quick workaround
[11:33:30] +1'd
[11:50:27] * klausman late lunch
[11:52:46] I have SO many slack messages
[11:56:23] * isaranto lunch!
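A note on the staging bug described at 11:28: the sketch below shows why switching from lists to dicts changes the result when values files are layered. It is a plain Python model of key-by-key merging, not Helm's actual implementation; the service name `nllb-200`, the `limits` key and the CPU numbers are hypothetical, and only `amd.com/gpu` comes from the discussion above.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Merge override into a copy of base.

    Dicts are merged key by key; any other value (lists included)
    is replaced wholesale by the override.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical shared values that request a GPU.
common = {"services": {"nllb-200": {"limits": {"cpu": "4", "amd.com/gpu": 1}}}}

# Hypothetical staging values that only tune the CPU limit.
staging = {"services": {"nllb-200": {"limits": {"cpu": "2"}}}}

# With dicts, staging inherits the GPU request it never mentioned.
merged = deep_merge(common, staging)

# The quick workaround from the patch above: override the request with 0.
workaround = deep_merge(merged, {"services": {"nllb-200": {"limits": {"amd.com/gpu": 0}}}})

# With lists, the override replaces the whole list, so nothing leaks through.
as_lists = deep_merge({"services": [{"name": "nllb-200", "gpu": 1}]},
                      {"services": [{"name": "nllb-200"}]})

if __name__ == "__main__":
    print(merged)
    print(workaround)
    print(as_lists)
```

The same merge also explains the second symptom: any service key declared only in the shared values still appears in the merged staging result.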
[12:08:05] chrisalbon_: o/
[12:08:15] isaranto: ah snap didn't think about that use case
[12:08:56] isaranto: the gpu zero trick seems ok for the moment, but no idea what's best to be honest
[12:16:29] Could we flip the logic? Define the common bits without the GPU requirement, and add it for the prod services?
[12:20:21] Also, heads up: merging the changeprop update. will then test in staging and codfw before pushing to eqiad. Will keep Hugh in the loop.
[12:21:03] klausman: no because the common bits are shared with all the isvcs
[12:21:35] ah, damn
[12:30:23] kevinbazira: we'd need to find a new name for the recommendation api, it seems from T338471 that we can't remove the old one
[12:30:40] does anybody have any idea? recommendation-ng-api ?
[12:31:20] I have a strong dislike for "ng" names
[12:31:27] but it might be the only option
[12:32:21] Unrelated: pushing the changeprop to eqiad now. Will keep a close eye on resources and logs for outlink pods
[12:32:31] okok
[12:33:35] Machine-Learning-Team, serviceops, Patch-For-Review: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (elukey) @diego we probably need to figure out a path forward for this, namely: 1) review how the old recommendation-api works, what tr...
[12:33:54] the recommendation bit is probably needed, since the repo etc.. is called the same
[12:34:13] maybe only 'recommendation.discovery.wmnet' ?
[12:34:46] changeprop: req rate definitely increasing in eqiad, good.
[12:35:39] might recommendation need a qualifier? Like, what specifically it makes recommendations for? Just in case another recom service shows up in a few months
[12:36:43] not sure if needed, it can get stale if the scope changes etc..
[12:36:59] wmflabs's domain is 'recommend.wmflabs.org'
[12:37:11] elukey: I am open minded to any names really. I also see you're suggesting deprecation in T338471#8948653 so there might be no need to change the name.
[12:37:59] kevinbazira: we cannot deprecate, the old service is used by our Android app, see the last posts
[12:38:35] eventually it would be nice to deprecate, but probably we'll need to add something to the new service to support the Android use cases
[12:38:42] so in theory we should:
[12:38:43] 1)
[12:38:51] add the new api calling it with a new name
[12:39:03] 2) allow content translation to use it etc..
[12:39:14] 3) see if we can add functionalities and migrate folks using the old api to it
[12:39:18] 4) deprecate the old one
[12:41:09] elukey: I have seen the old posts, if your suggestion in T338471#8948653 is bought, then a deprecation might happen.
[12:42:29] kevinbazira: nope, because we'd need to deprecate the current one before adding the new, and it would disrupt current users (the Wikimedia app)
[12:42:51] we will only be able to ask folks to migrate over to a new thing
[12:43:20] changeprop push seems all good. I see some increase in CPU usage (about doubling, but still waaaaay under limit: limit is 1s/s, we're currently at 20ms/s). Memory usage hasn't changed.
[12:43:32] will update ticket accordingly
[12:43:41] klausman: logs etc.. all good on the pods?
[12:43:45] yep
[12:43:48] super
[12:43:56] more requests, no errors that I have spotted
[12:44:40] elukey: another suggestion could be versioning instead of renaming
[12:45:35] Machine-Learning-Team, Data-Engineering-Planning, Event-Platform Value Stream: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (klausman) Change 930610 has been pushed to prod, so now we get the full feed from changeprop. CPU usage of the outlink pods ha...
[12:45:44] like recommendation-v2?
[12:48:24] klausman: yes, "old-name-v2"
[12:50:34] so would oldname and oldname-v2 imply that they do semantically the same thing but differ in details? Because that would work for the current situation, but not some future recom API that does something wildly different.
OTOH, I have no idea how likely that is.
[12:53:05] at this point I'd like to get some info from Research, this is not really great
[12:53:22] :+1:
[12:54:32] Machine-Learning-Team, serviceops, Patch-For-Review: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (diego) >>! In T338471#8948653, @elukey wrote: > @diego we probably need to figure out a path forward for this, namely: > > 1) re...
[13:00:44] elukey: the research team shared a name suggestion in T333893#8931138
[13:03:07] missed it, if they are happy with it then ok
[13:03:51] but I think that the main problem remains, namely the scope of the new service
[13:04:07] does it replace the old one? Will it be expanded in the future?
[13:07:42] yes and yes, based on my current understanding of the migration project.
[13:08:32] if the scope is expanded then adding "Translation" to the name may not be great
[13:12:23] Since Diego suggested the new name in T333893#8931138, they will likely have more ideas for a good name. We might want to engage them.
[13:17:06] elukey: when you have some time and brain bw, I'd like to hear your opinion regarding the k8s pod dashboards
[13:39:03] elukey: also, I made a tool to extract some info from JWTs (API GW tokens) since we may need it for manually elevating user tokens: https://github.com/klausman/jwtdec
[13:40:05] A Python tool might be easier for users, but I had most of the Go code already :)
[13:52:25] klausman: sorry I was in a meeting, after the team meeting if you want!
[13:52:33] sgtm
[13:53:08] kevinbazira: so I had a chat with Miriam, and after re-reading https://phabricator.wikimedia.org/T308165#7983559 with the context that we have I noticed this sentence:
[13:53:15] "For example, similar service with endpoints described/testable here.
@bmansurov would know more about the history of that service and why contenttranslation is still using the python instance on cloud vps instead of the nodejs version on mediawiki."
[13:53:35] from Miriam's recollection both services (nodejs and python) are super old, like gap finder
[13:53:56] so maybe the nodejs app, what we call old, is already the service that content translation needs
[13:54:06] even if there are some differences IIUC
[13:54:13] so I'll try to ping Isaac next
[13:55:17] sure
[14:12:23] Machine-Learning-Team, serviceops, Patch-For-Review: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (elukey) @diego in https://phabricator.wikimedia.org/T308165#7983559 @Isaac also mentioned this: > For example, similar service with en...
[15:21:33] elukey: the req/s are now at about 30rps and plateauing, the CPU usage is at ~85ms/s for kserve, so we good. memory usage isn't moving beyond baseline noise.
[15:26:21] aiko_: the change we discussed (meta) seems to have worked, but I am still seeing some errors in the log: https://phabricator.wikimedia.org/P49459
[15:29:28] Machine-Learning-Team, Foundational Technology Requests: Content Translation Recommendations API - https://phabricator.wikimedia.org/T293648 (elukey) Hi folks! The ML team has been working to add the Python service outlined in https://recommend.wmflabs.org to production on K8s, but we realized that anot...
[15:31:39] very nice
[15:31:50] kevinbazira: commented in https://phabricator.wikimedia.org/T293648#8949413
[15:32:37] those errors are from deleted pages e.g. https://en.wikipedia.org/wiki/Ministry_of_Parliamentary_Affairs_(Karnataka)
[15:32:57] couldn't fetch revision
[15:33:04] Ah, righto.
[15:34:39] hmm but the page with the title The_National_Centre_for_Artificial_Intelligence exists, it redirects to https://en.wikipedia.org/wiki/National_Centre_for_Artificial_Intelligence
[15:35:24] page "Emirates Development Bank PJSC" was deleted as well
[15:38:54] Is this something we need to address somehow?
[15:42:09] do you know where I can see the input events that cause these errors?
[15:43:11] maybe we can add logging so that it will be handy as well in the future, sounds like an interesting +1 to have it
[15:43:23] I think it might be on logstash, but I'm not sure. Hugh (nowlan) might know
[15:44:43] elukey: yes
[15:46:07] in the changeprop config, I already set page.is_redirect: false, so we should not see redirected pages
[15:46:58] running an errand bbiab
[16:21:03] aiko_: Hugh says the only way to really see the original Kafka message and resultant query made by changeprop is to copy one from the prod stream, feed it to the staging changeprop instance and stare at its logs
[16:33:14] * elukey afk!
[16:33:15] o/
[16:35:32] (PS1) Ilias Sarantopoulos: llm: 8bit quantization [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/931648 (https://phabricator.wikimedia.org/T334583)
[16:58:07] Machine-Learning-Team, Patch-For-Review: Host open source LLM (bloom, etc.) on Lift Wing - https://phabricator.wikimedia.org/T333861 (isarantopoulos) The model https://huggingface.co/facebook/nllb-200-distilled-600M has been deployed on Lift Wing with and without GPU. In this raw form we have an average...
[16:58:21] going afk, cu all tomorrow!
[17:25:40] klausman: o/ ack
[21:23:14] Machine-Learning-Team, Data-Engineering-Planning, Event-Platform Value Stream: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (Isaac) this is VERRRRRY exciting! thank you all! I took a look at the event table on Hive and did some basic quality checks and...