[07:01:06] <isaranto>	 o/
[08:56:45] <isaranto>	 I was checking eswiki damaging alert. Same thing...
[08:57:41] <isaranto>	 I'm thinking to take Luca's suggestion and increase min replicas to 2 for the services that have issues and discuss tackling the issue as well
[08:57:56] * isaranto early lunch and errand
[09:26:48] <klausman>	 Morning!
[09:44:45] * klausman lunch as well
[09:52:32] <aiko>	 finally have a version for handling redirects in mwapi.. https://github.com/AikoChou/python-mwapi/commit/cdf6eabc99c2e2d136ef54514f65ec7353e95a35
[09:52:53] <aiko>	 add **request_params in case in the future we want to pass other params to the request
[10:02:46] <aiko>	 thinking to test it in revscoring models first because it uses session.get directly. revertrisk model makes request via knowledge integrity..
[10:24:27] <aiko>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1038736
[10:24:33] <aiko>	 ---^ if anyone has time
[10:39:29] <isaranto>	 o/ aiko! I'll review and test it!
[10:41:04] <aiko>	 isaranto: +1 increase min replicas to 2 for now. the errors happen quite often recently.. and we could discuss the issue today in the meeting
[10:47:04] <aiko>	 isaranto: ohh regarding the redirect patch, I need to first test it and I'll open a MR afterwards and ask for your review :)
[10:47:21] <isaranto>	 ok!
[11:18:04] * aiko lunch!
[11:22:41] <moritzm>	 klausman: FYI, there's a new round of kernel reboots, list of ML hosts is at https://phabricator.wikimedia.org/T366555 (all kernels/microcode updated, fleet-wide, "only" needs the reboots)
[11:38:54] <chrisalbon>	 Good morning all
[11:41:21] <klausman>	 moritzm: roger!
[11:41:36] <klausman>	 hey Chris!
[11:42:24] <isaranto>	 o/ Chris
[12:24:15] <elukey>	 aiko: o/ re: python-mwapi - another option is to avoid "force_http" in mwapi, and just set allow_redirects=False. Then the logic of retrying etc.. would be stored in inference-services' repo, to separate concern. Both approaches are good, but if you add "force_http" in mw-api be sure to add docs about it since without context the parameter may be confusing (like people asking Why do we need to force 
[12:24:21] <elukey>	 http?)
[12:27:59] <elukey>	 isaranto: re eswiki - one thing that we could do is to set revscoring-mp for problematic model servers (like eswiki, viwiki) and offload preprocess() to a separate Python process. We could also think about doing it selectively, when "size" returned by the MWAPI is above a certain thresold (so we don't have to pay the serialization/deserialization penalty for "quick" rev-ids)
[12:32:19] <isaranto>	 elukey: ack, I can do that in staging and test results
[12:32:38] <isaranto>	 I mean test with current problematic rev id
[12:33:36] <elukey>	 makes sense yes, but the key I think it is doing it selectively, otherwise for regular workload we'll pay a lot of latency
[12:40:23] <isaranto>	 I was thinking to start with viwiki
[12:49:19] <isaranto>	 this is what I mean https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1038765
[12:51:12] <wikibugs>	 (03PS1) 10Elukey: revscoring_model: inspect mw-api-cache for MP preprocess [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336)
[12:51:30] <elukey>	 isaranto: --^ this is an extra bit that I meant aboce
[12:51:33] <elukey>	 *above
[12:51:37] <elukey>	 needs to be tested of course
[12:53:08] <elukey>	 the main issue with PREPROCESS_MP (that we observed in the past) was that serialization/deserialization took more than preprocess() and process() sometimes
[12:53:36] <klausman>	 what's the serialization format? pickle? Or JSON?
[12:53:39] <elukey>	 so we could offload preprocess selectively, to allow quick rev-ids to be executed in the main loop
[12:53:42] <elukey>	 pickle
[12:54:05] <elukey>	 process to process message passing basically
[12:54:11] <klausman>	 Hm, unlikely to find something much faster unless we'd go superfancy, like protobuf
[12:56:14] <isaranto>	 elukey: okk, now I got it. this is much better
[12:56:41] <elukey>	 isaranto: does it make sense? I am not 100% sure that "size" can be accessed in that way, but it should be available
[12:57:12] <elukey>	 we could tune the MWAPI_REVID_CONTENT_THRESHOLD_BYTES based on the use cases
[12:57:50] <elukey>	 klausman: ray does something like that, IIRC kserve supports it natively (sort of), but when Aiko checked IIRC it needed a separate server process to handle all the heavy lifting
[12:58:03] <elukey>	 not sure if today things are better
[12:58:28] <elukey>	 we use python's built-in message passing rather than ray
[13:03:43] <klausman>	 yeah, it's probably worth just the smaller maintenance overhead
[13:08:43] <isaranto>	 it makes sense, I too don't know if size can be accessed that way, but I can check
[13:32:51] <isaranto>	 I'm building the above patch locally to test it
[13:35:53] <elukey>	 <3
[14:28:21] <wikibugs>	 06Machine-Learning-Team: Test Revert Risk model with the transparent config - https://phabricator.wikimedia.org/T366250#9859905 (10isarantopoulos) a:03achou
[14:47:36] <wikibugs>	 06Machine-Learning-Team, 13Patch-For-Review: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines - https://phabricator.wikimedia.org/T360428#9859995 (10klausman) 05Open→03Resolved
[14:56:15] <aiko>	 elukey: yeah it's a bit weird to have force_http in mwapi, probably we are the only ones who need it lol. I'll look into the option you mentioned!
[15:05:17] <elukey>	 aiko: nono the solution is good as well, it encapsulate all the logic in there, maybe we could check if the logic is needed only in mw-api or elsewhere too
[15:05:35] <elukey>	 if so, it may be better to have a separate util in inference-services
[15:05:41] <elukey>	 so we'll reuse/dry more code etc..
[15:14:52] <aiko>	 ok sounds good :)
[15:20:44] <wikibugs>	 06Machine-Learning-Team, 05Goal: 2024 Q4: Users can "pip install liftwing" and access 20% of models - https://phabricator.wikimedia.org/T359140#9860217 (10isarantopoulos) We have added request payload validation with pydantic and currently adding more models to the package.
[15:23:31] <isaranto>	 aiko: this is the error I was getting in HF for Mistral https://phabricator.wikimedia.org/T365246#9835940
[15:26:43] <isaranto>	 I haven't had time to look into but imo there is not much to debug on our side since we are using third party code. So I would just look if a) the new mistral version works b)if there is a fix in kserve or transformers package (or fastapi) which would be solved by an update
[15:26:55] <isaranto>	 for now I'm just focusing on using a 7B model that would work
[15:27:00] <isaranto>	 out of the box I mean
[15:58:17] <aiko>	 isaranto: ack! thank uuu
[16:22:33] <isaranto>	 elukey: I tested the patch you provided earlier and I made it work with some modification so that we can access the "size" field properly
[16:22:38] <isaranto>	 this is the change https://phabricator.wikimedia.org/P64026
[16:22:47] <isaranto>	 shall I modify the patch directly?
[16:22:58] <elukey>	 ah nice yes please!
[16:35:56] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: revscoring_model: inspect mw-api-cache for MP preprocess [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: 10Elukey)
[16:36:16] <wikibugs>	 (03PS3) 10Ilias Sarantopoulos: revscoring_model: inspect mw-api-cache for MP preprocess [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: 10Elukey)
[16:36:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] revscoring_model: inspect mw-api-cache for MP preprocess [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: 10Elukey)
[16:36:59] <isaranto>	 I submitted the above --^ but need to test it a bit more to make sure no errors fall through the cracks
[16:37:48] <wikibugs>	 (03PS4) 10Ilias Sarantopoulos: revscoring_model: inspect mw-api-cache for MP preprocess [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1038766 (https://phabricator.wikimedia.org/T363336) (owner: 10Elukey)
[16:38:01] <isaranto>	 going afk folks o/
[16:40:07] <elukey>	 o/
[16:40:19] <elukey>	 lemme know if you want to brainbounce about testing during the next days
[16:48:11] <isaranto>	 Ok,thank you!
[17:07:35] <aiko>	 o/
[21:32:09] <wikibugs>	 06Machine-Learning-Team: Using LiftWing on non wikimedia wikis - https://phabricator.wikimedia.org/T366654 (10Nicolas_NALLET) 03NEW