[04:07:10] (CR) Ilias Sarantopoulos: [C:+2] revscoring: add flag to log JSON inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020898 (https://phabricator.wikimedia.org/T362503) (owner: Elukey)
[04:13:22] (Merged) jenkins-bot: revscoring: add flag to log JSON inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020898 (https://phabricator.wikimedia.org/T362503) (owner: Elukey)
[04:22:07] Good morning!
[04:26:16] things seem to be going much better over the last 9h - exactly after the replica increase
[05:21:59] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1021056
[05:55:48] (CR) Kevin Bazira: [C:+1] "I've tested batch_model locally with both batch and non-batch requests and it works as advertised: https://phabricator.wikimedia.org/P6083" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020835 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou)
[07:28:47] I'm going to deploy the new change to revscoring-damaging. This means that all the model servers in the namespace will be redeployed
[08:05:46] hello folks, passing by to check the latency, seems fine!
[08:05:59] at this point it seems it's not revscoring ending up in a weird state
[08:06:13] but rev-ids causing high preprocess CPU time, stalling everything, etc.
[08:06:19] or maybe both
[08:06:21] sigh
[08:12:43] hey Luca, I'm currently deploying the new changes so that we can see the rev-ids at least
[08:14:26] oops, models are crashing on staging. I'm investigating and will report back. Not related to the latest changes, but there has been no new deployment since November, so I'm looking at what has changed
[08:15:09] error from logs -> https://phabricator.wikimedia.org/P60868
[08:33:10] Morning!
[08:33:47] The circular import problem is odd in that I'd expect it to happen in off-LW testing as well.
But then again, my familiarity with how the revscoring models work internally is limited.
[08:35:39] morning Tobias!
[08:36:06] I found the root cause! https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/981718
[08:36:26] that's what you get when you introduce a lot of changes
[08:37:05] ah, I see. Is it just import reordering, or a new import?
[08:39:08] import ordering. But I'm trying to fix it so that the order doesn't affect us
[08:42:22] Machine-Learning-Team: Fix revscoring model servers - https://phabricator.wikimedia.org/T362853 (isarantopoulos) NEW
[09:17:52] (PS1) Ilias Sarantopoulos: revscoring: fix circular import [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1021384 (https://phabricator.wikimedia.org/T362853)
[09:24:05] Machine-Learning-Team, Patch-For-Review: Fix revscoring model servers - https://phabricator.wikimedia.org/T362853#9725578 (isarantopoulos) This was caused by a change in the order of the imports. RevscoringModelMP depends on RevscoringModel and RevscoringModelType. There are 2 solutions: - import Re...
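The import-ordering cycle described in T362853 can be reproduced in miniature. Below is a minimal sketch, assuming a layout where a base-model module and a multiprocessing subclass module import each other at load time; the class names are taken from the task comment above, but the structure and the lazy-import fix shown are a guess at the pattern, not the actual inference-services code:

```python
# Hypothetical sketch of the circular-import situation (assumed layout):
#   model.py    -> defines RevscoringModel (and RevscoringModelType)
#   model_mp.py -> defines RevscoringModelMP, which subclasses RevscoringModel
#
# If model.py also imports RevscoringModelMP at module level (e.g. in a
# factory), the two modules import each other and one of them sees a
# half-initialised module, depending on which is imported first.
#
# One order-independent fix: defer the import until it is actually needed.

class RevscoringModel:  # stand-in for the real base class
    @staticmethod
    def create(use_mp: bool) -> str:
        if use_mp:
            # Imported lazily, inside the function, so model_mp.py can in
            # turn import RevscoringModel at load time without a cycle:
            # from model_mp import RevscoringModelMP   # real code would do this
            return "RevscoringModelMP"                 # stub for this sketch
        return "RevscoringModel"

print(RevscoringModel.create(False))  # RevscoringModel
print(RevscoringModel.create(True))   # RevscoringModelMP
```

The alternative mentioned in the task (importing modules rather than names, or reordering imports) also works, but a function-level import makes the fix independent of import order.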
[09:38:36] (CR) Klausman: [C:+1] revscoring: fix circular import [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1021384 (https://phabricator.wikimedia.org/T362853) (owner: Ilias Sarantopoulos)
[10:02:16] (CR) Ilias Sarantopoulos: [C:+2] revscoring: fix circular import [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1021384 (https://phabricator.wikimedia.org/T362853) (owner: Ilias Sarantopoulos)
[10:03:02] (Merged) jenkins-bot: revscoring: fix circular import [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1021384 (https://phabricator.wikimedia.org/T362853) (owner: Ilias Sarantopoulos)
[10:03:53] Machine-Learning-Team, ORES: Create basic alerts for isvcs to catch outages - https://phabricator.wikimedia.org/T362661#9725696 (klausman) a: klausman
[10:19:06] * klausman lunch
[11:02:38] * isaranto lunch
[12:01:04] ok, revscoring-damaging in staging is fine now with the new changes
[12:01:14] deploying them to codfw first and then to eqiad
[12:01:58] :+1:
[12:19:18] done in both eqiad and codfw
[12:24:51] hello folks!
[12:28:37] Buongiorno!
[12:31:11] it took some time to deploy the logging change, as I found an older bug
[12:32:12] and I missed the typo in the logging patch: paylod instead of payload. But we've got bigger problems, so I'll fix it later
[12:35:12] ouch, sorry :(
[12:35:26] sending a patch
[12:35:40] it was my fault!
[12:36:13] ah, it is a typo in the error msg!
[12:36:22] so everything works but it is misspelled, right?
[12:36:52] what bigger problems do we have?
[12:39:20] yes yes, everything works, let's leave it. By bigger problems I mean just the increased latencies, not sth new!
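The "add flag to log JSON inputs" patch being discussed could look roughly like the following. This is a sketch only: the environment-variable name, the handler function, and the log wording are assumptions, not the actual patch (which, per the chat, also had the payload typo in its log message).

```python
# Hedged sketch of flag-gated JSON input logging; names are assumptions.
import json
import logging
import os

logging.basicConfig(level=logging.INFO)

# Toggled per deployment via an environment variable (assumed mechanism),
# so the extra log volume is only paid when debugging latency issues.
LOG_JSON_INPUT = os.environ.get("LOG_JSON_INPUT", "false").lower() == "true"

def handle_request(payload: dict) -> dict:
    """Log the incoming JSON payload when the flag is enabled, then serve."""
    if LOG_JSON_INPUT:
        logging.info("JSON payload for the request: %s", json.dumps(payload))
    # ... preprocess / predict would go here ...
    return {"received_rev_id": payload.get("rev_id")}

print(handle_request({"rev_id": 137320697}))
```

With the flag on, each request's rev_id lands in the logs, which is what made it possible to correlate specific rev-ids with the slow preprocess times seen later in the day.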
[12:40:48] yes yes, ok, I thought I had missed some other horror :)
[12:41:26] we also opened a task a while ago to figure out if some model server needed multiprocessing; maybe some revscoring pods are a good candidate
[12:46:46] yes, we could try deploying ruwiki on ml-staging with mp
[12:48:42] I noticed some increased latencies
[12:48:42] ```
[12:48:42] INFO:root:JSON paylod for the request: {'rev_id': 137320697}
[12:48:42] INFO:root:Function get_revscoring_extractor_cache took 26.5574 seconds to execute.
[12:48:42] ```
[12:49:34] Machine-Learning-Team, Patch-For-Review: Fix revscoring model servers - https://phabricator.wikimedia.org/T362853#9726201 (isarantopoulos) Open→Resolved a: isarantopoulos
[12:49:37] niceee
[12:50:23] 26s is also "fast" compared to yesterday :D
[13:18:00] klausman: thanks for the reviews!
[13:21:03] elukey: np! Also had a first go at an alert for LW services having lots of 500s (and 0s)
[13:33:55] the knative-serving chart change is basically a no-op; only a few comments moved
[13:34:00] but the config now seems cleaner
[13:36:37] Agreed. It took me a bit to grok the model-value changes until I realized how much of it was just indentation
[13:37:20] I'm also ignoring the mw-api-int-ro changes from c.laime for now; if you want me to have a look at them, lmk
[13:40:21] nono, I'll have to send more :D
[13:51:15] good morning all
[13:52:14] good morning!
[14:09:44] o/
[14:47:14] Heyo Chris
[15:03:58] * isaranto afk! bbl to wrap things up
[15:18:24] ok, so the final Istio config should be https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1021490
[15:25:58] and also filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021506 to fix the liftwing_staging httpbb tests
[15:55:52] (CR) AikoChou: "Thank you for testing it and providing feedback! I added a limit on batch size. That's a really good point.
After these changes are made i" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020835 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou)
[16:21:28] going afk folks! Have a nice rest of the day!
[16:22:18] have a nice evening, Luca
[16:23:54] I'm back for a bit!
[16:23:59] ciao Luca o/
[16:41:12] (CR) Ilias Sarantopoulos: [C:+1] "Nice! I like the idea of having 1 endpoint/deployment that does both." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020835 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou)
[16:43:39] Llama3! https://llama.meta.com/llama3/
[16:44:28] I went afk for an hour and so many things happened already!
[16:52:30] eventually logging off for the evening folks. cu tomorrow o/
[17:07:43] bye Ilias!
[17:15:47] Machine-Learning-Team, Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9727434 (kevinbazira) I have deployed the logo-detection model-server in the experimental namespace on LiftWing staging. On checking the pod, I noticed it was not st...
[17:19:31] Machine-Learning-Team, Patch-For-Review: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749#9727438 (kevinbazira) I have reviewed the logs and found that the `CrashLoopBackOff` error is occurring because the model-server lacks the necessary permissions to l...
[17:27:25] klausman: o/ whenever you get a minute tomorrow, please help resolve this issue where the logo-detection model-server deployed in the experimental namespace on staging doesn't have permission to load the model: https://phabricator.wikimedia.org/T362749#9727438 thanks in advance!
[17:27:59] will do. I suspect /mnt/models doesn't have the right perms
[17:29:01] great. thanks!
[17:38:30] Heading out now, g'night everyone
[19:35:30] night all!
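For reference, the `Function ... took N seconds to execute.` lines quoted earlier in the day look like the output of a timing decorator. A minimal sketch of that pattern follows; the decorator name and the stand-in body for `get_revscoring_extractor_cache` are assumptions, not the actual inference-services code:

```python
# Sketch of a timing decorator producing log lines like
# "INFO:root:Function get_revscoring_extractor_cache took 26.5574 seconds to execute."
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def log_execution_time(func):
    """Log how long the wrapped function took, even if it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            logging.info(
                "Function %s took %.4f seconds to execute.", func.__name__, elapsed
            )
    return wrapper

@log_execution_time
def get_revscoring_extractor_cache(rev_id: int) -> dict:
    time.sleep(0.01)  # stand-in for the real MediaWiki API calls
    return {"rev_id": rev_id}

print(get_revscoring_extractor_cache(137320697))
```

Timing log lines like these are what let the team say that a 26 s extractor-cache fetch was "fast" relative to the previous day, and to tie slow requests to specific rev-ids once input logging was enabled.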