[05:16:07] Good morning o/
[07:48:06] kevinbazira: o/ if you're working on outlink as well let's sync here
[07:48:33] isaranto: o/ sure sure
[08:08:13] I deployed an isvc in experimental so I can play around
[08:09:38] so the fact that the args are there means they should be read on start by the transformer
[08:10:05] great. if the transformer works then we are good
[08:10:29] I didn't say it works :)
[08:13:02] in the service I deployed I changed the entrypoint command to `python3 transformer/transformer.py --predictor_host outlink-topic-model-predictor.experimental` and it worked fine. I then changed the predictor host to some dummy value like `sadada` and it worked again, as the value that is used is the one specified in args
[08:15:04] I'm building an image that sets the cmd arg required=False to check if this would work. Even if it does, it is good for us to dig a bit and understand how the transformer service is built. In my understanding, and by looking at kserve code, the values are created automatically. will look at it after I finish with these tests
[08:28:08] yep. setting the `predictor_host` and `model_name` explicitly is how I got it to work locally using the venv and Makefile. it was interesting to see that these weren't set in prod and it was still working there.
[08:31:55] hello folks o/
[08:33:44] hi Aiko!
[08:36:11] I made a patch to test my assumption for the dummy predictor_host in ml-staging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1055384
[08:43:38] hmm, no diffs found. lemme recheck
[08:46:22] what happened to the articletopic-outlink? I saw it was firing yesterday, is it back to functioning? do we have a report of the issue, or are we investigating? sorry, I just came back
[08:46:39] no worries, welcome back!
[08:47:01] there is an issue, a summary of which you can find here https://phabricator.wikimedia.org/T370408#9995559
[08:47:35] nice, thank you <3
[08:48:06] I'm available to jump on a call anytime if you wish
[08:48:49] aiko: 早上好!
[08:50:35] I figured out why there is no diff. the container config is not added in the kserve-inference chart, I'm filing a change to include it
[08:50:56] buongiorno Luca o/
[08:52:47] let me look into it first
[08:53:02] 早上好, Luca :D
[08:53:32] isaranto: kalimera :D
[09:00:34] aiko: is it fine to just say 早好?
[09:03:09] elukey: no, but it is fine to just say 早!
[09:07:44] lol ok
[09:07:52] my intuition is still not great :D
[09:08:47] I filed a change for the transformer container in the kserve-inference chart if anyone wants to take a look https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1055389
[09:10:09] Luca I see you're getting better and better at Mandarin!
[09:10:44] I'd like to study as well at some point, seems like a big challenge
[09:11:59] so the error happened after the change of migrating to the src dir, right?
[09:13:53] yes, but there are more changes that weren't deployed
[09:17:07] yeah I get it, the last deployment was last year
[09:22:44] aha! I found it
[09:23:59] what is that? :D
[09:24:30] should have figured it out earlier. So the cmd args are there and they are passed to the command that runs on docker. But we are using the entrypoint.sh (model_server_entrypoint.sh) script, which doesn't accept additional arguments
[09:25:00] so they are passed to the command but cannot be parsed by the script, and so are not provided to `python3 transformer/transformer.py`
[09:26:02] at least this is my current understanding!
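To make the failure mode described above concrete, here is a minimal, hypothetical sketch of the entrypoint pattern being discussed (not the actual model_server_entrypoint.sh; only the transformer.py path and the --predictor_host example are taken from the log):

```bash
#!/bin/bash
# Hypothetical, simplified entrypoint illustrating the bug described above
# (not the real model_server_entrypoint.sh). KServe appends options such as
# "--predictor_host outlink-topic-model-predictor.experimental" to the
# container args, so they arrive here as "$@"...
set -euo pipefail

MODEL_SERVER_PATH="transformer/transformer.py"

# ...but because "$@" is never passed on, transformer.py only ever sees
# whatever defaults its own argument parsing provides.
exec python3 "${MODEL_SERVER_PATH}"
```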
[09:26:15] going to try and validate it
[09:28:17] so that would also be the reason why the isvc I put in experimental worked (you can read about it above, I still haven't written about this in the task)
[09:31:54] yep, that is it. So what we need to do is allow the model_server_entrypoint.sh to read all the additional cmd arguments https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/model_server_entrypoint.sh#12
[09:37:34] so the predictor_host arg is set automatically and passed by the kserve predictor, but in the model_server_entrypoint.sh for transformers we didn't accept additional args, correct?
[09:37:51] exactly
[09:38:53] and this is also seen in the logs https://phabricator.wikimedia.org/T370408#9994904
[09:38:53] where the command that runs is MODEL_SERVER_PATH=transformer/transformer.py
[09:39:13] (PS1) Ilias Sarantopoulos: add additional arguments in model_server_entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055392 (https://phabricator.wikimedia.org/T370408)
[09:39:48] (PS2) Ilias Sarantopoulos: add additional arguments in model_server_entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055392 (https://phabricator.wikimedia.org/T370408)
[09:41:28] since this affects all model servers we need to be careful and update the services one by one in ml-staging to make sure that everything works as expected
[09:43:25] ah, and since kserve automagically generates the option, it's nowhere to be found in the deployment-charts repo
[09:43:52] (PS3) Ilias Sarantopoulos: add additional arguments in model_server_entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055392 (https://phabricator.wikimedia.org/T370408)
[09:44:11] (CR) Klausman: [C:+1] "Copied votes on follow-up patch sets have been updated:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055392 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[09:44:23] Nice find, Ilias!
[09:44:53] thanks, a fresh look always helps!
[09:45:01] yeah, agreed
[09:46:07] (CR) AikoChou: [C:+1] add additional arguments in model_server_entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055392 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[09:46:37] I also rebuilt the predictor image and all seems well
[09:49:17] I suggest we just deploy this in ml-staging outlink, verify that it works, and then update any other model servers after the LLM sprint week so we don't have any distractions. would that be ok?
[09:53:59] SGTM. I'd even be willing to deploy it to prod on Monday and keep an eye on things.
[09:55:57] that is ok as well. I just wouldn't want to start deploying all the services. if it is just outlink I can even give it a try today
[09:56:26] (CR) Kevin Bazira: add additional arguments in model_server_entrypoint.sh (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055392 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[09:56:29] Yeah, agreed.
[09:56:39] Do we have "customers" for outlink yet?
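A sketch of the direction the patch above takes, again using a simplified stand-in for the real script rather than the merged change verbatim:

```bash
#!/bin/bash
# Simplified sketch of the fix direction (not the actual merged patch):
# forward the remaining container args through to the model server.
set -euo pipefail

MODEL_SERVER_PATH="transformer/transformer.py"

# "$@" now carries the KServe-generated options (e.g. --predictor_host,
# --model_name) through to transformer.py's own argument parsing.
exec python3 "${MODEL_SERVER_PATH}" "$@"
```

With that in place, a (made-up) invocation like `./model_server_entrypoint.sh --predictor_host outlink-topic-model-predictor.experimental --model_name outlink-topic-model` would behave the same as running transformer.py with those flags directly.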
[09:57:36] yes we have customers
[09:57:57] via streaming
[09:58:18] (CR) Ilias Sarantopoulos: add additional arguments in model_server_entrypoint.sh (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055392 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[10:01:56] I'd prefer to verify that it works in staging today but deploy to prod after the LLM sprint
[10:04:03] ack.
[10:04:39] At least the update is not needed to unbreak something else, so we're not under too much time pressure.
[10:05:27] Machine-Learning-Team, Patch-For-Review: Fix articletopic-outlink CrashLoopBackOff issue - https://phabricator.wikimedia.org/T370408#9997946 (isarantopoulos) The issue was caused because the cmd args that are passed to the container are not parsed by the `model_server_entrypoint.sh` script which is the e...
[10:05:40] (CR) Kevin Bazira: [C:+1] add additional arguments in model_server_entrypoint.sh (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055392 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[10:10:19] Machine-Learning-Team, Patch-For-Review: Fix articletopic-outlink CrashLoopBackOff issue - https://phabricator.wikimedia.org/T370408#9997967 (klausman) One side note: the reason we never found any mention of `--predictor_host` in the deployment charts, despite seeing it in the (old) production deployment...
[10:10:42] * klausman lunch
[10:10:51] (CR) Ilias Sarantopoulos: [V:+2 C:+2] add additional arguments in model_server_entrypoint.sh [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055392 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[10:11:49] let's do anything prod related after the LLM sprint to avoid any friction then!
[10:12:37] just to clarify that we didn't have any outage, the new deployment failed but the old one was there serving traffic. So all we did was revert the image update
[10:13:03] yep, I think nobody but us noticed
[10:13:26] (in stark contrast to a certain other worldwide IT outage...)
[10:14:18] that is wiiiiiiiild
[10:14:31] lol
[10:28:52] Machine-Learning-Team, Patch-For-Review: Fix articletopic-outlink CrashLoopBackOff issue - https://phabricator.wikimedia.org/T370408#9997999 (kevinbazira)
[10:31:11] oh, no image builds were triggered
[10:42:48] that is reasonable as it isn't included as a trigger file/path for the pipeline build
[10:43:53] I'll be issuing a dummy patch to build outlink images, but we'll need to implement this change in integration/config for all images (build images whenever model_server_entrypoint.sh changes)
[10:46:00] (PS1) Ilias Sarantopoulos: outlink: trigger new build [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055407 (https://phabricator.wikimedia.org/T370408)
[10:46:48] (CR) Kevin Bazira: [C:+1] outlink: trigger new build [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055407 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[10:53:00] (CR) Ilias Sarantopoulos: [C:+2] outlink: trigger new build [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055407 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[10:56:32] (Merged) jenkins-bot: outlink: trigger new build [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055407 (https://phabricator.wikimedia.org/T370408) (owner: Ilias Sarantopoulos)
[11:11:38] Machine-Learning-Team, Patch-For-Review: Fix articletopic-outlink CrashLoopBackOff issue - https://phabricator.wikimedia.org/T370408#9998128 (isarantopoulos) The service is up and running in staging and works as expected. The production service is still running an older image that works fine and as this...
[11:11:58] deployed in staging, all good!
[11:38:10] nice! good job :)
[11:38:17] \o/
[11:52:36] teamwork!
[12:51:52] Machine-Learning-Team, Content-Transform-Team, Research, Patch-For-Review: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#9998349 (Isaac) @isarantopoulos your summary is accurate - thanks! We can adjust the model card schema if it's non-standard and you want to adju...
[12:54:28] let me know if anyone needs anything else for today
[12:54:32] I plan to finish some work I'm doing on articlequality and log off a bit earlier today
[12:56:22] I've been building the wrong image for the last half hour and wondering why I don't see the correct contents in it
[13:07:39] Morning all!
[13:10:22] oh hey chris
[13:13:08] o/ Chris
[13:39:48] chrisalbon: the word of the day is "Crowdstrike" :)
[13:47:14] FIRING: ErrorBudgetBurn: - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:48:01] hi Chris!
[13:52:24] Investigating ^^^
[13:53:19] the ErrorBudgetBurn stuff is from Pyrra, Keith has deployed the config for enwiki-articlequality
[13:53:25] ah
[13:53:58] we don't really have those alarms yet, I think they wanted to verify holes in the recording rules etc.
[13:54:14] follow up with Keith in case, not sure what plans they have
[13:55:37] ack
[14:04:52] ack
[14:05:17] Going afk folks, have a nice weekend!
[14:08:13] \o enjoy!
[14:12:07] have a nice weekend Ilias!
[15:15:54] I'm heading out as well, the heat is killing me
[15:16:04] \o enjoy your weekends, everyone.
[15:39:37] bye Tobias :) see you next week o/