[07:37:53] hello!
[08:57:23] morning!
[09:12:03] hola!
[10:28:28] (PS1) AikoChou: readability: updates according to the new TRank model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1059032 (https://phabricator.wikimedia.org/T369712)
[10:30:17] (PS13) Ilias Sarantopoulos: articlequality: update to ordinal regression from statsmodels [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055177 (https://phabricator.wikimedia.org/T360455)
[10:32:21] aiko: o/ kserve 13.1 I missed that one, nice!
[10:36:29] isaranto: released 4 days ago :)
[10:37:12] yeeah fresh stuff
[10:50:27] (CR) Ilias Sarantopoulos: readability: updates according to the new TRank model (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1059032 (https://phabricator.wikimedia.org/T369712) (owner: AikoChou)
[11:43:09] isaranto: o/ thanks for the review :)
[11:43:09] going to deploy rec-api on staging
[11:47:48] o/ kevinbazira
[11:48:48] ack
[12:01:44] FIRING: LiftWingServiceErrorRate: ...
[12:01:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=frwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:03:50] I'm on it!
[12:06:44] RESOLVED: LiftWingServiceErrorRate: ...
[12:06:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=frwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:07:07] the rec-api deployment is failing on LiftWing staging: https://phabricator.wikimedia.org/P67193
[12:07:11] it is pretty clear I solved it in 2 minutes right ?
[12:07:12] :P
[12:08:00] I'm digging in the logs though to see if this is the same case as it was
[12:08:09] kevinbazira: do u need any help with that?
[12:10:40] isaranto: yep. the deployment times out and rolls back to a previous state.
[12:11:09] ok! lemme finish up with the log digging and I'm taking a look in 5 minutes
[12:11:20] okok
[12:20:29] well the alert was due to the same thing with extractor cache https://logstash.wikimedia.org/goto/295b4de42cd4afc93a0a32364746dbf1 (just pointing to a specific one, although there are multiple ones, coming from ores-legacy)
[12:21:09] kevinbazira: does helmfile diff work properly?
[12:21:48] yes, the diff works as expected and shows the new changes from the most recent patch
[12:23:46] Machine-Learning-Team, Content-Transform-Team, Research, Patch-For-Review: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#10035064 (isarantopoulos) I've uploaded the model on swift and in the public [[ https://analytics.wikimedia.org/published/wmf-ml-models/arti...
[12:23:56] ok, let me have a look
[12:30:15] when I run `helmfile -e ml-staging-codfw diff` the config shows that a `uwsgi command` still exists yet I removed it from the staging config:
[12:30:15] https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/ml-services/recommendation-api-ng/values-ml-staging-codfw.yaml
[12:30:30] the `uwsgi command` exists in the prod config but I didn't expect this to affect staging:
[12:30:30] https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/ml-services/recommendation-api-ng/values.yaml
[12:37:36] so diff doesn't work as expected then :)
[12:39:02] what happens is the following: the helmfile entries in the dep-charts are yaml dictionaries so the end result is the union of the dicts (prod + staging). In order to make it work as expected we need to override any values we want in the ml-staging-codfw.yaml
[12:40:25] hmm w8 what I'm saying is true for inf-services, lemme check if this is the case with python-webapp
[12:49:44] but iiuc this shouldn't affect the sync command but it would end up running a different command on the pod when it runs
[12:57:41] kevinbazira: for the wrong entrypoint I think this should do it for now https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1059071
[12:58:19] and we can remove it when we go to prod (it won't hurt if it stays there but it would be better to remove it so that the command from blubber is always used)
[12:58:40] still looking at the failed sync command
[13:00:33] ok. let's give this a shot. my understanding is that the container has the poetry entrypoint by default.
[13:01:03] yes, exactly.
but the prod values override and these values are also used in the staging deployment
[13:01:37] sorry I misread your comment above
[13:02:00] if we remove the command + args entries from prod and staging only then the default poetry entrypoint will be used
[13:02:39] to sum up, defining a command in a deployment overrides the default entrypoint that the image has
[13:03:46] if you look at the CI output the command from prod is overridden by the new value in staging https://integration.wikimedia.org/ci/job/helm-lint/19541/console
[13:04:15] I'm trying a sync again but I don't think it will work
[13:04:41] it worked! (I'm puzzled)
[13:06:58] super! the pod is up and running.
[13:07:13] let me query the api ...
[13:09:33] docs working as expected when I run `curl https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/docs`
[13:10:04] that's nice1
[13:10:07] *!
[13:29:58] the translation endpoint is running: https://phabricator.wikimedia.org/P67194
[13:29:58] isaranto: thanks for your help with the deployment config issue.
[13:30:23] nice work, happy to help anytime!
[13:30:40] <3
[13:38:32] Morning all
[13:43:57] Morning Chris o/
[13:46:54] Machine-Learning-Team: Deploy Modernized Recommendation API to LiftWing - https://phabricator.wikimedia.org/T371465#10035486 (kevinbazira) The modernized recommendation API has been deployed on the LiftWing staging. It is currently available through an internal endpoint that can only be accessed by tools tha...
[14:01:09] o/
[14:45:04] Machine-Learning-Team, Content-Transform-Team, Research, Patch-For-Review: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#10035698 (Isaac) For testing purposes, this API should be hosting the same model so should match LiftWing outputs: https://misalignment.wmcl...
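Note on the values merge discussed between 12:39:02 and 13:03:46: the behaviour described there (staging inherits every prod key it does not explicitly override, so a `command` defined in prod still reaches the staging pod) can be illustrated with a small recursive dictionary merge. This is only a sketch of the idea, not Helm's or helmfile's actual implementation. The key names (`app`, `command`, `args`, `port`), the args, and the `staging-command` placeholder are hypothetical and do not reflect the real chart schema.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict: nested dicts merge key by key, anything else is replaced."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists from the override replace the base value wholesale.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-in for values.yaml (prod defaults).
prod_values = {"app": {"command": ["uwsgi"], "args": ["--hypothetical-flag"], "port": 8080}}

# Staging file that simply omits the command: the prod command is inherited.
staging_without_override = {"app": {"port": 8081}}

# Staging file that sets its own command and args: the staging values win.
staging_with_override = {"app": {"command": ["staging-command"], "args": [], "port": 8081}}

inherited = deep_merge(prod_values, staging_without_override)
overridden = deep_merge(prod_values, staging_with_override)

print(inherited["app"]["command"])   # still the prod command
print(overridden["app"]["command"])  # the staging command
```

In this sketch, deleting a key from the staging file does not delete it from the merged result, which matches what the `helmfile diff` output showed. Only removing the key from both files, or overriding it in staging, changes what the pod runs.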
[15:58:02] (CR) Ilias Sarantopoulos: [C:+2] "After chatting with Isaac, I'm merging this so we can test it on experimental ns" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055177 (https://phabricator.wikimedia.org/T360455) (owner: Ilias Sarantopoulos)
[16:01:14] (Merged) jenkins-bot: articlequality: update to ordinal regression from statsmodels [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1055177 (https://phabricator.wikimedia.org/T360455) (owner: Ilias Sarantopoulos)
[16:25:57] I have a patch for deploying the new articlequality in experimental ns in ml-staging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1059043
[16:25:58] It isn't urgent, I already deployed it and it works, so feel free to merge it when you review it
[16:29:41] (PS1) Ilias Sarantopoulos: (WIP) huggingface: add flash attention 2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1059119 (https://phabricator.wikimedia.org/T371344)
[16:30:50] (CR) CI reject: [V:-1] (WIP) huggingface: add flash attention 2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1059119 (https://phabricator.wikimedia.org/T371344) (owner: Ilias Sarantopoulos)
[16:47:38] going afk folks, have a nice evening/rest of day o/
[21:24:31] Machine-Learning-Team, OKR-Work: Deploy Modernized Recommendation API to LiftWing - https://phabricator.wikimedia.org/T371465#10036853 (ldelench_wmf)