[06:26:22] good morning!
[08:10:47] isaranto: o/
[08:10:47] In the locust load tests (see link below), I noticed `.input` is used on `.tsv` files. Is there a specific reason for this?
[08:10:47] https://github.com/wikimedia/machinelearning-liftwing-inference-services/tree/main/test/locust/inputs
[08:10:47] I am thinking that using `.tsv`, just like we do in the data dir for model-servers, would be more consistent. wdyt?
[08:15:37] o/ no reason, I just copy-pasted the input files used in wrk (so kept the same name). You're right though, having the filetype suffix (tsv, csv etc) is much better and more descriptive
[08:17:21] I'm not sure why we were using tsv and not csv in the first place (could have sth to do with original wrk code expecting data in that format)
[08:17:54] thanks for the clarification. I'll prepare a patch to fix this.
[08:30:44] FIRING: LiftWingServiceErrorRate: ...
[08:30:49] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=fiwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:34:09] noooo
[08:47:20] getting timeouts through API GW
[08:47:21] ```
[08:47:21] curl https://api.wikimedia.org/service/lw/inference/v1/models/fiwiki-damaging:predict -X POST -d '{"rev_id": 12345}'
[08:47:21] ```
[08:47:38] cpu seems throttled https://grafana.wikimedia.org/goto/K5UpRtmHR?orgId=1
[08:51:35] kevinbazira: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1080555 I enabled mp for fiwiki-damaging (was the one that had an outage on a weekend ~2 weeks ago)
[08:52:42] could you deploy that to eqiad plz as I'm ready to go into a meeting? Thanks!
[08:53:51] okok. had +1'ed but let me deploy now ...
[08:54:51] Thanks!
[09:05:44] RESOLVED: LiftWingServiceErrorRate: ...
[09:05:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=fiwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:11:57] fiwiki-damaging pods are up and running after deployment: https://phabricator.wikimedia.org/P70135
[09:58:52] back!
[09:59:10] kevinbazira: great! thanks for deploying this
[09:59:28] np
[10:30:00] Lift-Wing, Machine-Learning-Team: Request to host the Reference Risk Model on LiftWing - https://phabricator.wikimedia.org/T372405#10233107 (achou) The reference-risk model is now in production! It's paired with the reference-need model under a single service called "reference-quality". This coupling ref...
[10:31:48] (PS1) Kevin Bazira: locust: add article-country load test [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1080592 (https://phabricator.wikimedia.org/T371897)
[10:44:54] Lift-Wing, Machine-Learning-Team: Request to host the Reference Need Model on LiftWing - https://phabricator.wikimedia.org/T371902#10233144 (achou) A new reference-need model has been deployed to production. This model uses a distilled version of multilingual BERT and dynamic quantization, which impr...
[10:44:56] Lift-Wing, Machine-Learning-Team: Request to host the Reference Risk Model on LiftWing - https://phabricator.wikimedia.org/T372405#10233146 (achou) Open→Resolved
[10:44:58] Lift-Wing, Machine-Learning-Team: Request to host the Reference Need Model on LiftWing - https://phabricator.wikimedia.org/T371902#10233147 (achou) Open→Resolved
[10:45:00] Lift-Wing, Machine-Learning-Team: Log and export preprocess size in inference services as a prometheus metric - https://phabricator.wikimedia.org/T374034#10233148 (achou) a:achou
[11:09:10] (CR) AikoChou: [C:+1] "Thanks for working on this, Kevin! It would be nice if we started adding a results table for easier reading, like the reference quality mo" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1080592 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[11:12:39] Good morning all
[11:18:41] (CR) Kevin Bazira: [V:+2 C:+2] "Great. Thank you for the suggestion. I'll push a patch for this." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1080592 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[11:23:01] morning Chris o/
[12:03:12] \o
[13:59:47] Machine-Learning-Team: the error message from gapfinder service and the hatnote on mediawikiwiki both refer to a deleted rev - https://phabricator.wikimedia.org/T377331 (jeremyb-phone) NEW
[14:07:57] Machine-Learning-Team: the error message from gapfinder service and the hatnote on mediawikiwiki both refer to a deleted rev - https://phabricator.wikimedia.org/T377331#10233687 (jeremyb-phone) example was already removed from README https://gerrit.wikimedia.org/r/plugins/gitiles/research/recommendation-api/...
[14:10:43] Machine-Learning-Team: the error message from gapfinder service refers to a deleted rev - https://phabricator.wikimedia.org/T377331#10233701 (jeremyb-phone)
[14:16:02] Machine-Learning-Team: the error message from gapfinder service refers to a deleted rev - https://phabricator.wikimedia.org/T377331#10233734 (jeremyb-phone) a more recent hatnote refers to https://fa.wikipedia.org/wiki/Special:Diff/39750822 so only the error message needs fixing still, maybe futureproof wit...
[14:22:24] Machine-Learning-Team: the error message from gapfinder service refers to a deleted rev - https://phabricator.wikimedia.org/T377331#10233754 (Isaac) thanks for reporting -- in trying to update I may have taken the service down and I'm not fully sure why. I'll look into it further but if it turns out to be a...
[16:18:05] going afk folks, have a nice evening/rest of day
[16:55:35] night Ilias!
[17:06:54] Lift-Wing, Machine-Learning-Team: Request to host the Reference Need Model on LiftWing - https://phabricator.wikimedia.org/T371902#10234882 (diego) Thanks @achou , and also to @Aitolkyn and @MunizaA , you all did amazing work on making this model faster! The speedup is really impressive and you used...
[18:37:09] Machine-Learning-Team: the error message from gapfinder service refers to a deleted rev - https://phabricator.wikimedia.org/T377331#10235344 (Isaac) seems to be up again with the new message now. I appreciate the suggestion for the more long-term fix but I'd actually like to take this down in the not-so-distant...