[06:47:08] Guten tag!
[06:49:05] :) goedemorgen!
[06:51:30] günaydın!
[06:58:50] günaydın :) efharisto kalimera
[07:22:44] :D
[08:38:50] Machine-Learning-Team, EditCheck: Retrain peacock detection model for production use - https://phabricator.wikimedia.org/T388211#10728911 (OKarakaya-WMF) hey, I don't know if this is the right place to share some thoughts but I have some suggestions. - Dataset insights: - Can we generate some insigh...
[09:35:44] FIRING: LiftWingServiceErrorRate: ...
[09:35:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=plwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:45:44] RESOLVED: LiftWingServiceErrorRate: ...
[09:45:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=plwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:12:48] Machine-Learning-Team, LDAP-Access-Requests, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10729123 (OKarakaya-WMF) hi @Jelto @achou and I...
[10:56:15] * isaranto lunch
[11:03:11] whenever anyone gets a minute, please review:
[11:03:11] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1135054
[11:03:11] and
[11:03:11] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1135153
[11:03:11] thanks!
[11:33:12] kevinbazira: o/ I'm puzzled why we're getting canary event errors.. do you have any idea?
[11:40:45] aiko: o/ not sure why. looking at Ben's comment, the error is:
[11:40:45] ```
[11:40:45] "event 89441aea-dd60-4af4-97ae-e08abc8bafe5 of schema at /mediawiki/page/prediction_classification_change/1.1.0 destined to stream mediawiki.page_revert_risk_prediction_change.v1 is not allowed in stream; mediawiki.page_revert_risk_prediction_change.v1 is not configured."
[11:40:46] ```
[11:40:46] since this was merged: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1133603
[11:40:46] it's likely caused by the changeprop patch not being merged? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1135153
[11:40:46] it's reviewed. merging now ...
[11:42:06] aiko: now that the above changeprop patch is merged, I'll deploy the name change on LW as soon as you review it: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1135054
[11:50:02] ack, +1'd!
[12:00:32] thanks! going to deploy now ...
[12:30:59] thanks for your help, Hugh!
[12:39:37] kevinbazira: could you follow up with Ben and Andrew if the canary event errors have gone after these changes deployed?
[12:40:12] hnowlan: o/ thanks!
[12:44:43] aiko: sure sure
[12:45:15] meanwhile, RRLA is failing to send events to EventGate: https://phabricator.wikimedia.org/P74835
[12:45:15] we experienced this issue with article-country and it was resolved by restarting EventGate: https://phabricator.wikimedia.org/P73242#293629
[12:53:56] do you need help with that?
[12:58:29] ooh so the reason is we haven't restarted eventgate-main. I thought we did. Thank you Luca! :)
[13:12:35] Lift-Wing, Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10729616 (kevinbazira) Using the same ROCm vLLM container from T385173#10726495, I was able to run inference for both the `aya-expanse-8b` and `aya-expanse-32b` models as shown below: =====aya-...
[13:15:00] on a brighter note, I was able to run inference on both `aya-expanse-8b` and `aya-expanse-32b` models using the ROCm vLLM image on ml-lab: https://phabricator.wikimedia.org/T385173#10729616
[13:15:00] the inference speed is slow for `aya-expanse-32b`. I'll look into using multiple GPU inference to see how that performs.
[13:16:57] great work Kevin! could you try requests with different # of output tokens to report the latency?
[13:18:58] also I'm thinking that before looking into a multiGPU serving setting (which we can't really support at the moment) we should look into some benchmarking so that we can report throughput (tokens/second), total rps etc
[13:19:02] wdyt?
[13:19:17] should we look into the AMD benchmark for that?
[13:24:20] (PS1) Ilias Sarantopoulos: articlequality: add async requests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1135721
[13:25:02] (PS2) Ilias Sarantopoulos: articlequality: add async requests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1135721
[13:28:32] isaranto: okok I'll look into benchmarking next ...
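A minimal sketch of the kind of measurement discussed above (latency for different output-token budgets, plus tokens/second), not the team's actual benchmark. Everything here is an assumption for illustration: `fake_generate` is a hypothetical stand-in for a real model call, and `BenchResult` and `benchmark` are made-up names. It does not cover requests per second under concurrent load.

```python
import statistics
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class BenchResult:
    max_tokens: int      # requested output-token budget
    latency_s: float     # median wall-clock seconds per request
    output_tokens: int   # tokens returned by the last request
    tokens_per_s: float  # output_tokens / latency_s


def fake_generate(prompt: str, max_tokens: int) -> List[str]:
    """Hypothetical model call: sleeps about 1 ms per generated token."""
    tokens = []
    for i in range(max_tokens):
        time.sleep(0.001)
        tokens.append(f"tok{i}")
    return tokens


def benchmark(
    generate: Callable[[str, int], List[str]],
    prompt: str,
    token_budgets: Sequence[int],
    repeats: int = 3,
) -> List[BenchResult]:
    """Time `generate` for each token budget and report median latency."""
    results = []
    for max_tokens in token_budgets:
        latencies = []
        n_out = 0
        for _ in range(repeats):
            start = time.perf_counter()
            out = generate(prompt, max_tokens)
            latencies.append(time.perf_counter() - start)
            n_out = len(out)
        latency = statistics.median(latencies)
        results.append(BenchResult(max_tokens, latency, n_out, n_out / latency))
    return results


if __name__ == "__main__":
    for r in benchmark(fake_generate, "Hello", [8, 16, 32]):
        print(f"max_tokens={r.max_tokens} latency={r.latency_s:.3f}s "
              f"throughput={r.tokens_per_s:.1f} tok/s")
```

To use it against a real endpoint, `fake_generate` would be replaced by a function that sends the prompt to the served model and returns the generated tokens.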
[13:34:29] ack
[13:35:16] I submitted a patch that is an improvement for the articlequality service -- I will add more info from the load tests I've run locally and on ml-staging
[13:35:25] I put it as WIP for now and will request a review tomorrow
[14:40:03] Machine-Learning-Team, Item Quality Evaluator, Wikidata, Wikidata.org: Move Wikidata tools to Lift Wing - https://phabricator.wikimedia.org/T343419#10730044 (karapayneWMDE) https://item-quality-evaluator.toolforge.org: updated Oct 19, 2023 https://github.com/wmde/wikidata-constraints-violation-ch...
[16:05:19] * isaranto afk
[17:28:37] artificial-intelligence, Lift-Wing, Machine-Learning-Team, Documentation: Create a tutorial for deploying a model on toolforge - https://phabricator.wikimedia.org/T281317#10731021 (TBurmeister)
[17:32:55] (PS10) AikoChou: edit-check: add SHAP values [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133426 (https://phabricator.wikimedia.org/T387984)
[17:33:03] (CR) CI reject: [V:-1] edit-check: add SHAP values [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133426 (https://phabricator.wikimedia.org/T387984) (owner: AikoChou)
[17:42:52] (PS11) AikoChou: edit-check: add SHAP values [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133426 (https://phabricator.wikimedia.org/T387984)
[18:02:31] (CR) AikoChou: "Hey Kevin, thanks for spotting it. I updated the code and this problem is fixed. :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133426 (https://phabricator.wikimedia.org/T387984) (owner: AikoChou)