[04:41:40] (PS1) Santhosh: Avoid duplicate slash in URLs for cxserver API [research/recommendation-api] - https://gerrit.wikimedia.org/r/1061662 (https://phabricator.wikimedia.org/T371465)
[06:13:46] (CR) Kevin Bazira: [C:+2] "LGTM!" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1061662 (https://phabricator.wikimedia.org/T371465) (owner: Santhosh)
[06:14:26] (Merged) jenkins-bot: Avoid duplicate slash in URLs for cxserver API [research/recommendation-api] - https://gerrit.wikimedia.org/r/1061662 (https://phabricator.wikimedia.org/T371465) (owner: Santhosh)
[06:19:13] Good morning o/
[06:19:27] Back for a week!
[08:12:04] Morning Ilias, hope you got some pre-relaxing going last week :)
[08:14:29] (PS1) Kevin Bazira: Makefile: update readability model path for local-run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061948 (https://phabricator.wikimedia.org/T369712)
[08:22:50] morning Tobias! I did mentally at least as I hiked a ton
[08:24:02] came back to wildfires near Athens, atmosphere is awful outside :(
[08:27:16] Aw. Do people mask for the PM2.5 etc?
[08:29:09] most ppl do nothing as far as I've seen. I just wear a mask when going outdoors
[08:30:11] (CR) Ilias Sarantopoulos: [C:+1] Makefile: update readability model path for local-run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061948 (https://phabricator.wikimedia.org/T369712) (owner: Kevin Bazira)
[08:30:51] kevinbazira: o/ thanks for adding the articlequality model in the Makefile!
[08:31:44] FIRING: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=ores-legacy&var-backend=ores-legacy-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:32:15] and here we go
[08:32:27] isaranto: o/ welcome back! :)
[08:32:31] klausman: o/
[08:32:57] "good morning" from alert manager
[08:32:58] Hey kevin :)
[08:33:19] (CR) Kevin Bazira: [C:+2] Makefile: update readability model path for local-run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061948 (https://phabricator.wikimedia.org/T369712) (owner: Kevin Bazira)
[08:33:20] isaranto: ores-legacy is serving 500s :-/
[08:34:10] it seems accessible though
[08:34:52] | ValueError: invalid literal for int() with base 10: '299380.0'
[08:35:01] I'll pastebin the backtrace
[08:35:34] https://phabricator.wikimedia.org/P67271
[08:36:21] `GET /v3/scores/enwiki/?models=drafttopic&revids=299380.0` <- someone is querying the API with a float-like number as a revid
[08:36:54] thanks, I'm on it!
[08:36:59] I guess we should be handling that more gracefully with a 400 or sth
[08:37:18] yeah I agree, I'm going to make it a 400 ok?
[08:38:35] Yeah, that's probably best. I'll see if I can find a better-fitting code
[08:39:53] Yeah, 400 is best
[08:40:38] we could transform this to an integer and serve the request but I think it is best to return a 400 as this should be fixed by the client making the request
[08:41:56] Yes, agreed
[08:42:29] Because if you accept 100.0, you should also accept 1.0e2
[08:42:39] And that way, madness lies.
[08:43:25] yup. tbh I should have added some validation from the beginning
[08:48:36] (CR) Kevin Bazira: [V:+2 C:+2] Makefile: update readability model path for local-run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061948 (https://phabricator.wikimedia.org/T369712) (owner: Kevin Bazira)
[08:56:43] (PS1) Ilias Sarantopoulos: (WIP) ores-legacy: validate integers in revids list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061953
[08:58:05] ok, it seems that it's only the specific path that doesn't have validation for integer revids. for example this returns a 422 https://ores.wikimedia.org/v3/scores/enwiki/12345.0/articlequality
[08:58:23] (CR) CI reject: [V:-1] (WIP) ores-legacy: validate integers in revids list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061953 (owner: Ilias Sarantopoulos)
[09:04:41] (PS2) Ilias Sarantopoulos: (WIP) ores-legacy: validate integers in revids list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061953
[09:05:28] (CR) CI reject: [V:-1] (WIP) ores-legacy: validate integers in revids list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061953 (owner: Ilias Sarantopoulos)
[09:09:05] (PS3) Ilias Sarantopoulos: (WIP) ores-legacy: validate integers in revids list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061953
[09:09:22] I'm testing the above, will ping you when I'm ready for a review!
[09:10:54] roger
[09:48:26] (PS4) Ilias Sarantopoulos: (WIP) ores-legacy: validate integers in revids list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061953
[10:14:37] (PS5) Ilias Sarantopoulos: ores-legacy: validate integers in revids list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061953
[10:15:17] ready!
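Editor's note: the validation discussed above (reject `299380.0` and `1.0e2` instead of coercing them) could look roughly like the sketch below. This is not the code from the Gerrit change. The function name `parse_revids`, the `InvalidRevIdError` type, and the pipe separator are all assumptions made for illustration; the log only shows a single `revids` value. An API layer would be expected to catch the error and answer with a 400.

```python
import re

# Only plain ASCII digits count as a revid, so "299380.0" and "1.0e2" are rejected.
_REVID_RE = re.compile(r"[0-9]+")


class InvalidRevIdError(ValueError):
    """Hypothetical error type; the API layer would translate it into an HTTP 400."""


def parse_revids(raw: str, sep: str = "|") -> list[int]:
    """Parse a revids query value into integers, rejecting anything float-like.

    Assumption: multiple revids arrive as one string joined by `sep`.
    """
    revids = []
    for token in raw.split(sep):
        token = token.strip()
        if not _REVID_RE.fullmatch(token):
            raise InvalidRevIdError(f"revid must be a plain integer, got {token!r}")
        revids.append(int(token))
    return revids
```

Rejecting the value outright, rather than calling `int(float(token))`, matches the reasoning in the channel: the client has to fix its request, and the service never has to decide which float spellings are acceptable.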
[10:15:22] Looking
[10:15:24] (CR) CI reject: [V:-1] ores-legacy: validate integers in revids list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061953 (owner: Ilias Sarantopoulos)
[10:16:13] ah, formatting, everyone's fave CI failure
[10:16:32] (PS6) Ilias Sarantopoulos: ores-legacy: validate integers in revids list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061953
[10:17:00] :)
[10:17:34] (CR) Klausman: [C:+1] ores-legacy: validate integers in revids list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061953 (owner: Ilias Sarantopoulos)
[10:17:56] LGTM!
[10:22:50] (PS2) Santhosh: Add support for using both topic and seed filters [research/recommendation-api] - https://gerrit.wikimedia.org/r/1060333
[10:23:18] * klausman lunch
[10:35:17] (CR) Ilias Sarantopoulos: [C:+2] ores-legacy: validate integers in revids list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061953 (owner: Ilias Sarantopoulos)
[10:35:35] I'm going to deploy it to resolve the issue
[10:35:49] here you can see the requests causing the issue -> https://logstash.wikimedia.org/goto/d4e15a3929aa61c6d7157c0de7bbbe00
[10:36:07] (Merged) jenkins-bot: ores-legacy: validate integers in revids list [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1061953 (owner: Ilias Sarantopoulos)
[10:41:56] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1061970
[11:05:58] deployed staging and prod - everything seems ok https://ores.wikimedia.org/v3/scores/enwiki/?models=drafttopic&revids=299380.0
[11:06:44] RESOLVED: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=ores-legacy&var-backend=ores-legacy-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:07:47] thanks for the reviews!
[11:15:29] * isaranto afk lunch!
[11:28:52] Machine-Learning-Team, Structured-Data-Backlog (Current Work): [SPIKE] Send an image thumbnail to the logo detection service - https://phabricator.wikimedia.org/T364551#10057261 (mfossati) In progress→Resolved ` curl -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' 'https://commons.wik...
[11:43:34] Machine-Learning-Team, MW-1.43-notes (1.43.0-wmf.17; 2024-08-06), OKR-Work: Deploy Modernized Recommendation API to LiftWing - https://phabricator.wikimedia.org/T371465#10057281 (kevinbazira) @santhosh, thank you for pushing the fix to remove a trailing slash from the cxserver API. We have deployed t...
[12:14:18] I've looked around logstash, and of course the agent is YourAppName/1.0 (yourname@example.com)
[12:14:37] Maybe we should just always deny requests with that UA? To force people to actually populate it
[12:17:33] Machine-Learning-Team: Reorganize LiftWing isvcs repo structure to improve maintainability - https://phabricator.wikimedia.org/T369344#10057375 (kevinbazira)
[12:17:56] I'm trying to find the doc reference for UA. iirc it is the recommended way to access LW so that you can get support
[12:18:25] Remember that we also have https://foundation.wikimedia.org/wiki/Policy:User-Agent_policy, so if any external user doesn't follow it we are ok to block/rate-limit
[12:18:50] hi Luca o/ thanks that was exactly what I was looking for!
[12:22:27] o/
[12:30:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[12:30:49] Deployment nllb-200-gpu-predictor-default-00010-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=nllb-200-gpu-predictor-default-00010-deployment - ...
[12:30:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:31:14] taking a look
[12:32:55] I suggest we just remove the nllb deployments since they aren't being used
[12:33:03] Ah, it's trying to run a second replica, but we of course have only one GPU
[12:33:09] ack
[12:33:36] mh, actually we have two on that machine
[12:37:14] ah, we're out of CPU
[12:37:25] 32m Warning FailedScheduling pod/nllb-200-gpu-predictor-default-00010-deployment-f6f4f4dc8-ktlmn 0/10 nodes are available: 2 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 7 Insufficient amd.com/gpu.
[12:40:01] Got it to schedule on 1001 by cordoning it, bouncing a different pod running there, then uncordoning 1001
[12:40:31] I would have expected nllb to be able to evict a non-GPU pod in this scenario
[12:43:27] ack
[12:43:52] wdyt about deleting these deployments?
[12:44:17] I'll check if they have any traffic but I doubt it
[12:45:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[12:45:49] Deployment nllb-200-gpu-predictor-default-00010-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=nllb-200-gpu-predictor-default-00010-deployment - ...
[12:45:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:56:00] isaranto: which deployments? nllb?
[12:56:17] yes nllb-cpu and gpu
[12:56:44] If they're unused, we should probably delete them. At least in prod
[12:57:01] I'll make a patcj
[12:57:03] patch*
[12:57:38] I can do it, I was just asking if there was any objection to it
[12:58:15] eh, it's just a 2-file delete, I can manage that even in this heat :)
[12:58:36] ok, thanks!
[13:10:50] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1062007
[13:19:02] +1'ed. Looking at logstash, the only relevant traffic is 5-6 requests over the last 6 months (which were done by us/me)
[13:19:49] and all of them were made to staging
[13:20:00] https://logstash.wikimedia.org/goto/cccf95475bc72143d677a4f4e892ff00
[13:34:19] Roger that. Merged and deployed (or undeployed?)
[13:34:52] Danke!
[13:36:45] about the language agnostic articlequality model: I'm trying to think of a meaningful logical grouping of such models so that we avoid creating many many namespaces. I think it would make sense to have an articlemodels or similar where we have all models that score article/revisions
[13:38:09] there is always the issue with models that are already in use and are more difficult to transfer (e.g. outlink or revertrisk), but maybe we can map all of them and have a certain number of namespaces
[13:57:51] The other angle would be to put everything that needs to talk to the same outside services in one NS, but in our case, that set is like 99% identical between services, so it's not a useful delineation
[14:04:26] this makes sense as well from a deployment perspective
[14:14:06] Morning all
[14:14:59] hi Chris1
[14:15:05] *!
[14:15:24] \o
[14:19:12] https://arxiv.org/abs/2406.13843 An interesting overview of how Generative AI is misused for various purposes, like scams and the like
[14:54:49] nice resource!
[14:55:20] I made a patch to update outlink model https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1062036 (we had a pending update before the LLM sprint)
[15:04:40] weebale!
[15:11:06] Machine-Learning-Team, Patch-For-Review: Fix articletopic-outlink CrashLoopBackOff issue - https://phabricator.wikimedia.org/T370408#10058133 (isarantopoulos) Open→Resolved
[16:06:12] (PS1) Ilias Sarantopoulos: (WIP) locust: add articlequality model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1062049 (https://phabricator.wikimedia.org/T360455)
[16:25:06] going afk folks, cu tomorrow!
[16:29:17] Have a good evening Isaranto!
[16:31:59] Machine-Learning-Team, Automoderator, Moderator-Tools-Team: Perform a load test for Multilingual Revert Risk on LiftWing - https://phabricator.wikimedia.org/T372298 (Samwalton9-WMF) NEW
[16:32:44] Machine-Learning-Team, Automoderator, Moderator-Tools-Team: Use multilingual revert risk model in Automoderator on supported wikis - https://phabricator.wikimedia.org/T365581#10058463 (Samwalton9-WMF)
[20:29:13] (PS4) Eamedina: WIP - Community-defined campaign translations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1059945 (https://phabricator.wikimedia.org/T371515)
[20:29:55] (CR) CI reject: [V:-1] WIP - Community-defined campaign translations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1059945 (https://phabricator.wikimedia.org/T371515) (owner: Eamedina)
[20:32:17] (CR) Eamedina: WIP - Community-defined campaign translations (5 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1059945 (https://phabricator.wikimedia.org/T371515) (owner: Eamedina)
[20:44:07] (CR) Eamedina: WIP - Community-defined campaign translations (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1059945 (https://phabricator.wikimedia.org/T371515) (owner: Eamedina)