[07:35:36] Good morning!
[07:35:49] I'll review the ref-need model!
[08:05:28] morning! :)
[08:16:45] ο/
[08:44:49] o/ the RR model-servers that were migrated to the src dir have been:
[08:44:49] 1. tested in staging: https://phabricator.wikimedia.org/P68587
[08:44:50] 2. deployed in prod: https://phabricator.wikimedia.org/P68590
[08:47:18] Machine-Learning-Team: Reorganize LiftWing isvcs repo structure to improve maintainability - https://phabricator.wikimedia.org/T369344#10111882 (kevinbazira)
[08:52:40] ο/ Kevin, nice!
[08:58:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[08:58:04] Deployment revertrisk-multilingual-predictor-default-00021-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[08:58:04] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-multilingual-predictor-default-00021-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:03:04] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment revertrisk-multilingual-predictor-default-00021-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:07:07] ---^ minReplicas for RRML is set to 5, but only 3 were created successfully
[09:09:13] the old ones are still there. is it because we are running out of resources?
[09:11:13] probably, I've noticed this warning is firing in eqiad for RRML which currently has:
[09:11:13] https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/ml-services/revertrisk/values.yaml#L76-L77
[09:11:14] ```
[09:11:14] minReplicas: 5
[09:11:14] maxReplicas: 15
[09:11:14] ```
[09:11:14] should we reduce the maxReplicas to 10?
[09:11:36] I don't see an issue with resources https://grafana.wikimedia.org/d/pz5A-vASz/kubernetes-resources?orgId=1&var-ds=thanos&var-site=eqiad&var-prometheus=k8s-mlserve
[09:11:47] could be due to ns resource limitations
[09:11:55] for each RRML, 4 cpu and 6G memory are requested
[09:12:21] Morning!
[09:12:25] maybe, what is the ns resource limitation?
[09:12:42] what Aiko said was what I was about to type :)
[09:14:10] 46m Warning FailedCreate replicaset/revertrisk-multilingual-predictor-default-00021-deployment-75bf9dbf64 Error creating: pods "revertrisk-multilingual-predictor-default-00021-deployment4jwqx" is forbidden: exceeded quota: quota-compute-resources, requested: limits.cpu=7, used: limits.cpu=86, limited: limits.cpu=90
[09:17:45] maybe manually delete the old ones to have more cpu? or we increase the quota
[09:18:07] I am currently checking whether it's a NS limit vs us actually being out of capacity
[09:18:54] ack!
[09:21:50] according to the k8s resources it isn't a cluster issue (if I'm reading the grafana dashboard properly)
[09:22:09] then again why would codfw not have an issue?
[09:22:22] the ns resourcequota says
[09:22:22] `quota-compute-resources 476d requests.cpu: 58500m/90, requests.memory: 71754Mi/100Gi limits.cpu: 87/90, limits.memory: 85194Mi/100Gi`
[09:23:39] codfw has the same issue (only 3 new replicas). I don't know why there is no alert firing
[09:24:00] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=team%3Dml shows both firing
[09:28:15] it is probably the FIRING: [2x] that indicates this, as alerts are grouped
[09:29:49] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1070212
[09:30:40] Note the comment with the kubectl diff
[09:31:49] hang on, that patch may be wrong
[09:33:17] the limitRange(s) refer to pod/container limits while we are interested in total ns quotas, right?
[09:37:07] yeah, and I should be editing ResourceQuota instead
[09:38:35] ack! tbh I don't get how we exceed the limits though; 87/90 seems legit, although close to the threshold
[09:39:01] unless the previous deployment messes things up as it occupies cpus that can't be assigned
[09:40:08] isaranto: I think that 87/90 is before adding a new container/pod (RRML wants 4 cpus, so it would fit the error)
[09:40:22] So the current RQ limits are 90CPU/100Gi, what should we aim for?
[09:40:27] ack!
[09:49:15] Updated the patch, went with a 1/3 increase (roughly)
[10:01:52] +1ed but the commit msg should also be changed
[10:02:39] done :)
[10:03:12] kevinbazira: isaranto: ok to merge and push?
[10:03:35] yep.
[10:03:54] yep! thanks
[10:03:55] ty all
[10:04:12] I just had a question about the pod:{}. if we omit it is the result the same?
[10:04:37] The pod:{} part is so the defaults for that subsection are used. Because... YAML
[10:05:11] Also see revscoring-articlequality, l.177 further up in the file
[10:05:37] ok!
[10:09:09] (PS3) AikoChou: reference-need: initial commit [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1070060 (https://phabricator.wikimedia.org/T371902)
[10:10:17] Looks like they are scheduling now. No non-running replicas in eqiad.
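Note on the quota arithmetic discussed above: a minimal Python sketch of the check behind the `exceeded quota` event, assuming the namespace ResourceQuota admits a pod only when used + requested stays at or below the hard cap for `limits.cpu`. The `CpuQuota` class and its method names are hypothetical, and the raised cap of 120 only illustrates a one-third increase over 90; it is not taken from the merged patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuQuota:
    """Namespace-wide cap on the sum of pod CPU limits, in whole CPUs."""

    used: int
    hard: int

    def admits(self, requested: int) -> bool:
        # A new pod is rejected when it would push usage past the hard cap.
        return self.used + requested <= self.hard

    def headroom_pods(self, per_pod: int) -> int:
        # Number of additional pods of this size that still fit under the cap.
        return max(0, (self.hard - self.used) // per_pod)


# Values from the FailedCreate event: requested=7, used=86, limited=90.
current = CpuQuota(used=86, hard=90)
print(current.admits(7))         # 86 + 7 = 93 exceeds 90
print(current.headroom_pods(7))

# Hypothetical cap after a one-third increase: 90 * 4 // 3 = 120.
raised = CpuQuota(used=86, hard=90 * 4 // 3)
print(raised.admits(7))          # 93 fits under 120
print(raised.headroom_pods(7))
```

With the event's numbers the pod is refused even though the cluster itself has spare capacity, which matches the observation that this was a namespace quota limit and not a cluster-wide shortage.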
[10:10:20] Will also push codfw
[10:13:04] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment revertrisk-multilingual-predictor-default-00021-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:20:38] Hmm. In codfw the old deployment is still there, despite three pods of the new old one being up and healthy
[10:21:00] three pods of the new one *
[10:30:39] Ok, fiddled with min/max replicas and it cleared
[10:30:50] * klausman lunch
[10:32:27] thanks Tobias!
[10:33:04] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[10:33:04] Deployment revertrisk-multilingual-predictor-default-00022-deployment in revertrisk at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[10:33:04] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-multilingual-predictor-default-00022-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:39:37] thanks Tobias o/
[10:39:43] * aiko lunch 2
[10:57:12] thanks Tobias!
[10:57:49] (PS1) Ilias Sarantopoulos: articlequality: update output schema [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1070228 (https://phabricator.wikimedia.org/T360455)
[10:59:01] * isaranto lunch!
[11:54:23] (PS4) AikoChou: reference-need: initial commit [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1070060 (https://phabricator.wikimedia.org/T371902)
[12:46:44] (CR) Ilias Sarantopoulos: "I bumped into an issue while trying to install the requirements on macOS with python 3.11. Here is the paste -> https://phabricator.wikime" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1070060 (https://phabricator.wikimedia.org/T371902) (owner: AikoChou)
[13:01:25] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Request to host the Reference Need Model on LiftWing - https://phabricator.wikimedia.org/T371902#10112834 (isarantopoulos) Hi @Aitolkyn! is 1.13.1 the pytorch version that was used during training as shown in the [[ https://gitlab.wikimedia.org/repo...
[13:31:16] Machine-Learning-Team, Content-Transform-Team, Research, Patch-For-Review: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#10112967 (Isaac) @isarantopoulos very exciting thank you! @FNavas-foundation ^^ for testing and to see what features are available.
[14:00:59] Machine-Learning-Team: Create a Makefile to run locust load tests - https://phabricator.wikimedia.org/T369728#10113132 (kevinbazira) a:kevinbazira
[14:29:55] Machine-Learning-Team, Data-Engineering, Data-Platform-SRE: Investigate Label functionality of AMD GPU device plugin on k8s - https://phabricator.wikimedia.org/T373806#10113262 (isarantopoulos) p:Triage→Medium
[14:30:58] Machine-Learning-Team, Data-Engineering, Data-Platform-SRE: Investigate Label functionality of AMD GPU device plugin on k8s - https://phabricator.wikimedia.org/T373806#10113263 (klausman) p:Medium→Triage a:klausman
[14:49:16] Lift-Wing, Machine-Learning-Team: Request to host reference needed on Lift Wing - https://phabricator.wikimedia.org/T372405#10113385 (isarantopoulos) a:achou
[15:00:08] Machine-Learning-Team: Create a Makefile to run locust load tests - https://phabricator.wikimedia.org/T369728#10113410 (isarantopoulos) p:Triage→Medium
[15:05:27] Lift-Wing, Machine-Learning-Team: Request to host reference needed on Lift Wing - https://phabricator.wikimedia.org/T372405#10113441 (achou) @MunizaA Is this task for reference-risk? Is the title incorrect?
[15:05:50] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Request to host the Reference Need Model on LiftWing - https://phabricator.wikimedia.org/T371902#10113443 (achou) a:achou
[15:07:55] Lift-Wing, Machine-Learning-Team: Request to host reference needed on Lift Wing - https://phabricator.wikimedia.org/T372405#10113450 (achou) a:achou→None
[15:30:52] Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, MediaWiki-Recent-changes, and 2 others: Enable Revert Risk RecentChanges filter on id.wiki - https://phabricator.wikimedia.org/T365701#10113596 (Scardenasmolinar) @Samwalton9-WMF should we create a spike to investigate how to install...
[16:14:49] Lift-Wing, Machine-Learning-Team: Request to host the Reference Risk Model on LiftWing - https://phabricator.wikimedia.org/T372405#10113836 (MunizaA)
[16:20:07] Lift-Wing, Machine-Learning-Team: Request to host the Reference Risk Model on LiftWing - https://phabricator.wikimedia.org/T372405#10113869 (MunizaA) >>! In T372405#10113441, @achou wrote: > @MunizaA Is this task for reference-risk? Is the title incorrect? @AikoChou correct, this was supposed to be a pl...
[16:31:40] * isaranto afk
[16:34:10] (CR) AikoChou: "Yeah I had the same issue with python 3.11. I was using python 3.9 to run it previously. Let's wait for research's reply. Thanks for follo" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1070060 (https://phabricator.wikimedia.org/T371902) (owner: AikoChou)
[17:52:07] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Request to host the Reference Need Model on LiftWing - https://phabricator.wikimedia.org/T371902#10114447 (MunizaA) Hi @isarantopoulos, the pytorch version was pinned in knowledge-integrity when the transformers dependency was added. I was under the...
[18:06:37] (PS1) Nik Gkountas: WIP: Fetch campaign metadata and return them with recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1070308 (https://phabricator.wikimedia.org/T373132)
[18:08:06] (CR) CI reject: [V:-1] WIP: Fetch campaign metadata and return them with recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1070308 (https://phabricator.wikimedia.org/T373132) (owner: Nik Gkountas)