[07:35:36] <isaranto>	 Good morning! 
[07:35:49] <isaranto>	 I'll review the ref-need model!
[08:05:28] <aiko>	 morning! :)
[08:16:45] <isaranto>	 ο/
[08:44:49] <kevinbazira>	 o/  the RR model-servers that were migrated to the src dir have been:
[08:44:49] <kevinbazira>	 1. tested in staging: https://phabricator.wikimedia.org/P68587
[08:44:50] <kevinbazira>	 2. deployed in prod: https://phabricator.wikimedia.org/P68590
[08:47:18] <wikibugs>	 Machine-Learning-Team: Reorganize LiftWing isvcs repo structure to improve maintainability - https://phabricator.wikimedia.org/T369344#10111882 (kevinbazira)
[08:52:40] <isaranto>	 ο/ Kevin, nice!
[08:58:04] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[08:58:04] <jinxer-wm>	 Deployment revertrisk-multilingual-predictor-default-00021-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[08:58:04] <jinxer-wm>	 https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-multilingual-predictor-default-00021-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:03:04] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment revertrisk-multilingual-predictor-default-00021-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:07:07] <aiko>	 ---^ minReplicas for RRML is set to 5, but only 3 created successfully
[09:09:13] <aiko>	 the old ones are still there. is it because we're running out of resources?
[09:11:13] <kevinbazira>	 probably, I've noticed this warning is firing in eqiad for RRML which currently has:
[09:11:13] <kevinbazira>	 https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/ml-services/revertrisk/values.yaml#L76-L77
[09:11:14] <kevinbazira>	 ```
[09:11:14] <kevinbazira>	 minReplicas: 5
[09:11:14] <kevinbazira>	 maxReplicas: 15
[09:11:14] <kevinbazira>	 ```
[09:11:14] <kevinbazira>	 should we reduce the maxReplicas to 10?
[09:11:36] <isaranto>	 I don't see an issue with resources https://grafana.wikimedia.org/d/pz5A-vASz/kubernetes-resources?orgId=1&var-ds=thanos&var-site=eqiad&var-prometheus=k8s-mlserve
[09:11:47] <isaranto>	 could be due to ns resource limitations
[09:11:55] <aiko>	 for each RRML, 4 cpu and 6G memory are requested
[09:12:21] <klausman>	 Morning!
[09:12:25] <aiko>	 maybe, what is the ns resource limitation?
[09:12:42] <klausman>	 what Aiko said was what I was about to type :)
[09:14:10] <klausman>	 46m         Warning   FailedCreate               replicaset/revertrisk-multilingual-predictor-default-00021-deployment-75bf9dbf64        Error creating: pods "revertrisk-multilingual-predictor-default-00021-deployment4jwqx" is forbidden: exceeded quota: quota-compute-resources, requested: limits.cpu=7, used: limits.cpu=86, limited: limits.cpu=90
[09:17:45] <aiko>	 maybe manually delete the old ones to have more cpu? or we increase the quota
[09:18:07] <klausman>	 I am currently checking whether it's a NS limit vs us actually being out of capacity
[09:18:54] <aiko>	 ack!
[09:21:50] <isaranto>	 according to the k8s resources it isn't a cluster issue  (if I'm reading the grafana dashboard properly)
[09:22:09] <isaranto>	 then again why would codfw not have an issue?
[09:22:22] <isaranto>	 the ns resourcequota says
[09:22:22] <isaranto>	 `quota-compute-resources   476d   requests.cpu: 58500m/90, requests.memory: 71754Mi/100Gi   limits.cpu: 87/90, limits.memory: 85194Mi/100Gi`
[09:23:39] <aiko>	 codfw has the same issue (only 3 new replicas). I don't know why there is no firing 
[09:24:00] <klausman>	 https://alerts.wikimedia.org/?q=%40state%3Dactive&q=team%3Dml shows both firing
[09:28:15] <isaranto>	 it probably is the FIRING: [2x] that indicates this as alerts are grouped
[09:29:49] <klausman>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1070212
[09:30:40] <klausman>	 Note the comment with the kubectl diff
[09:31:49] <klausman>	 hang on, that patch may be wrong
[09:33:17] <isaranto>	 the limitRange(s) refer to pod/container limits while we are interested in total ns quotas, right?
[09:37:07] <klausman>	 yeah, and I should be editing ResourceQuota instead
[09:38:35] <isaranto>	 ack! tbh I don't get how we exceed the limits though; 87/90 seems legit, although close to the threshold
[09:39:01] <isaranto>	 unless the previous deployment messes things up as it occupies cpus that can't be assigned
[09:40:08] <elukey>	 isaranto: I think that 87/90 is before adding a new container/pod (RRML wants 4 cpus, so it would fit the error)
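The point about 87/90 being the usage before the new pod is added can be checked with a small sketch of the quota rule: a ResourceQuota rejects a pod when the amount already used plus the amount requested would exceed the hard limit. The numbers below are the ones in the FailedCreate event quoted earlier (requested `limits.cpu=7`, used 86, limited 90); the `CpuQuota` class and its method names are hypothetical and only illustrate the arithmetic, they are not part of any Kubernetes client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuQuota:
    """Hypothetical model of one ResourceQuota dimension (limits.cpu)."""

    hard: int  # quota ceiling for the namespace
    used: int  # sum of limits.cpu across pods already admitted

    def headroom(self) -> int:
        # CPU still available under the quota.
        return self.hard - self.used

    def admits(self, requested: int) -> bool:
        # A new pod fits only if used + requested stays within the ceiling.
        return self.used + requested <= self.hard


# Values taken from the FailedCreate event in the log.
quota = CpuQuota(hard=90, used=86)
new_pod_limits_cpu = 7

print(quota.headroom())                  # 4
print(quota.admits(new_pod_limits_cpu))  # False: 86 + 7 = 93 > 90
```

So a namespace can look healthy at 86/90 or 87/90 and still reject every new pod, because the remaining headroom is smaller than a single pod's CPU limit.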
[09:40:22] <klausman>	 So the current RQ limits are 90CPU/100Gi, what should we aim for?
[09:40:27] <isaranto>	 ack!
[09:49:15] <klausman>	 Updated the patch, went with a 1/3 increase (roughly)
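As a rough illustration of what a one-third increase buys: the log does not quote the values that went into the patch, so the sketch below simply assumes the 90 CPU ceiling is scaled by 4/3 (giving 120) and reuses the figures from the FailedCreate event (86 CPU used, 7 CPU of limits per new pod). The helper name `additional_pods` is hypothetical.

```python
def additional_pods(hard_cpu: int, used_cpu: int, per_pod_cpu: int) -> int:
    """How many more pods of a given CPU limit fit under the quota."""
    return max(0, (hard_cpu - used_cpu) // per_pod_cpu)


old_hard = 90
new_hard = old_hard * 4 // 3  # assumption: plain 1/3 increase, i.e. 120
used = 86                     # limits.cpu in use, from the event in the log
per_pod = 7                   # limits.cpu requested by one new pod

print(additional_pods(old_hard, used, per_pod))  # 0
print(additional_pods(new_hard, used, per_pod))  # 4
```

Under these assumptions the old quota leaves no room for another pod, while the raised one leaves room for four, which is enough for the two replicas still missing to reach minReplicas of 5.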
[10:01:52] <aiko>	 +1ed but the commit msg should also be changed
[10:02:39] <klausman>	 done :)
[10:03:12] <klausman>	 kevinbazira: isaranto: ok to merge and push?
[10:03:35] <kevinbazira>	 yep.
[10:03:54] <isaranto>	 yep! thanks
[10:03:55] <klausman>	 ty all
[10:04:12] <isaranto>	 I just had a question about the pod :{}. if we omit it is the result the same?
[10:04:37] <klausman>	 The pod:{} part is so the defaults for that subsection are used. Because... YAML
[10:05:11] <klausman>	 Also see revscoring-articlequality, l.177 further up in the file
[10:05:37] <isaranto>	 ok!
[10:09:09] <wikibugs>	 (PS3) AikoChou: reference-need: initial commit [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1070060 (https://phabricator.wikimedia.org/T371902)
[10:10:17] <klausman>	 Looks like they are scheduling now. No  non-running replicas in eqiad.
[10:10:20] <klausman>	 Will also push codfw
[10:13:04] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment revertrisk-multilingual-predictor-default-00021-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:20:38] <klausman>	 Hmm. In codfw the old deployment is still there, despite three pods of the new old one being up and healthy
[10:21:00] <klausman>	 three pods of the new one *
[10:30:39] <klausman>	 Ok, fiddled with min/max replicas and it cleared
[10:30:50] * klausman lunch
[10:32:27] <isaranto>	 thanks Tobias!
[10:33:04] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[10:33:04] <jinxer-wm>	 Deployment revertrisk-multilingual-predictor-default-00022-deployment in revertrisk at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[10:33:04] <jinxer-wm>	 https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-multilingual-predictor-default-00022-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:39:37] <aiko>	 thanks Tobias o/
[10:39:43] * aiko lunch 2
[10:57:12] <kevinbazira>	 danke Tobias!
[10:57:49] <wikibugs>	 (PS1) Ilias Sarantopoulos: articlequality: update output schema [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1070228 (https://phabricator.wikimedia.org/T360455)
[10:59:01] * isaranto lunch!
[11:54:23] <wikibugs>	 (PS4) AikoChou: reference-need: initial commit [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1070060 (https://phabricator.wikimedia.org/T371902)
[12:46:44] <wikibugs>	 (CR) Ilias Sarantopoulos: "I bumped into an issue while trying to install the requirements on macOS with python 3.11. Here is the paste -> https://phabricator.wikime" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1070060 (https://phabricator.wikimedia.org/T371902) (owner: AikoChou)
[13:01:25] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Patch-For-Review: Request to host the Reference Need Model on LiftWing - https://phabricator.wikimedia.org/T371902#10112834 (isarantopoulos) Hi @Aitolkyn! is 1.13.1 the pytorch version that was used during training as shown in the [[ https://gitlab.wikimedia.org/repo...
[13:31:16] <wikibugs>	 Machine-Learning-Team, Content-Transform-Team, Research, Patch-For-Review: Add Article Quality Model to LiftWing - https://phabricator.wikimedia.org/T360455#10112967 (Isaac) @isarantopoulos very exciting thank you!  @FNavas-foundation ^^ for testing and to see what features are available.
[14:00:59] <wikibugs>	 Machine-Learning-Team: Create a Makefile to run locust load tests - https://phabricator.wikimedia.org/T369728#10113132 (kevinbazira) a:kevinbazira
[14:29:55] <wikibugs>	 Machine-Learning-Team, Data-Engineering, Data-Platform-SRE: Investigate Label functionality of AMD GPU device plugin on k8s - https://phabricator.wikimedia.org/T373806#10113262 (isarantopoulos) p:Triage→Medium
[14:30:58] <wikibugs>	 Machine-Learning-Team, Data-Engineering, Data-Platform-SRE: Investigate Label functionality of AMD GPU device plugin on k8s - https://phabricator.wikimedia.org/T373806#10113263 (klausman) p:Medium→Triage a:klausman
[14:49:16] <wikibugs>	 Lift-Wing, Machine-Learning-Team: Request to host reference needed on Lift Wing - https://phabricator.wikimedia.org/T372405#10113385 (isarantopoulos) a:achou
[15:00:08] <wikibugs>	 Machine-Learning-Team: Create a Makefile to run locust load tests - https://phabricator.wikimedia.org/T369728#10113410 (isarantopoulos) p:Triage→Medium
[15:05:27] <wikibugs>	 Lift-Wing, Machine-Learning-Team: Request to host reference needed on Lift Wing - https://phabricator.wikimedia.org/T372405#10113441 (achou) @MunizaA Is this task for reference-risk? Is the title incorrect?
[15:05:50] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Patch-For-Review: Request to host the Reference Need Model on LiftWing - https://phabricator.wikimedia.org/T371902#10113443 (achou) a:achou
[15:07:55] <wikibugs>	 Lift-Wing, Machine-Learning-Team: Request to host reference needed on Lift Wing - https://phabricator.wikimedia.org/T372405#10113450 (achou) a:achou→None
[15:30:52] <wikibugs>	 Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, MediaWiki-Recent-changes, and 2 others: Enable Revert Risk RecentChanges filter on id.wiki - https://phabricator.wikimedia.org/T365701#10113596 (Scardenasmolinar) @Samwalton9-WMF should we create a spike to investigate how to install...
[16:14:49] <wikibugs>	 Lift-Wing, Machine-Learning-Team: Request to host the Reference Risk Model on LiftWing - https://phabricator.wikimedia.org/T372405#10113836 (MunizaA)
[16:20:07] <wikibugs>	 Lift-Wing, Machine-Learning-Team: Request to host the Reference Risk Model on LiftWing - https://phabricator.wikimedia.org/T372405#10113869 (MunizaA) >>! In T372405#10113441, @achou wrote: > @MunizaA Is this task for reference-risk? Is the title incorrect?  @AikoChou correct, this was supposed to be a pl...
[16:31:40] * isaranto afk
[16:34:10] <wikibugs>	 (CR) AikoChou: "Yeah I had the same issue with python 3.11. I was using python 3.9 to run it previously. Let's wait for research's reply. Thanks for follo" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1070060 (https://phabricator.wikimedia.org/T371902) (owner: AikoChou)
[17:52:07] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Patch-For-Review: Request to host the Reference Need Model on LiftWing - https://phabricator.wikimedia.org/T371902#10114447 (MunizaA) Hi @isarantopoulos, the pytorch version was pinned in knowledge-integrity when the transformers dependency was added. I was under the...
[18:06:37] <wikibugs>	 (PS1) Nik Gkountas: WIP: Fetch campaign metadata and return them with recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1070308 (https://phabricator.wikimedia.org/T373132)
[18:08:06] <wikibugs>	 (CR) CI reject: [V:-1] WIP: Fetch campaign metadata and return them with recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1070308 (https://phabricator.wikimedia.org/T373132) (owner: Nik Gkountas)