[06:07:32] <wikibugs>	 (PS5) Kevin Bazira: logo-detection: restrict image processing to trusted domains [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023542 (https://phabricator.wikimedia.org/T363449)
[06:27:45] <isaranto>	 Good morning folks, I'm back!
[07:01:16] <wikibugs>	 Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9779681 (isarantopoulos)  > We haven't thought of this yet, mainly because pre-processing logic on the model side already handles resizing. That said, I agree it'd...
[07:03:00] <isaranto>	 if you thought you had seen it all -> https://github.com/learnk8s/xlskubectl 😛
[08:14:34] <klausman>	 Morning Ilias! And welcome back :)
[08:14:45] <klausman>	 (also shuddering at that xls... thing)
[08:22:14] <klausman>	 kevinbazira: the commons/upload change has been deployed to admin_ng in staging. The two serving clusters will follow shortly.
[08:23:04] <kevinbazira>	 klausman: o/ thanks
[08:23:30] <kevinbazira>	 isaranto: o/ welcome back!
[08:23:52] <isaranto>	 o/ Tobias & Kevin!
[08:40:27] <klausman>	 All the pending changes (extsvc, commons/upload and a few noop/not-affecting-us ones) have been pushed to staging and prod.
[08:59:54] <wikibugs>	 (PS2) Ilias Sarantopoulos: revertrisk: update locust results [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1025805 (https://phabricator.wikimedia.org/T361881)
[09:02:09] <wikibugs>	 (PS3) Ilias Sarantopoulos: revertrisk: update locust results [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1025805 (https://phabricator.wikimedia.org/T361881)
[09:02:55] <wikibugs>	 (PS4) Ilias Sarantopoulos: revertrisk: update locust results [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1025805 (https://phabricator.wikimedia.org/T361881)
[09:03:12] <wikibugs>	 (CR) Ilias Sarantopoulos: "You're right! I set users=2 in order to spawn 1 user per model server." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1025805 (https://phabricator.wikimedia.org/T361881) (owner: Ilias Sarantopoulos)
[10:01:22] * klausman lunch
[10:24:12] * isaranto lunch!
[11:09:51] <wikibugs>	 Machine-Learning-Team: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#9780308 (XiaoXiao-WMF)
[12:34:42] <elukey>	 hello folks!
[12:40:52] <klausman>	 \o
[12:41:30] <elukey>	 I found https://docs.kernel.org/gpu/drm-internals.html and the following seems interesting
[12:41:34] <elukey>	 # echo 0xf > /sys/module/drm/parameters/debug
[12:41:54] <elukey>	 it enables debug logging, would it be ok if I tried it on ml-staging2001?
[12:42:00] <isaranto>	 o/ elukey: is there anything I can check regarding the torch image?
[12:42:49] <elukey>	 isaranto: o/ helloooo
[12:42:50] <klausman>	 elukey: fine by me re: enabling debugging. 
[12:43:29] <elukey>	 isaranto: for the moment no, I have some ideas after reading the drm code but I'd need to get more info, not sure if debug helps but we'll see
[12:43:30] <klausman>	 especially on staging, the perf impact is probably trivial. We just need to remember to turn it off if/before we do any load testing there.
[12:44:01] <wikibugs>	 (PS8) Ilias Sarantopoulos: utils: slow function execution wrapper [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1024425 (https://phabricator.wikimedia.org/T362663)
[12:45:07] <wikibugs>	 (CR) Ilias Sarantopoulos: "Updated!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1024425 (https://phabricator.wikimedia.org/T362663) (owner: Ilias Sarantopoulos)
[13:10:01] <elukey>	 isaranto: what is the procedure that you follow to test a gpu load via python? I need to test multiple drm options, and killing the pod every time is overkill
[13:13:56] <chrisalbon>	 Good morning all
[13:14:00] <elukey>	 morning!
[13:14:02] <isaranto>	 hey Chris!
[13:14:24] <isaranto>	 elukey: it depends on what you want to do but I think that you need to kill the pod anyway :(
[13:14:42] <elukey>	 isaranto: basically just trigger the error msg
[13:14:50] <isaranto>	 even if you just edit the isvc the pod will terminate and start a new one
[13:15:11] <wikibugs>	 Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9780756 (mfossati) @isarantopoulos , totally agree, makes a lot of sense.
[13:15:45] <isaranto>	 I'm jumping in a meeting but let's talk about it afterwards. I'd like to find a better way as well
[13:18:07] <elukey>	 found it :)
[13:18:08] <elukey>	 kubectl exec mistral-7b-instruct-gpu-predictor-00006-deployment-98f68d98c4kg -n experimental -- /usr/bin/python3 -c "import torch; torch.cuda.is_available()"
[13:28:09] <isaranto>	 ah sorry, didn't understand you meant that thing
[13:28:56] <isaranto>	 yeah another thing I do is attach a shell and explore the pod if I have to 
[13:29:23] <isaranto>	 kubectl exec -it ... -- /bin/bash
[13:30:34] <elukey>	 sure sure
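Note on the `kubectl exec` one-liner above: when run with `python3 -c`, the return value of `torch.cuda.is_available()` is discarded, so nothing is printed; the call is only used to exercise the GPU driver path. A minimal sketch of a variant that also reports the result follows. The helper name `gpu_status` is hypothetical and not part of the inference-services repo; it assumes only that `torch` may or may not be importable in the container.

```python
def gpu_status():
    """Return a short, human-readable summary of what torch can see."""
    try:
        import torch  # imported lazily so the script also runs where torch is absent
    except ImportError:
        return "torch not installed"
    # is_available() is the same call used in the one-liner from the chat
    if torch.cuda.is_available():
        return f"GPU available, device count: {torch.cuda.device_count()}"
    return "torch imported, no GPU visible"


print(gpu_status())
```

Saved as a file inside the pod, this can be re-run after each drm option change without restarting the pod.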
[13:38:39] <elukey>	 of course the debug logging doesn't help
[13:42:53] <klausman>	 :-(
[14:10:33] <elukey>	 klausman: when you found the ioctl line in strace, did you follow the python process right?
[14:10:50] <elukey>	 ah ok wait now I get why I cannot repro, I am stupid
[14:10:53] <elukey>	 nevermind
[14:10:53] <klausman>	 yes, and all forks (strace -fF)
[14:30:46] <wikibugs>	 Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9780991 (kevinbazira) As discussed in today's meeting, adding image objects to the API request significantly increases the payload size. See sample payloads in P620...
[15:20:44] <jinxer-wm>	 FIRING: LiftWingServiceErrorRate: ...
[15:20:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:27:24] <elukey>	 lovely again viwiki
[15:28:41] <elukey>	 and very nicely we have two pods, one just autoscaled
[15:30:02] <elukey>	 yep again cpu completely saturated
[15:30:06] <elukey>	 https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-reverted&var-pod=viwiki-reverted-predictor-default-00017-deployment-5d8d97f5g2xt&var-pod=viwiki-reverted-predictor-default-00017-deployment-5d8d97fcv5j2&var-container=All&from=now-1h&to=now
[15:30:33] <elukey>	 at this point reverted needs to scale up even sooner than 10 rps
[15:30:36] <elukey>	 sigh
[15:33:15] <elukey>	 the second pod didn't help too much
[15:34:29] <klausman>	 I suspect even with the add'l pod, whatever hits the saturated pod still gets timeouts.
[15:35:11] <elukey>	 yes but the assumption is that the new pods will absorb some of the traffic giving relief to the saturated pod
[15:35:17] <elukey>	 with ruwiki it worked
[15:35:44] <jinxer-wm>	 RESOLVED: LiftWingServiceErrorRate: ...
[15:35:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:36:36] <elukey>	 the DC errors went down a little
[15:36:57] <elukey>	 I'd vote to just lower the rps threshold for autoscaling to 5 in reverted
[15:37:31] <klausman>	 One thing I notice: the first instance (ruwiki) stayed broken until we restarted it, for several hours. But now the broken isvcs seem to recover after a while
[15:38:02] <klausman>	 Do you know over what time the 5rps will be evaluated?
[15:38:26] <elukey>	 I don't recall, you'd need to check knative's docs
[15:38:33] <elukey>	 for the moment I manually bumped the min replicas to 4
[15:38:47] <klausman>	 ack.
[15:39:07] <klausman>	 I'm fine with lowering the rps limit to 5, at least to see if it helps.
[15:40:07] <elukey>	 can you file the patch?
[15:41:27] <klausman>	 sure
[15:41:34] <klausman>	 (currently in the staff meeting)
[15:42:35] <isaranto>	 iirc there is a default sliding window for the rps calculation in knative
[15:43:39] <isaranto>	 ah yes, I think it is this one and the default seems to be 60s https://knative.dev/docs/serving/autoscaling/scale-bounds/#stable-window
[15:43:45] <klausman>	 :+1:
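Note on the rps threshold discussed above: a rough way to see why lowering the target from 10 to 5 rps makes the service scale up sooner is to average the request rate over the 60s window and divide by the per-pod target. The sketch below is a simplified model under that assumption, not Knative's actual implementation; it ignores the panic window, target utilization, and min/max scale bounds. The function name `desired_replicas` and the sample traffic figures are hypothetical.

```python
import math


def desired_replicas(samples, target_rps, window_s=60):
    """Estimate the pod count from per-second request counts (oldest first).

    Simplified model: average the last `window_s` samples and divide by the
    per-pod rps target, rounding up.
    """
    window = list(samples)[-window_s:]
    if not window:
        return 0
    avg_rps = sum(window) / len(window)
    return math.ceil(avg_rps / target_rps)


# Hypothetical steady traffic of 8 requests per second for one minute.
samples = [8] * 60
print(desired_replicas(samples, target_rps=10))  # 1
print(desired_replicas(samples, target_rps=5))   # 2
```

In this model, traffic of 8 rps stays on a single pod with a target of 10, while a target of 5 already asks for a second pod.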
[15:45:12] <elukey>	 most of the containers have high cpu usage, I think we'll be able to get more problematic rev-ids from the logs  this time
[15:45:58] <klausman>	 elukey: do you want me to fold the min replicas=4 change into the patch, or should we hope the rps change is enough?
[15:46:49] <elukey>	 the latter should be ok
[15:47:42] <klausman>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1029222
[15:48:56] <elukey>	 klausman: for the moment I'd just tune the reverted namespace, that seems little and confined, if we see it working we can apply to the damaging/goodfaith ones (that are way bigger). Does it sound good?
[15:49:08] <klausman>	 SGTM, will update the change
[15:51:07] <klausman>	 Updated.
[15:54:15] <elukey>	 +1ed thanks
[15:54:52] <klausman>	 Will deploy in a hot second
[15:57:11] <klausman>	 deployed in server-codfw
[15:57:18] <klausman>	 serve*
[15:58:54] <klausman>	 I'll also deploy to staging, which will include the 15->10 change for the other isvcs
[16:00:55] <klausman>	 Mh, that would also update the docker image used from 2024-04-18-100317 to 2024-04-24-153759. Any objections?
[16:01:21] <klausman>	 (both changes for viwiki only)
[16:05:59] <isaranto>	 thanks Tobias!
[16:06:11] <klausman>	 Proceeding!
[16:06:55] <klausman>	 And all done
[16:12:14] <wikibugs>	 Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9781301 (isarantopoulos) @mfossati We noticed that the user can define the width in the url like in this example `http://commons.wikimedia.org/w/index.php?title=Spe...
[16:13:35] <isaranto>	 kevinbazira: need some help if you can shed some light. I can't recall why we need to have "target": "logo" in the request
[16:27:48] <isaranto>	 going afk folks, have a nice rest of day!
[16:28:21] <klausman>	 \o heading out as well
[16:39:13] <elukey>	 same, have a nice rest of the day!
[16:53:18] <wikibugs>	 Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9781491 (mfossati) >>! In T363506#9781301, @isarantopoulos wrote: > @mfossati We noticed that the user can define the width in the url like in this example `http://...
[17:25:50] <kevinbazira>	 @isaranto: we use `"target": "logo"` to specify the target class because the model can classify images that are album covers, books, logos, and screenshots. see more details in: https://phabricator.wikimedia.org/T352748
[21:19:55] <wikibugs>	 (PS1) Umherirrender: tests: Migrate assertSelect() to SelectQueryBuilder [extensions/ORES] - https://gerrit.wikimedia.org/r/1029283
[23:39:01] <wikibugs>	 (CR) DannyS712: [C:+2] tests: Migrate assertSelect() to SelectQueryBuilder [extensions/ORES] - https://gerrit.wikimedia.org/r/1029283 (owner: Umherirrender)