[06:07:32] (PS5) Kevin Bazira: logo-detection: restrict image processing to trusted domains [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1023542 (https://phabricator.wikimedia.org/T363449)
[06:27:45] Good morning folks, I'm back!
[07:01:16] Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9779681 (isarantopoulos) > We haven't thought of this yet, mainly because pre-processing logic on the model side already handles resizing. That said, I agree it'd...
[07:03:00] if you thought you had seen it all -> https://github.com/learnk8s/xlskubectl 😛
[08:14:34] Morning Ilias! And welcome back :)
[08:14:45] (also shuddering at that xls... thing)
[08:22:14] kevinbazira: the commons/upload change has been deployed to admin_ng in staging. The two serving clusters will follow shortly.
[08:23:04] klausman: o/ thanks
[08:23:30] isaranto: o/ welcome back!
[08:23:52] o/ Tobias & Kevin!
[08:40:27] All the pending changes (extsvc, commons/upload and a few noop/not-affecting-us ones) have been pushed to staging and prod.
[08:59:54] (PS2) Ilias Sarantopoulos: revertrisk: update locust results [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1025805 (https://phabricator.wikimedia.org/T361881)
[09:02:09] (PS3) Ilias Sarantopoulos: revertrisk: update locust results [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1025805 (https://phabricator.wikimedia.org/T361881)
[09:02:55] (PS4) Ilias Sarantopoulos: revertrisk: update locust results [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1025805 (https://phabricator.wikimedia.org/T361881)
[09:03:12] (CR) Ilias Sarantopoulos: "You're right! I set users=2 in order to spawn 1 user per model server." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1025805 (https://phabricator.wikimedia.org/T361881) (owner: Ilias Sarantopoulos)
[10:01:22] * klausman lunch
[10:24:12] * isaranto lunch!
[11:09:51] Machine-Learning-Team: Allow calling revertrisk language agnostic and revert risk multilingual APIs in a pre-save context - https://phabricator.wikimedia.org/T356102#9780308 (XiaoXiao-WMF)
[12:34:42] hello folks!
[12:40:52] \o
[12:41:30] I found https://docs.kernel.org/gpu/drm-internals.html and the following seems interesting
[12:41:34] # echo 0xf > /sys/module/drm/parameters/debug
[12:41:54] it enables debug logging, would it be ok if I tried it on ml-staging2001?
[12:42:00] o/ elukey: is there anything I can check regarding the torch image?
[12:42:49] isaranto: o/ helloooo
[12:42:50] elukey: fine by me re: enabling debugging.
[12:43:29] isaranto: for the moment no, I have some ideas after reading the drm code but I'd need to get more info, not sure if debug helps but we'll see
[12:43:30] especially on staging, the perf impact is probably trivial. We just need to remember to turn it off if/before we do any load testing there.
[12:44:01] (PS8) Ilias Sarantopoulos: utils: slow function execution wrapper [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1024425 (https://phabricator.wikimedia.org/T362663)
[12:45:07] (CR) Ilias Sarantopoulos: "Updated!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1024425 (https://phabricator.wikimedia.org/T362663) (owner: Ilias Sarantopoulos)
[13:10:01] isaranto: what is the procedure that you follow to test a gpu load via python? I need to test multiple drm options, and killing the pod every time is overkill
[13:13:56] Good morning all
[13:14:00] morning!
[13:14:02] hey Chris!
[13:14:24] elukey: it depends on what you want to do but I think that you need to kill the pod anyway :(
[13:14:42] isaranto: basically just trigger the error msg
[13:14:50] even if you just edit the isvc the pod will terminate and start a new one
[13:15:11] Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9780756 (mfossati) @isarantopoulos , totally agree, makes a lot of sense.
[13:15:45] I'm jumping in a meeting but let's talk about it afterwards. I'd like to find a better way as well
[13:18:07] found it :)
[13:18:08] kubectl exec mistral-7b-instruct-gpu-predictor-00006-deployment-98f68d98c4kg -n experimental -- /usr/bin/python3 -c "import torch; torch.cuda.is_available()"
[13:28:09] ah sorry, didn't understand you meant that thing
[13:28:56] yeah another thing I do is attach a shell and explore the pod if I have to
[13:29:23] kubectl exec -it --entrypoint /bin/bash ...
[13:30:34] sure sure
[13:38:39] of course the debug logging doesn't help
[13:42:53] :-(
[14:10:33] klausman: when you found the ioctl line in strace, did you follow the python process right?
[14:10:50] ah ok wait now I get why I cannot repro, I am stupid
[14:10:53] nevermind
[14:10:53] yes, and all forks (strace -fF)
[14:30:46] Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9780991 (kevinbazira) As discussed in today's meeting, adding image objects to the API request significantly increases the payload size. See sample payloads in P620...
[15:20:44] FIRING: LiftWingServiceErrorRate: ...
[15:20:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:27:24] lovely again viwiki
[15:28:41] and very nicely we have two pods, one just autoscaled
[15:30:02] yep again cpu completely saturated
[15:30:06] https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-reverted&var-pod=viwiki-reverted-predictor-default-00017-deployment-5d8d97f5g2xt&var-pod=viwiki-reverted-predictor-default-00017-deployment-5d8d97fcv5j2&var-container=All&from=now-1h&to=now
[15:30:33] at this point reverted needs to scale up even sooner than 10 rps
[15:30:36] sigh
[15:33:15] the second pod didn't help too much
[15:34:29] I suspect even with the add'l pod, whatever hits the saturated pod still gets timeouts.
[15:35:11] yes but the assumption is that the new pods will absorb some of the traffic giving relief to the saturated pod
[15:35:17] with ruwiki it worked
[15:35:44] RESOLVED: LiftWingServiceErrorRate: ...
[15:35:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:36:36] the DC errors went down a little
[15:36:57] I'd vote to just lower the rps threshold for autoscaling to 5 in reverted
[15:37:31] One thing I notice: the first instance (ruwiki) stayed broken until we restarted it, for several hours. But now the broken isvcs seem to recover after a while
[15:38:02] Do you know over what time the 5rps will be evaluated?
[15:38:26] I don't recall, you'd need to check knative's docs
[15:38:33] for the moment I manually bumped the min replicas to 4
[15:38:47] ack.
[15:39:07] I'm fine with lowering the rps limit to 5, at least to see if it helps.
[15:40:07] can you file the patch?
[15:41:27] sure
[15:41:34] (currently in the staff meeting)
[15:42:35] iirc there is a default sliding window for the rps calculation in knative
[15:43:39] ah yes I think it is this one and default seems to be 60s https://knative.dev/docs/serving/autoscaling/scale-bounds/#stable-window
[15:43:45] :+1:
[15:45:12] most of the containers have high cpu usage, I think we'll be able to get more problematic rev-ids from the logs this time
[15:45:58] elukey: do you want me to fold the min replicas=4 change into the patch, or should we hope the rps change is enough?
[15:46:49] the latter should be ok
[15:47:42] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1029222
[15:48:56] klausman: for the moment I'd just tune the reverted namespace, that seems little and confined, if we see it working we can apply to the damaging/goodfaith ones (that are way bigger). Does it sound good?
[15:49:08] SGTM, will update the change
[15:51:07] Updated.
[15:54:15] +1ed thanks
[15:54:52] Will deploy in a hot second
[15:57:11] deployed in server-codfw
[15:57:18] serve*
[15:58:54] I'll also deploy to staging, which will include the 15->10 change for the other isvcs
[16:00:55] Mh, that would also update the docker image used from 2024-04-18-100317 to 2024-04-24-153759. Any objections?
[16:01:21] (both changes for viwiki only)
[16:05:59] thanks Tobias!
[16:06:11] Proceeding!
[16:06:55] And all done
[16:12:14] Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9781301 (isarantopoulos) @mfossati We noticed that the user can define the width in the url like in this example `http://commons.wikimedia.org/w/index.php?title=Spe...
[16:13:35] kevinbazira: need some help if you can shed some light. I can't recall why we need to have "target": "logo" in the request
[16:27:48] going afk folks, have a nice rest of day!
[16:28:21] \o heading out as well
[16:39:13] same, have a nice rest of the day!
[16:53:18] Machine-Learning-Team, Structured-Data-Backlog: Pass image objects to the logo detection service - https://phabricator.wikimedia.org/T363506#9781491 (mfossati) >>! In T363506#9781301, @isarantopoulos wrote: > @mfossati We noticed that the user can define the width in the url like in this example `http://...
[17:25:50] @isaranto: we use `"target": "logo"` to specify the target class because the model can classify images that are album covers, books, logos, and screenshots. see more details in: https://phabricator.wikimedia.org/T352748
[21:19:55] (PS1) Umherirrender: tests: Migrate assertSelect() to SelectQueryBuilder [extensions/ORES] - https://gerrit.wikimedia.org/r/1029283
[23:39:01] (CR) DannyS712: [C:+2] tests: Migrate assertSelect() to SelectQueryBuilder [extensions/ORES] - https://gerrit.wikimedia.org/r/1029283 (owner: DannyS712)
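
Note on the 15:36-15:43 autoscaling discussion: a rough sketch of how lowering the rps target from 10 to 5 could change the replica count when the request rate is averaged over a 60s stable window. This is a simplified illustration, not Knative's actual implementation; it ignores panic mode, target utilization and max-scale bounds. The function name `desired_replicas` and the traffic numbers are hypothetical.

```python
import math


def desired_replicas(window_rps, target_rps_per_pod, min_replicas=1):
    """Simplified RPS-based scaling rule (assumption, not Knative's real code).

    window_rps: total requests/second observed in each second of the window.
    target_rps_per_pod: the per-pod rps target (10 before the patch, 5 after).
    """
    if not window_rps:
        return min_replicas
    # Average the total request rate over the whole window.
    avg_rps = sum(window_rps) / len(window_rps)
    # Enough pods so each one stays at or below the target, never below the floor.
    return max(min_replicas, math.ceil(avg_rps / target_rps_per_pod))


# Hypothetical traffic: 60 one-second samples at a sustained 12 rps.
steady = [12.0] * 60
print(desired_replicas(steady, target_rps_per_pod=10))  # 2
print(desired_replicas(steady, target_rps_per_pod=5))   # 3

# Hypothetical burst: quiet for 30s, then 20 rps for 30s (window average is 10 rps).
burst = [0.0] * 30 + [20.0] * 30
print(desired_replicas(burst, target_rps_per_pod=10))   # 1
print(desired_replicas(burst, target_rps_per_pod=5))    # 2
```

Under these assumptions the burst case shows why a lower target scales up earlier: the window average lags behind the spike, so only the 5 rps target asks for a second pod.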