[06:26:03] (CR) Kevin Bazira: [C:+2] logo-detection: containerize model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598) (owner: Kevin Bazira)
[06:33:29] Good morning o/
[06:37:24] (Merged) jenkins-bot: logo-detection: containerize model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598) (owner: Kevin Bazira)
[07:19:25] Machine-Learning-Team: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749 (kevinbazira) NEW
[07:20:26] Machine-Learning-Team, Patch-For-Review: Prepare docker image for hosting the logo-detection model-server on LiftWing - https://phabricator.wikimedia.org/T362598#9721447 (kevinbazira)
[07:29:51] Machine-Learning-Team, Patch-For-Review: Prepare docker image for hosting the logo-detection model-server on LiftWing - https://phabricator.wikimedia.org/T362598#9721480 (kevinbazira) The logo-detection model-server has been containerized and added to the CI pipeline which published it successfully t...
[08:37:25] Morning!
[08:40:27] \o
[09:03:13] morning o/
[09:04:37] heya, aiko \o
[09:09:09] heya!
[09:09:10] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9721763 (isarantopoulos) Tried to check the GPU after attaching to the running container and executing the following in a python console. I'm getting the same result: ` >>>...
[09:26:54] Machine-Learning-Team, ORES: Create basic alerts for isvcs to catch outages - https://phabricator.wikimedia.org/T362661#9721798 (klausman) I've experimented a bit on Thanos, and arrived at this query: ` (sum by (destination_canonical_service, prometheus) (rate(istio_requests_total{destination_canonical_...
[09:47:42] There is currently an issue going on with the mwapi, which may affect our services as well. Looking at dashboards etc
[09:49:43] ack
[09:58:56] I think we're good in the sense that there's nothing for us to do (differently). The underlying issue may be addressed for now, but proper resolution will take more time
[10:16:18] klausman: could you help me debug the GPU issue on ml-staging?
[10:16:44] doesn't have to be now ofc!
[10:21:59] Sure, can do!
[10:22:06] maybe after lunch?
[10:25:00] Sure!
[10:25:03] thank you!
[10:35:45] * isaranto lunch!
[10:35:54] ditto :)
[11:01:47] (CR) Matěj Suchánek: Exclude first/only revision on page from scoring (2 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[11:29:18] (PS1) Kevin Bazira: logo-detection: specify model name [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020710 (https://phabricator.wikimedia.org/T362749)
[11:32:59] hello folks
[11:36:13] Machine-Learning-Team, ORES: Create basic alerts for isvcs to catch outages - https://phabricator.wikimedia.org/T362661#9722143 (elukey) There are two kinds of istio metrics - the ones from the gateway and the ones from the sidecars (inbound and outbound). In theory it should be sufficient to check the...
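The GPU check quoted in T357986 above is truncated in the log; the following is a minimal sketch of the kind of PyTorch/ROCm availability check typically run from a Python console inside the container, not the exact commands from the task.

```python
# Hedged sketch: not the actual (truncated) snippet from T357986, just the
# usual availability check. Assumes a ROCm build of PyTorch, where AMD GPUs
# are exposed through the torch.cuda API.
import torch

print(torch.__version__)             # e.g. 2.1.2 for the huggingfaceserver image
print(torch.cuda.is_available())     # False here would match the failure being debugged below
print(torch.cuda.device_count())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```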
[11:39:01] (CR) Ilias Sarantopoulos: [C:+1] logo-detection: specify model name [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020710 (https://phabricator.wikimedia.org/T362749) (owner: Kevin Bazira)
[11:39:14] hello!
[11:43:35] (CR) AikoChou: [C:+1] logo-detection: specify model name [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020710 (https://phabricator.wikimedia.org/T362749) (owner: Kevin Bazira)
[11:47:50] heya Luca!
[11:52:11] (CR) Kevin Bazira: [C:+2] logo-detection: specify model name [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020710 (https://phabricator.wikimedia.org/T362749) (owner: Kevin Bazira)
[11:52:54] (Merged) jenkins-bot: logo-detection: specify model name [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020710 (https://phabricator.wikimedia.org/T362749) (owner: Kevin Bazira)
[11:52:57] isaranto: so what are you trying to do in staging? and how is it failing?
[11:57:47] I'm trying to use it in the same way we have been doing before and I get the failure reported here https://phabricator.wikimedia.org/T357986#9721763
[11:59:21] ok, having a look
[12:02:19] the difference is the following: we are using a different base image, the pytorch image, which has a different rocm version (rocm5.5) than what is installed (rocm5.4.2)
[12:03:08] mmh. that is a likely possibility
[12:03:36] yes, unfortunately this is the only thing I can think of at the moment
[12:05:07] rocm5.4 is about a year old by now
[12:06:13] It may be time to import 5.5 to the wmf repo
[12:07:29] we went with 5.5 as this is what is supported by the pytorch version we want for the huggingfaceserver (pytorch 2.1.2). Otherwise we'd have to build pytorch ourselves
[12:07:56] yeah, I'd otherwise also consider the newer versions of rocm, but I don't want to stray too far
[12:08:53] before jumping to any conclusion, we should verify how/if the k8s worker version of the driver affects the one running inside the container
[12:09:23] in theory the k8s worker one shouldn't count a lot
[12:10:01] another issue could just be that the os user can't access the GPU drivers
[12:10:19] But then the other GPU models we've tried would also fail, no?
[12:10:35] GPU-based models I mean
[12:11:29] it is a different/new image
[12:11:40] ah, good point
[12:12:46] I'm currently trying to figure out what the control point of GPU access is (i.e. what device or what user groups are usually used)
[12:14:05] in the other images (llm image) we install `libdrm-amdgpu1` through blubber. In the pytorch image we install it in the production images https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/+/refs/heads/master/images/amd/pytorch21/Dockerfile.template#5
[12:15:55] Mh, that chain of user modifications is hard to read, gimme a moment to try and understand what it does.
[12:16:28] it is all explained in the readme
[12:16:46] sure, I'm just pasting info and explaining my thoughts
[12:16:59] yep yep :)
[12:17:35] I was adding context since IIRC Tobias was afk when we merged
[12:18:48] I presume we just assume that the correct group for user somebody is also always somebody?
[12:19:12] what do you mean?
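To make the version-mismatch theory above concrete: one way to see which ROCm/HIP release the installed PyTorch wheel targets, and compare it with the ROCm userspace in the image, is the torch.version module. A minimal sketch, assuming a ROCm build of PyTorch:

```python
# Hedged sketch for the rocm5.5-wheel vs rocm5.4.2-userspace mismatch discussed above.
import torch

print(torch.version.hip)   # HIP/ROCm release the wheel was built for, e.g. "5.5.x"; None on CUDA/CPU builds
print(torch.version.cuda)  # None on ROCm builds, set on CUDA builds
```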
[12:19:21] `chown {{ "somebody" | uid }}:{{ "somebody" | uid }} /opt/lib`
[12:19:34] that is what blubber does, we had to replicate it
[12:19:37] note that the group (after `:`) is also using `uid`
[12:19:47] Ack
[12:21:05] In a non-docker system the user would also need to be in the render and video groups.
[12:21:54] yep but we apply a special permission to allow others to read the devices (kfd etc..)
[12:23:11] Where do we do that?
[12:23:26] it is in the puppet config for the amd plugin
[12:24:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/909968 IIRC
[12:24:40] I suspect the issue is between the kubelet and the k8s plugin, ok if I restart them?
[12:24:59] no objections from me
[12:25:01] there are some horrors logged by the kubelet
[12:25:47] isaranto: can you try to delete the pod?
[12:25:58] yes, on it
[12:27:02] done
[12:27:27] I don't see horrors, do you see the gpu now?
[12:27:31] elukey: do you want me to remove it completely for now? cause ofc it is recreated if I delete it
[12:27:56] nono just shake it to see if the gpu gets picked up
[12:28:18] it will take some time for the pod to start
[12:28:24] It's still in the terminating phase
[12:28:38] and until it's gone, there is no GPU for the new one
[12:29:06] (unless it can somehow signal back to the kubelet that it didn't use the GPU after all)
[12:31:45] it is gone now
[12:31:55] Yep, new one is on init
[12:33:50] Goood morning all
[12:33:57] heyo Chris
[12:34:59] isaranto: I have a question about the model download.
[12:35:26] So I see it fetches llm/Mistral-7B-Instruct-v0.2/model-00001-of-00003.safetensors (and two more), but also llm/Mistral-7B-Instruct-v0.2/pytorch_model-00001-of-00003.bin (and two more)
[12:35:37] Do we need both types?
[12:36:49] Also, still has the amdgpu_get_auth error in the logs
[12:37:33] good question! no we don't! but wanted to run some checks before I left just one of them
[12:37:42] sure, no worries.
[12:37:48] safetensors load time is much faster so probably we'll keep that one
[12:37:51] pod is up
[12:38:49] elukey: when you said you saw horrors in the kubelet log, I presume you mean the one accessible via journalctl?
[12:38:57] yep
[12:39:19] Didn't see anything obviously bad in the last few minutes, so those were likely unrelated :-/
[12:39:48] the amd-k8s-plugin is a daemon that has to create a unix socket to be contacted by the kubelet, and in turn it registers itself to the kubelet
[12:40:43] (PS12) Jsn.sherman: Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281)
[12:40:49] I noticed something like
[12:40:49] Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/amd.com_gpu: connect: no such file or directory". Reconnecting...
[12:41:15] Daemon not running?
[12:41:24] plus the k8s plugin daemon seems to restart a lot, at least from what systemctl reports
[12:41:27] not sure why
[12:41:43] isaranto: can you check if the gpu is recognized?
[12:41:46] what's the unit name?
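Since the discussion above keeps coming back to whether the "somebody" user can actually open the GPU device nodes, here is a minimal in-pod sketch of that check. The device paths (/dev/kfd, /dev/dri/render*) come from the conversation; treating O_RDWR as the required open mode is an assumption.

```python
# Hedged sketch: inspect the GPU device nodes from inside the pod and try to
# open them as the container user, to separate "wrong file modes" from other failures.
import glob, os, stat

for path in ["/dev/kfd"] + glob.glob("/dev/dri/render*"):
    st = os.stat(path)
    print(path, oct(stat.S_IMODE(st.st_mode)), "uid:", st.st_uid, "gid:", st.st_gid)
    try:
        fd = os.open(path, os.O_RDWR)   # assumption: O_RDWR is what libdrm/kfd consumers need
        os.close(fd)
        print("  open(O_RDWR): ok")
    except OSError as e:
        print("  open(O_RDWR):", e)
```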
[12:42:07] you can grep for k8s or gpu and you should find it, I always forget it
[12:42:22] ah, amd-k8s-device-plugin.service
[12:42:34] (CR) Jsn.sherman: Exclude first/only revision on page from scoring (3 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[12:42:53] I did a ps ax|grep amd earlier and it wasn't there, so I suspected it may have a more obscure name
[12:43:07] I'm checking
[12:45:47] it is called amd-k8s-device-plugin
[12:46:21] yepyep, found it
[12:47:29] It doesn't seem to be logging anything at all
[12:49:28] I found various stuff searching for gpu via `find / -name "*gpu*"`. I'm doing this by attaching a shell to the running container so I don't have access to a lot of commands
[12:51:34] is it ok if I temporarily test mw-api connectivity in staging? The current testing shouldn't be impacted
[12:51:44] /dev/kfd perms inside the container look right (0o666)
[12:51:46] I need to drop some configs to verify one doubt
[12:52:03] klausman: check also the gpu devices
[12:52:18] aha!
[12:52:24] /dev/dri/render*
[12:52:26] $ ls -l /dev/dri/
[12:52:28] total 0
[12:52:30] crw-rw---- 1 root video 226, 1 Apr 17 12:36 card1
[12:52:32] crw-rw-rw- 1 root 106 226, 128 Apr 17 12:36 renderD128
[12:52:49] would it need access to card1? Or only render*?
[12:53:12] IIRC only render, but let's wait for isaranto to confirm
[12:53:16] if it works or not
[12:53:44] that GID 106 is also odd, but shouldn't break anything because it's 666
[12:53:55] kfd has the same GID
[12:54:50] elukey: to confirm what? sry I'm lost a bit
[12:55:00] isaranto: if the gpu is recognized now or not :)
[12:55:07] on the pod I mean
[12:55:19] The logs for the service mentioned the same permission error, so I doubt it
[12:56:36] the gpu is there in the resources and is assigned to the pod, but from python I get the same errors
[12:57:17] I don't know if I have another way to validate the gpu status from within the pod
[12:59:19] klausman: what service do you mean? To check the logs
[12:59:20] elukey: feel free to work on mw-api connectivity (never answered your previous question about it)
[12:59:22] the kubelet?
[12:59:38] kubectl logs -n exp mistral...
[12:59:47] kubectl logs -n experimental mistral-7b-instruct-gpu-predictor-00005-deployment-5d6676d9lb6d
[12:59:49] okok so the pod also complains, didn't know
[13:00:12] yes, it's among the very first messages during startup
[13:00:31] I see thanks
[13:00:32] `amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)` four times, then it continues to load the model etc
[13:05:34] elukey: so I did something super hacky and copied radeontop and libpciaccess.so into the container and tried to run it. It seems to work, but I am not sure that is a sufficient test to prove anything
[13:06:19] It definitely recognises the GPU as an ARCTURUS model, as well as the right amount of RAM
[13:06:45] klausman: how did you copy binaries?
[13:06:50] docker cp
[13:07:02] docker cp /sbin/radeontop 0374a90bad66:/sbin
[13:07:03] okok, not great but it is staging
[13:07:18] also it is bullseye vs bookworm, but should work anyway
[13:07:35] yeah, hence, hacky.
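On the "another way to validate the gpu status from within the pod" question above, a slightly stronger check than an availability flag is to actually allocate and use a tensor on the device, which exercises the kfd/renderD* path end to end. A minimal sketch, assuming the ROCm PyTorch build shipped in the image:

```python
# Hedged sketch: force a real allocation + compute on the GPU instead of only
# asking torch whether a device exists. "cuda" maps to ROCm/HIP devices here.
import torch

try:
    x = torch.ones(1024, 1024, device="cuda")
    print("GPU compute ok:", (x @ x).sum().item())
except RuntimeError as e:
    print("GPU allocation/compute failed:", e)
```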
I was hoping it'd break and give a more useful error message than what we have so far
[13:10:43] I checked on the nllb pod that we are running on ml-serve1001 with the gpu, and
[13:10:47] crw-rw---- 1 root video 226, 2 Feb 7 14:10 card2
[13:10:49] crw-rw-rw- 1 root 106 226, 129 Feb 7 14:10 renderD129
[13:10:54] so the perms check out
[13:11:06] is that pod's user in the video group maybe?
[13:11:42] it should be somebody, same user that we have elsewhere
[13:12:39] I'm jumping in a call with Mercelis and will meet yall in our meeting later. thanks for all the help!
[13:14:42] One thought: we've been assuming this is a permission error since the message mentions `amdgpu_get_auth`, but it may be a more general failure
[13:16:22] at this point it may be a lot of things, but from libdrm's code it seems that the result should be an fd pointing to the device
[13:16:39] https://github.com/grate-driver/libdrm/blob/master/amdgpu/amdgpu_device.c#L80
[13:16:40] (CR) Ilias Sarantopoulos: [C:+2] update revertrisk-language-agnostic min & desc [extensions/ORES] - https://gerrit.wikimedia.org/r/1014519 (https://phabricator.wikimedia.org/T348298) (owner: Jsn.sherman)
[13:28:04] it's a shame we don't have the errno that is set in that ioctl context, it would tell us if it's actually an EPERM or something else.
[13:28:38] at least we know it's an OS-level error (<0), not a driver error (>0)
[13:29:54] (Merged) jenkins-bot: update revertrisk-language-agnostic min & desc [extensions/ORES] - https://gerrit.wikimedia.org/r/1014519 (https://phabricator.wikimedia.org/T348298) (owner: Jsn.sherman)
[13:38:37] (PS1) AikoChou: revertrisk: add support for base model's payloads in batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020835 (https://phabricator.wikimedia.org/T358744)
[13:54:01] (CR) AikoChou: "Now we have two choices: 1) create a new endpoint for the batch model and don't touch the original RRLA model server or 2) use this patch " [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020835 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou)
[14:25:21] Machine-Learning-Team, Goal: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of isvcs - https://phabricator.wikimedia.org/T362674#9722637 (isarantopoulos)
[14:26:07] Machine-Learning-Team, Goal: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services - https://phabricator.wikimedia.org/T362674#9722640 (isarantopoulos)
[14:49:35] ok I just found out something weird
[14:49:57] in our istio sidecar config, using api-ro.discovery.wmnet:80 (with the implicit proxy to :443) works
[14:50:16] but if we use the new mw-api-int-ro.discovery.wmnet:4460 it doesn't
[14:50:54] I kinda know why the new one doesn't work, in fact there is a solution to make it work
[14:50:58] not sure why the current config works though
[14:52:58] ah okok it seems a matter of using 80 vs specifying a port like 4680
[14:53:01] * elukey sigh
[14:53:08] maybe implicit rules somewhere?
[14:56:07] in theory no, seems more like a bug
[14:58:56] like https://github.com/istio/istio/issues/21914
[14:59:54] hmm.
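For the istio sidecar oddity above (api-ro on the implicit :80 proxy works, mw-api-int-ro on :4460 does not), one way to reproduce the comparison from inside an affected pod is to hit both endpoints and compare the outcome. This is a hedged sketch: the hostnames and ports come from the discussion, but the request path and the Host header are assumptions for illustration only.

```python
# Hedged sketch of the api-ro vs mw-api-int-ro comparison discussed above.
import requests

targets = [
    "http://api-ro.discovery.wmnet/w/api.php?action=query&meta=siteinfo&format=json",
    "http://mw-api-int-ro.discovery.wmnet:4460/w/api.php?action=query&meta=siteinfo&format=json",
]
for url in targets:
    try:
        r = requests.get(url, headers={"Host": "en.wikipedia.org"}, timeout=5)  # Host header assumed
        print(url, "->", r.status_code)
    except requests.RequestException as e:
        print(url, "->", e)
```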
[15:00:00] seems to be another obscure istio "Feature"
[15:00:10] * elukey brb
[15:01:31] I straced the simple test for torch/cuda in Python and in the log I see:
[15:01:38] access("/dev/dri/renderD128", F_OK) = -1 EPERM (Operation not permitted)
[15:02:45] So there is definitely a permission issue
[15:03:22] aha!
[15:09:17] this is wild. F_OK on access() only checks for the existence of a file, not actual access bits
[15:10:29] And even if that is normal, the next syscall is:
[15:10:41] ioctl(7, DRM_IOCTL_GET_CLIENT, 0x7fffeea94eb0) = -1 EACCES (Permission denied)
[15:11:11] I'll pastebin the whole sequence
[15:11:58] https://phabricator.wikimedia.org/P60791
[15:12:48] Note how after the ioctl, we immediately see the error message being emitted
[15:18:12] did you strace directly on the ml-staging node?
[15:18:23] I did the same hack again of copying in the binary
[15:18:39] I don't trust tracing from host into container
[15:18:54] strace should work without the hack, you can use the pid of the container
[15:19:13] (same for perf etc.. I used it in the past)
[15:19:23] if you want to take a look at the log, it's in my homedir there as strace-gpu.out
[15:19:43] Note that it's the whole thing from python starting, so somewhat long
[15:39:00] gotta run an errand, bbiab
[15:40:13] I'm going afk folks, lemme know if I should do/try anything else for the gpu, I'll check later. cu tomorrow!
[15:47:18] Machine-Learning-Team, MW-on-K8s, serviceops, SRE, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9723080 (elukey) Added some thoughts to T353622#9723070, I found out a big can of worms while testing staging :) The upgrade is more complex than...
[15:47:19] Machine-Learning-Team, Patch-For-Review: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing - https://phabricator.wikimedia.org/T353622#9723070 (elukey) Today I found out https://github.com/istio/istio/issues/21914 after a lot of debugging in staging fo...
[15:47:30] I added some thoughts about today's rabbit hole with istio in https://phabricator.wikimedia.org/T353622#9723070
[15:47:45] the move to mw-api-int-ro will be a little more complex than anticipated
[15:53:43] Machine-Learning-Team, Patch-For-Review: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing - https://phabricator.wikimedia.org/T353622#9723267 (elukey) Overall steps: 1) Revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020738 since it is a...
[15:55:32] at least now everything makes kind-of sense
[15:56:15] I'll let others verify that I am not totally wrong before proceeding :D
[16:06:17] I'll send feedback tomorrow morning if that is soon enough? Currently my brain is still in GPUs-and-EPERM mode :D
[16:07:42] even later, not super urgent, I'll prep changes etc..
[16:07:53] ok, good.
[16:08:06] it is just one more service entry, hopefully the repetition of the hostnames will be handled via yaml anchors
[16:11:04] have a nice rest of the day folks! logging off
[16:11:12] seeya
[17:55:13] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9723796 (Q-bit-array) Just to let you know - LiftWing/ORES has crashed again `name=curl output example $ curl https://api.wikimedia.org/service/lw/inference/v1/models/ruwiki-damaging:...
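The "F_OK only checks for existence" point from the strace discussion above can be shown directly from Python; under normal file-permission semantics an existence check should not return EPERM, which is why the error looked like a device/namespace-level denial rather than plain file modes. A minimal sketch:

```python
# Hedged illustration of the access(F_OK) point above: F_OK asks only
# "does the path exist?", while R_OK/W_OK check the actual permission bits.
import os

path = "/dev/dri/renderD128"   # device path taken from the strace output
print("exists (F_OK):", os.access(path, os.F_OK))
print("read/write (R_OK|W_OK):", os.access(path, os.R_OK | os.W_OK))
```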
[18:10:16] o/
[18:11:15] regarding the above message I'm able to get a response https://phabricator.wikimedia.org/P60811
[18:16:44] isaranto: o/ preprocess() latencies are super high :(
[18:16:49] I can't repro either though
[18:17:13] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9723905 (isarantopoulos) @Q-bit-array I just managed to make the above request a couple of times. ` $ curl https://api.wikimedia.org/service/lw/inference/v1/models/ruwiki-damaging:pr...
[18:18:14] yes, given these preprocess latencies a timeout seems possible
[18:18:40] I am going to save logs and then kill the pod
[18:19:18] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9723914 (Q-bit-array) Yes, now it works again. But we had 15-20 minutes of outage just before I posted here.
[18:20:37] 2024-04-17 13:46:26.685 kserve.trace requestId: 8c556b9e-4aec-4ae8-a28a-23a2da6839fd, preprocess_ms: 558.698177338, explain_ms: 0, predict_ms: 1.362085342, postprocess_ms: 0.00166893
[18:20:49] it is surely a bug in revscoring, same behavior
[18:22:19] ok saved logs in /home/elukey/T362503 on deploy1002
[18:22:33] or could it be an mwapi issue?
[18:23:44] Tobias mentioned an ongoing issue with mwapi earlier today but I never asked for more info
[18:25:27] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9723930 (elukey) Saved ruwiki's pod logs to deploy1002:/home/elukey/T362503 I noticed high latency for preprocess(), plus a lot of the following: ` 2024-04-17 18:18:22.227 kserve.tra...
[18:25:29] nono it is revscoring
[18:25:43] even the other time everything was fine
[18:25:49] plus we'd have seen the issue elsewhere
[18:26:19] I checked the chans and no issue of outages
[18:26:22] *no sign
[18:27:09] ok. I will start debugging this early morning then. I downloaded the logs for now locally to check them
[18:29:08] I'm thinking of also just logging every payload from now on so at least next time this happens we have more info
[18:31:27] thanks for deleting the pod Luca. Good night!
[18:31:52] yeah I agree we can probably turn on the json payload log for ruwiki so we are prepared
[18:31:55] thank you too!
[18:33:12] isaranto: still seeing high latency though :(
[18:33:51] the main problem seems to be get_revscoring_extractor_cache
[18:39:08] (PS1) Elukey: revscoring: add flag to log JSON inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020898 (https://phabricator.wikimedia.org/T362503)
[18:39:39] looking at the logs_kserve.log..
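Since several people above are probing ruwiki-damaging by hand, here is a minimal sketch of a timed request of the kind used to reproduce the latency. The :predict endpoint matches the curl quoted in T362503; the rev_id below is an illustrative placeholder, not one of the problematic revisions, and timings obviously depend on the pod state.

```python
# Hedged sketch: time a single ruwiki-damaging prediction request, mirroring
# the curl checks quoted above. rev_id is a placeholder value.
import json, time, urllib.request

url = "https://api.wikimedia.org/service/lw/inference/v1/models/ruwiki-damaging:predict"
payload = json.dumps({"rev_id": 123456789}).encode()
req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})

t0 = time.time()
with urllib.request.urlopen(req, timeout=30) as resp:
    body = json.load(resp)
print(f"took {time.time() - t0:.2f}s ->", json.dumps(body)[:200])
```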
yeah there are a lot of KeyErrors in revscoring/extractors/api/util.py
[18:40:17] (CR) CI reject: [V:-1] revscoring: add flag to log JSON inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020898 (https://phabricator.wikimedia.org/T362503) (owner: Elukey)
[18:40:32] and a lot of "Missing resource for rev-id xxx" >> we can see which rev-id
[18:41:08] ouch, yes we have high latency still
[18:43:58] (PS2) Elukey: revscoring: add flag to log JSON inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020898 (https://phabricator.wikimedia.org/T362503)
[18:44:38] aiko: afaics the latest requests have high preprocess latency without any log :(
[18:45:21] I sent a patch but we'll likely not be able to do much even if we get the right rev-id (at least now, it will require some revscoring check probably)
[18:45:57] (CR) Ilias Sarantopoulos: [C:+1] revscoring: add flag to log JSON inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020898 (https://phabricator.wikimedia.org/T362503) (owner: Elukey)
[18:46:06] so we can restart tomorrow, wdyt?
[18:46:28] feel free to merge and deploy the above change in the morning
[18:46:57] ok!
[18:47:14] nice work
[18:47:41] okkk
[18:49:07] (CR) AikoChou: [C:+1] revscoring: add flag to log JSON inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020898 (https://phabricator.wikimedia.org/T362503) (owner: Elukey)
[19:12:44] Machine-Learning-Team, ORES, Patch-For-Review: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9724071 (elukey) We are still seeing high latency for ruwiki's damaging, the current theory is that some rev-ids are causing troubles in the preprocessing (featur...
[19:13:52] checking the problematic rev-ids of "kserve.errors.InvalidInput: Missing resource for rev-id xxxxxxx: CommentDeleted: Comment deleted (datasource.revision.comment)"
[19:13:57] there are a large number of them in the logs
[19:14:10] very weird, they all seem to come from the same page
[19:14:23] https://ru.wikipedia.org/w/index.php?title=Участник:QBA-bot/Запросы_на_блокировку&action=history
[19:15:27] I'm thinking maybe these edits caused too many errors and affected other requests
[19:19:14] I've raised the min replicas for ruwiki to 4 and the max to 6 (was: 2/4)
[19:19:22] there was already some autoscaling going on
[19:19:55] with more replicas things seem to improve
[19:20:06] let's see
[19:20:17] aiko: nice finding, let's discuss it tomorrow!
[19:21:01] for the same problematic rev-ids, revscoring will re-try many times (I saw one rev-id retried 28 times)
[19:21:11] maybe we can cut it when it happens for the first time
[19:21:27] Machine-Learning-Team, ORES, Patch-For-Review: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9724117 (elukey) I've noticed some autoscaling, and high cpu usage in the kserve containers. I've raised the min/max replicas from 1/4 to 4/6, and with more capac...
[19:21:48] elukey: ok!
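On the retry counting above ("I saw one rev-id retried 28 times"), a small script over the saved pod logs makes that triage repeatable: count how often each rev-id appears in the "Missing resource for rev-id ..." errors. A minimal sketch; the filename follows the logs_kserve.log mentioned earlier, and the exact log message format is assumed to match the lines quoted above.

```python
# Hedged sketch: tally "Missing resource for rev-id ..." occurrences per rev-id
# in the saved pod logs, to spot the revisions that revscoring keeps retrying.
import collections
import re

pattern = re.compile(r"Missing resource for rev-id (\d+)")
counts = collections.Counter()

with open("logs_kserve.log") as fh:   # filename assumed from the discussion above
    for line in fh:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1

for rev_id, n in counts.most_common(10):
    print(rev_id, n)
```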