[06:26:03] (CR) Kevin Bazira: [C:+2] logo-detection: containerize model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598) (owner: Kevin Bazira)
[06:33:29] Good morning o/
[06:37:24] (Merged) jenkins-bot: logo-detection: containerize model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1019773 (https://phabricator.wikimedia.org/T362598) (owner: Kevin Bazira)
[07:19:25] Machine-Learning-Team: Deploy logo-detection model-server to LiftWing staging - https://phabricator.wikimedia.org/T362749 (kevinbazira) NEW
[07:20:26] Machine-Learning-Team, Patch-For-Review: Prepare docker image for hosting the logo-detection model-server on LiftWing - https://phabricator.wikimedia.org/T362598#9721447 (kevinbazira)
[07:29:51] Machine-Learning-Team, Patch-For-Review: Prepare docker image for hosting the logo-detection model-server on LiftWing - https://phabricator.wikimedia.org/T362598#9721480 (kevinbazira) The logo-detection model-server has been containerized and added to the CI pipeline which published it successfully t...
[08:37:25] Morning!
[08:40:27] \o
[09:03:13] morning o/
[09:04:37] heya, aiko \o
[09:09:09] heya!
[09:09:10] Machine-Learning-Team, Patch-For-Review: Use Huggingface model server image for HF LLMs - https://phabricator.wikimedia.org/T357986#9721763 (isarantopoulos) Tried to check the GPU after attaching to the running container and executing the following in a python console. I'm getting the same result: ` >>>...
[09:26:54] Machine-Learning-Team, ORES: Create basic alerts for isvcs to catch outages - https://phabricator.wikimedia.org/T362661#9721798 (klausman) I've experimented a bit on Thanos, and arrived at this query: ` (sum by (destination_canonical_service, prometheus) (rate(istio_requests_total{destination_canonical_...
[09:47:42] There is currently an issue going on with the mwapi, which may affect our services as well. Looking at dashboards etc
[09:49:43] ack
[09:58:56] I think we're good in the sense that there's nothing for us to do (differently). The underlying issue may be addressed for now, but proper resolution will take more time
[10:16:18] klausman: could you help me debug the GPU issue on ml-staging?
[10:16:44] doesn't have to be now ofc!
[10:21:59] Sure, can do!
[10:22:06] maybe after lunch?
[10:25:00] Sure!
[10:25:03] thank you!
[10:35:45] * isaranto lunch!
[10:35:54] ditto :)
[11:01:47] (CR) Matěj Suchánek: Exclude first/only revision on page from scoring (2 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[11:29:18] (PS1) Kevin Bazira: logo-detection: specify model name [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020710 (https://phabricator.wikimedia.org/T362749)
[11:32:59] hello folks
[11:36:13] Machine-Learning-Team, ORES: Create basic alerts for isvcs to catch outages - https://phabricator.wikimedia.org/T362661#9722143 (elukey) There are two kinds of istio metrics - the ones from the gateway and the ones from the sidecars (inbound and outbound). In theory it should be sufficient to check the...
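The GPU check quoted in T357986 above is truncated in the log; the following is a minimal sketch of the kind of PyTorch/ROCm availability check typically run from a Python console inside the container, not the exact commands from the task.

```python
# Hedged sketch: not the actual (truncated) snippet from T357986, just the
# usual availability check. Assumes a ROCm build of PyTorch, where AMD GPUs
# are exposed through the torch.cuda API.
import torch

print(torch.__version__)             # e.g. 2.1.2 for the huggingfaceserver image
print(torch.cuda.is_available())     # False here would match the failure being debugged below
print(torch.cuda.device_count())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```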
[11:39:01] (CR) Ilias Sarantopoulos: [C:+1] logo-detection: specify model name [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020710 (https://phabricator.wikimedia.org/T362749) (owner: Kevin Bazira)
[11:39:14] hello!
[11:43:35] (CR) AikoChou: [C:+1] logo-detection: specify model name [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020710 (https://phabricator.wikimedia.org/T362749) (owner: Kevin Bazira)
[11:47:50] heya Luca!
[11:52:11] (CR) Kevin Bazira: [C:+2] logo-detection: specify model name [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020710 (https://phabricator.wikimedia.org/T362749) (owner: Kevin Bazira)
[11:52:54] (Merged) jenkins-bot: logo-detection: specify model name [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020710 (https://phabricator.wikimedia.org/T362749) (owner: Kevin Bazira)
[11:52:57] isaranto: so what are you trying to do in staging? and how is it failing?
[11:57:47] I'm trying to use it in the same way we have been doing before and I get the failure reported here https://phabricator.wikimedia.org/T357986#9721763
[11:59:21] ok, having a look
[12:02:19] the difference is the following: we are using a different base image, the pytorch image, which has a different rocm version (rocm5.5) than what is installed (rocm5.4.2)
[12:03:08] mmh. that is a likely possibility
[12:03:36] yes, unfortunately this is the only thing I can think of at the moment
[12:05:07] rocm5.4 is about a year old by now
[12:06:13] It may be time to import 5.5 to the wmf repo
[12:07:29] we went with 5.5 as this is what is supported by the pytorch version we want for the huggingfaceserver (pytorch 2.1.2). Otherwise we'd have to build pytorch ourselves
[12:07:56] yeah, I'd otherwise also consider the newer versions of rocm, but I don't want to stray too far
[12:08:53] before jumping to any conclusion, we should verify how/if the k8s worker version of the driver affects the one running inside the container
[12:09:23] in theory the k8s worker one shouldn't count a lot
[12:10:01] another issue could just be that the os user can't access the GPU drivers
[12:10:19] But then the other GPU models we've tried would also fail, no?
[12:10:35] GPU-based models I mean
[12:11:29] it is a different/new image
[12:11:40] ah, good point
[12:12:46] I'm currently trying to figure out what the control point of GPU access is (i.e. what device or what user groups are usually used)
[12:14:05] in the other images (llm image) we install `libdrm-amdgpu1` through blubber. In the pytorch image we install it in the production images https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/+/refs/heads/master/images/amd/pytorch21/Dockerfile.template#5
[12:15:55] Mh, that chain of user modifications is hard to read, gimme a moment to try and understand what it does.
[12:16:28] it is all explained in the readme
[12:16:46] sure, I'm just pasting info and explaining my thoughts
[12:16:59] yep yep :)
[12:17:35] I was adding context since IIRC Tobias was afk when we merged
[12:18:48] I presume we just assume that the correct group for user somebody is also always somebody?
[12:19:12] what do you mean?
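To make the version-mismatch theory above concrete: one way to see which ROCm/HIP release the installed PyTorch wheel targets, and compare it with the ROCm userspace in the image, is the torch.version module. A minimal sketch, assuming a ROCm build of PyTorch:

```python
# Hedged sketch for the rocm5.5-wheel vs rocm5.4.2-userspace mismatch discussed above.
import torch

print(torch.version.hip)   # HIP/ROCm release the wheel was built for, e.g. "5.5.x"; None on CUDA/CPU builds
print(torch.version.cuda)  # None on ROCm builds, set on CUDA builds
```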
[12:19:21] `chown {{ "somebody" | uid }}:{{ "somebody" | uid }} /opt/lib`
[12:19:34] that is what blubber does, we had to replicate it
[12:19:37] note that the group (after `:`) is also using `uid`
[12:19:47] Ack
[12:21:05] In a non-docker system the user would also need to be in the render and video groups.
[12:21:54] yep but we apply a special permission to allow others to read the devices (kfd etc..)
[12:23:11] Where do we do that?
[12:23:26] it is in the puppet config for the amd plugin
[12:24:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/909968 IIRC
[12:24:40] I suspect the issue is between the kubelet and the k8s plugin, ok if I restart them?
[12:24:59] no objections from me
[12:25:01] there are some horrors logged by the kubelet
[12:25:47] isaranto: can you try to delete the pod?
[12:25:58] yes, on it
[12:27:02] done
[12:27:27] I don't see horrors, do you see the gpu now?
[12:27:31] elukey: do you want me to remove it completely for now? cause ofc it is recreated if I delete it
[12:27:56] nono just shake it to see if the gpu gets picked up
[12:28:18] it will take some time for the pod to start
[12:28:24] It's still in the terminating phase
[12:28:38] and until it's gone, there is no GPU for the new one
[12:29:06] (unless it can somehow signal back to the kubelet that it didn't use the GPU after all)
[12:31:45] it is gone now
[12:31:55] Yep, new one is on init
[12:33:50] Goood morning all
[12:33:57] heyo Chris
[12:34:59] isaranto: I have a question about the model download.
[12:35:26] So I see it fetches llm/Mistral-7B-Instruct-v0.2/model-00001-of-00003.safetensors (and two more), but also llm/Mistral-7B-Instruct-v0.2/pytorch_model-00001-of-00003.bin (and two more)
[12:35:37] Do we need both types?
[12:36:49] Also, still has the amdgpu_get_auth error in the logs
[12:37:33] good question! no we don't! but wanted to run some checks before I left just one of them
[12:37:42] sure, no worries.
[12:37:48] safetensors load time is much faster so probably we'll keep that one
[12:37:51] pod is up
[12:38:49] elukey: when you said you saw horrors in the kubelet log, I presume you mean the one accessible via journalctl?
[12:38:57] yep
[12:39:19] Didn't see anything obviously bad in the last few minutes, so those were likely unrelated :-/
[12:39:48] the amd-k8s-plugin is a daemon that has to create a unix socket to be contacted by the kubelet, and in turn it registers itself to the kubelet
[12:40:43] (PS12) Jsn.sherman: Exclude first/only revision on page from scoring [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281)
[12:40:49] I noticed something like
[12:40:49] Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/amd.com_gpu: connect: no such file or directory". Reconnecting...
[12:41:15] Daemon not running?
[12:41:24] plus the k8s plugin daemon seems to restart a lot, at least from what systemctl reports
[12:41:27] not sure why
[12:41:43] isaranto: can you check if the gpu is recognized?
[12:41:46] what's the unit name?
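Since the discussion above keeps coming back to whether the "somebody" user can actually open the GPU device nodes, here is a minimal in-pod sketch of that check. The device paths (/dev/kfd, /dev/dri/render*) come from the conversation; treating O_RDWR as the required open mode is an assumption.

```python
# Hedged sketch: inspect the GPU device nodes from inside the pod and try to
# open them as the container user, to separate "wrong file modes" from other failures.
import glob, os, stat

for path in ["/dev/kfd"] + glob.glob("/dev/dri/render*"):
    st = os.stat(path)
    print(path, oct(stat.S_IMODE(st.st_mode)), "uid:", st.st_uid, "gid:", st.st_gid)
    try:
        fd = os.open(path, os.O_RDWR)   # assumption: O_RDWR is what libdrm/kfd consumers need
        os.close(fd)
        print("  open(O_RDWR): ok")
    except OSError as e:
        print("  open(O_RDWR):", e)
```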
[12:42:07] you can grep for k8s or gpu and you should find it, I always forget it
[12:42:22] ah, amd-k8s-device-plugin.service
[12:42:34] (CR) Jsn.sherman: Exclude first/only revision on page from scoring (3 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/1014572 (https://phabricator.wikimedia.org/T356281) (owner: Jsn.sherman)
[12:42:53] I did a ps ax|grep amd earlier and it wasn't there, so I suspected it may have a more obscure name
[12:43:07] I'm checking
[12:45:47] it is called amd-k8s-device-plugin
[12:46:21] yepyep, found it
[12:47:29] It doesn't seem to be logging anything at all
[12:49:28] I found various stuff searching for gpu via `find / -name "*gpu*"`. I'm doing this by attaching a shell to the running container so I don't have access to a lot of commands
[12:51:34] is it ok if I temporarily test mw-api connectivity in staging? The current testing shouldn't be impacted
[12:51:44] /dev/kfd perms inside the container look right (0o666)
[12:51:46] I need to drop some configs to verify one doubt
[12:52:03] klausman: check also the gpu devices
[12:52:18] aha!
[12:52:24] /dev/dri/render*
[12:52:26] $ ls -l /dev/dri/
[12:52:28] total 0
[12:52:30] crw-rw---- 1 root video 226, 1 Apr 17 12:36 card1
[12:52:32] crw-rw-rw- 1 root 106 226, 128 Apr 17 12:36 renderD128
[12:52:49] would it need access to card1? Or only render*?
[12:53:12] IIRC only render, but let's wait for isaranto to confirm
[12:53:16] if it works or not
[12:53:44] that GID 106 is also odd, but shouldn't break anything because it's 666
[12:53:55] kfd has the same GID
[12:54:50] elukey: to confirm what? sry I'm lost a bit
[12:55:00] isaranto: if the gpu is recognized now or not :)
[12:55:07] on the pod I mean
[12:55:19] The logs for the service mentioned the same permission error, so I doubt it
[12:56:36] the gpu is there in the resources and is assigned to the pod, but from python I get the same errors
[12:57:17] I don't know if I have another way to validate the gpu status from within the pod
[12:59:19] klausman: what service do you mean? To check the logs
[12:59:20] elukey: feel free to work on mw-api connectivity (never answered your previous question about it)
[12:59:22] the kubelet?
[12:59:38] kubectl logs -n exp mistral...
[12:59:47] kubectl logs -n experimental mistral-7b-instruct-gpu-predictor-00005-deployment-5d6676d9lb6d
[12:59:49] okok so the pod also complains, didn't know
[13:00:12] yes, it's among the very first messages during startup
[13:00:31] I see thanks
[13:00:32] `amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)` four times, then it continues to load the model etc
[13:05:34] elukey: so I did something super hacky and copied radeontop and libpciaccess.so into the container and tried to run it. It seems to work, but I am not sure that is a sufficient test to prove anything
[13:06:19] It definitely recognises the GPU as an ARCTURUS model, as well as the right amount of RAM
[13:06:45] klausman: how did you copy binaries?
[13:06:50] docker cp
[13:07:02] docker cp /sbin/radeontop 0374a90bad66:/sbin
[13:07:03] okok, not great but it is staging
[13:07:18] also it is bullseye vs bookworm, but should work anyway
[13:07:35] yeah, hence, hacky.
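On the "another way to validate the gpu status from within the pod" question above, a slightly stronger check than an availability flag is to actually allocate and use a tensor on the device, which exercises the kfd/renderD* path end to end. A minimal sketch, assuming the ROCm PyTorch build shipped in the image:

```python
# Hedged sketch: force a real allocation + compute on the GPU instead of only
# asking torch whether a device exists. "cuda" maps to ROCm/HIP devices here.
import torch

try:
    x = torch.ones(1024, 1024, device="cuda")
    print("GPU compute ok:", (x @ x).sum().item())
except RuntimeError as e:
    print("GPU allocation/compute failed:", e)
```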
I was hoping it'd break and give a more useful error message than what we have so far
[13:10:43] I checked on the nllb pod that we are running on ml-serve1001 with the gpu, and
[13:10:47] crw-rw---- 1 root video 226, 2 Feb 7 14:10 card2
[13:10:49] crw-rw-rw- 1 root 106 226, 129 Feb 7 14:10 renderD129
[13:10:54] so the perms check out
[13:11:06] is that pod's user in the video group maybe?
[13:11:42] it should be somebody, same user that we have elsewhere
[13:12:39] I'm jumping in a call with Mercelis and will meet yall in our meeting later. thanks for all the help!
[13:14:42] One thought: we've been assuming this is a permission error since the message mentions `amdgpu_get_auth`, but it may be a more general failure
[13:16:22] at this point it may be a lot of things, but from libdrm's code it seems that the result should be an fd pointing to the device
[13:16:39] https://github.com/grate-driver/libdrm/blob/master/amdgpu/amdgpu_device.c#L80
[13:16:40] (CR) Ilias Sarantopoulos: [C:+2] update revertrisk-language-agnostic min & desc [extensions/ORES] - https://gerrit.wikimedia.org/r/1014519 (https://phabricator.wikimedia.org/T348298) (owner: Jsn.sherman)
[13:28:04] it's a shame we don't have the errno that is set in that ioctl context, it would tell us if it's actually an EPERM or something else.
[13:28:38] at least we know it's an OS-level error (<0), not a driver error (>0)
[13:29:54] (Merged) jenkins-bot: update revertrisk-language-agnostic min & desc [extensions/ORES] - https://gerrit.wikimedia.org/r/1014519 (https://phabricator.wikimedia.org/T348298) (owner: Jsn.sherman)
[13:38:37] (PS1) AikoChou: revertrisk: add support for base model's payloads in batch model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020835 (https://phabricator.wikimedia.org/T358744)
[13:54:01] (CR) AikoChou: "Now we have two choices: 1) create a new endpoint for the batch model and don't touch the original RRLA model server or 2) use this patch " [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020835 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou)
[14:25:21] Machine-Learning-Team, Goal: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of isvcs - https://phabricator.wikimedia.org/T362674#9722637 (isarantopoulos)
[14:26:07] Machine-Learning-Team, Goal: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services - https://phabricator.wikimedia.org/T362674#9722640 (isarantopoulos)
[14:49:35] ok I just found out something weird
[14:49:57] in our istio sidecar config, using api-ro.discovery.wmnet:80 (with the implicit proxy to :443) works
[14:50:16] but if we use the new mw-api-int-ro.discovery.wmnet:4460 it doesn't
[14:50:54] I kinda know why the new one doesn't work, in fact there is a solution to make it work
[14:50:58] not sure why the current config works though
[14:52:58] ah okok it seems a matter of using 80 vs specifying a port like 4680
[14:53:01] * elukey sigh
[14:53:08] maybe implicit rules somewhere?
[14:56:07] in theory no, seems more like a bug
[14:58:56] like https://github.com/istio/istio/issues/21914
[14:59:54] hmm.
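For the istio sidecar oddity above (api-ro on the implicit :80 proxy works, mw-api-int-ro on :4460 does not), one way to reproduce the comparison from inside an affected pod is to hit both endpoints and compare the outcome. This is a hedged sketch: the hostnames and ports come from the discussion, but the request path and the Host header are assumptions for illustration only.

```python
# Hedged sketch of the api-ro vs mw-api-int-ro comparison discussed above.
import requests

targets = [
    "http://api-ro.discovery.wmnet/w/api.php?action=query&meta=siteinfo&format=json",
    "http://mw-api-int-ro.discovery.wmnet:4460/w/api.php?action=query&meta=siteinfo&format=json",
]
for url in targets:
    try:
        r = requests.get(url, headers={"Host": "en.wikipedia.org"}, timeout=5)  # Host header assumed
        print(url, "->", r.status_code)
    except requests.RequestException as e:
        print(url, "->", e)
```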
[15:00:00] seems to be another obscure istio "Feature"
[15:00:10] * elukey brb
[15:01:31] I straced the simple test for torch/cuda in Python and in the log I see:
[15:01:38] access("/dev/dri/renderD128", F_OK) = -1 EPERM (Operation not permitted)
[15:02:45] So there is definitely a permission issue
[15:03:22] aha!
[15:09:17] this is wild. F_OK on access() only checks for the existence of a file, not actual access bits
[15:10:29] And even if that is normal, the next syscall is:
[15:10:41] ioctl(7, DRM_IOCTL_GET_CLIENT, 0x7fffeea94eb0) = -1 EACCES (Permission denied)
[15:11:11] I'll pastebin the whole sequence
[15:11:58] https://phabricator.wikimedia.org/P60791
[15:12:48] Note how after the ioctl, we immediately see the error message being emitted
[15:18:12] did you strace directly on the ml-staging node?
[15:18:23] I did the same hack again of copying in the binary
[15:18:39] I don't trust tracing from host into container
[15:18:54] strace should work without the hack, you can use the pid of the container
[15:19:13] (same for perf etc.. I used it in the past)
[15:19:23] if you want to take a look at the log, it's in my homedir there as strace-gpu.out
[15:19:43] Note that it's the whole thing from python starting, so somewhat long
[15:39:00] gotta run an errand, bbiab
[15:40:13] I'm going afk folks, lemme know if I should do/try anything else for the gpu, I'll check later. cu tomorrow!
[15:47:18] Machine-Learning-Team, MW-on-K8s, serviceops, SRE, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9723080 (elukey) Added some thoughts to T353622#9723070, I found out a big can of worms while testing staging :) The upgrade is more complex than...
[15:47:19] Machine-Learning-Team, Patch-For-Review: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing - https://phabricator.wikimedia.org/T353622#9723070 (elukey) Today I found out https://github.com/istio/istio/issues/21914 after a lot of debugging in staging fo...
[15:47:30] I added some thoughts about today's rabbit hole with istio in https://phabricator.wikimedia.org/T353622#9723070
[15:47:45] the move to mw-api-int-ro will be a little more complex than anticipated
[15:53:43] Machine-Learning-Team, Patch-For-Review: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing - https://phabricator.wikimedia.org/T353622#9723267 (elukey) Overall steps: 1) Revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020738 since it is a...
[15:55:32] at least now everything makes kind-of sense
[15:56:15] I'll let others verify that I am not totally wrong before proceeding :D
[16:06:17] I'll send feedback tomorrow morning if that is soon enough? Currently my brain is still in GPUs-and-EPERM mode :D
[16:07:42] even later, not super urgent, I'll prep changes etc..
[16:07:53] ok, good.
[16:08:06] it is just one more service entry, hopefully the repetition of the hostnames will be handled via yaml anchors
[16:11:04] have a nice rest of the day folks! logging off
[16:11:12] seeya
[17:55:13] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9723796 (Q-bit-array) Just to let you know - LiftWing/ORES has crashed again `name=curl output example $ curl https://api.wikimedia.org/service/lw/inference/v1/models/ruwiki-damaging:...
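The "F_OK only checks for existence" point from the strace discussion above can be shown directly from Python; under normal file-permission semantics an existence check should not return EPERM, which is why the error looked like a device/namespace-level denial rather than plain file modes. A minimal sketch:

```python
# Hedged illustration of the access(F_OK) point above: F_OK asks only
# "does the path exist?", while R_OK/W_OK check the actual permission bits.
import os

path = "/dev/dri/renderD128"   # device path taken from the strace output
print("exists (F_OK):", os.access(path, os.F_OK))
print("read/write (R_OK|W_OK):", os.access(path, os.R_OK | os.W_OK))
```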
[18:10:16] o/
[18:11:15] regarding the above message I'm able to get a response https://phabricator.wikimedia.org/P60811
[18:16:44] isaranto: o/ preprocess() latencies are super high :(
[18:16:49] I can't repro either though
[18:17:13] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9723905 (isarantopoulos) @Q-bit-array I just managed to make the above request a couple of times. ` $ curl https://api.wikimedia.org/service/lw/inference/v1/models/ruwiki-damaging:pr...
[18:18:14] yes, given these preprocess latencies a timeout seems possible
[18:18:40] I am going to save logs and then kill the pod
[18:19:18] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9723914 (Q-bit-array) Yes, now it works again. But we had 15-20 minutes of outage just before I posted here.
[18:20:37] 2024-04-17 13:46:26.685 kserve.trace requestId: 8c556b9e-4aec-4ae8-a28a-23a2da6839fd, preprocess_ms: 558.698177338, explain_ms: 0, predict_ms: 1.362085342, postprocess_ms: 0.00166893
[18:20:49] it is surely a bug in revscoring, same behavior
[18:22:19] ok saved logs in /home/elukey/T362503 on deploy1002
[18:22:33] or could it be an mwapi issue?
[18:23:44] Tobias mentioned an ongoing issue with mwapi earlier today but I never asked for more info
[18:25:27] Machine-Learning-Team, ORES: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9723930 (elukey) Saved ruwiki's pod logs to deploy1002:/home/elukey/T362503 I noticed high latency for preprocess(), plus a lot of the following: ` 2024-04-17 18:18:22.227 kserve.tra...
[18:25:29] nono it is revscoring
[18:25:43] even the other time everything was fine
[18:25:49] plus we'd have seen the issue elsewhere
[18:26:19] I checked the chans and no issue of outages
[18:26:22] *no sign
[18:27:09] ok. I will start debugging this early morning then. I downloaded the logs for now locally to check them
[18:29:08] I'm thinking of also just logging every payload from now on so at least next time this happens we have more info
[18:31:27] thanks for deleting the pod Luca. Good night!
[18:31:52] yeah I agree we can probably turn on the json payload log for ruwiki so we are prepared
[18:31:55] thank you too!
[18:33:12] isaranto: still seeing high latency though :(
[18:33:51] the main problem seems to be get_revscoring_extractor_cache
[18:39:08] (PS1) Elukey: revscoring: add flag to log JSON inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020898 (https://phabricator.wikimedia.org/T362503)
[18:39:39] looking at the logs_kserve.log..
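Since several people above are probing ruwiki-damaging by hand, here is a minimal sketch of a timed request of the kind used to reproduce the latency. The :predict endpoint matches the curl quoted in T362503; the rev_id below is an illustrative placeholder, not one of the problematic revisions, and timings obviously depend on the pod state.

```python
# Hedged sketch: time a single ruwiki-damaging prediction request, mirroring
# the curl checks quoted above. rev_id is a placeholder value.
import json, time, urllib.request

url = "https://api.wikimedia.org/service/lw/inference/v1/models/ruwiki-damaging:predict"
payload = json.dumps({"rev_id": 123456789}).encode()
req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})

t0 = time.time()
with urllib.request.urlopen(req, timeout=30) as resp:
    body = json.load(resp)
print(f"took {time.time() - t0:.2f}s ->", json.dumps(body)[:200])
```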
yeah there are a lot of KeyErrors in revscoring/extractors/api/util.py
[18:40:17] (CR) CI reject: [V:-1] revscoring: add flag to log JSON inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020898 (https://phabricator.wikimedia.org/T362503) (owner: Elukey)
[18:40:32] and a lot of "Missing resource for rev-id xxx" >> we can see which rev-id
[18:41:08] ouch, yes we have high latency still
[18:43:58] (PS2) Elukey: revscoring: add flag to log JSON inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020898 (https://phabricator.wikimedia.org/T362503)
[18:44:38] aiko: afaics the latest requests have high preprocess latency without any log :(
[18:45:21] I sent a patch but we'll likely not be able to do much even if we get the right rev-id (at least now, it will require some revscoring check probably)
[18:45:57] (CR) Ilias Sarantopoulos: [C:+1] revscoring: add flag to log JSON inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020898 (https://phabricator.wikimedia.org/T362503) (owner: Elukey)
[18:46:06] so we can restart tomorrow, wdyt?
[18:46:28] feel free to merge and deploy the above change in the morning
[18:46:57] ok!
[18:47:14] nice work
[18:47:41] okkk
[18:49:07] (CR) AikoChou: [C:+1] revscoring: add flag to log JSON inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1020898 (https://phabricator.wikimedia.org/T362503) (owner: Elukey)
[19:12:44] Machine-Learning-Team, ORES, Patch-For-Review: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9724071 (elukey) We are still seeing high latency for ruwiki's damaging, the current theory is that some rev-ids are causing troubles in the preprocessing (featur...
[19:13:52] checking the problematic rev-ids of "kserve.errors.InvalidInput: Missing resource for rev-id xxxxxxx: CommentDeleted: Comment deleted (datasource.revision.comment)"
[19:13:57] there are a large number of them in the logs
[19:14:10] very weird, they all seem to come from the same page
[19:14:23] https://ru.wikipedia.org/w/index.php?title=Участник:QBA-bot/Запросы_на_блокировку&action=history
[19:15:27] I'm thinking maybe these edits caused too many errors and affected other requests
[19:19:14] I've raised the min replicas for ruwiki to 4 and the max to 6 (was: 2/4)
[19:19:22] there was already some autoscaling going on
[19:19:55] with more replicas things seem to improve
[19:20:06] let's see
[19:20:17] aiko: nice finding, let's discuss it tomorrow!
[19:21:01] for the same problematic rev-ids, revscoring will re-try many times (I saw one rev-id retried 28 times)
[19:21:11] maybe we can cut it when it happens for the first time
[19:21:27] Machine-Learning-Team, ORES, Patch-For-Review: ORES doesn't work (at least for ru- and ukwiki) - https://phabricator.wikimedia.org/T362503#9724117 (elukey) I've noticed some autoscaling, and high cpu usage in the kserve containers. I've raised the min/max replicas from 1/4 to 4/6, and with more capac...
[19:21:48] elukey: ok!
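On the retry counting above ("I saw one rev-id retried 28 times"), a small script over the saved pod logs makes that triage repeatable: count how often each rev-id appears in the "Missing resource for rev-id ..." errors. A minimal sketch; the filename follows the logs_kserve.log mentioned earlier, and the exact log message format is assumed to match the lines quoted above.

```python
# Hedged sketch: tally "Missing resource for rev-id ..." occurrences per rev-id
# in the saved pod logs, to spot the revisions that revscoring keeps retrying.
import collections
import re

pattern = re.compile(r"Missing resource for rev-id (\d+)")
counts = collections.Counter()

with open("logs_kserve.log") as fh:   # filename assumed from the discussion above
    for line in fh:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1

for rev_id, n in counts.most_common(10):
    print(rev_id, n)
```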