[07:11:26] (03CR) 10Elukey: Unify the meta subfield in events (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929690 (owner: 10DCausse) [08:04:16] elukey: o/ is there another repo than ^ that can generate events with the schema /mediawiki/revision/score? [08:06:36] ah yes: https://gerrit.wikimedia.org/g/mediawiki/services/change-propagation/+/b9c6ed342402c5befec24b961b008b8b2d9d4aa5/sys/ores_updates.js [08:07:08] dcausse: o/ yeah change prop, we are about to decom the stream, but we'll need to wait for mediawiki enterprise to migrate over to lift wing first [08:08:03] sure, should I bother keeping it up-to-date if I create a new version of /mediawiki/revision/score? [08:18:51] dcausse: I'd say no, eventgate will accept the old schema version right? I'd keep it as it is, we don't want to mess with that stream [08:21:00] elukey: yes event-gate should be fine, makes sense, I'll leave it as is then, thanks! [09:51:48] trying to re-deploy falcon, I'd like to get why it fails [09:52:07] my suspicion is that the current pytorch rocm hip code fails in some weird way [09:58:20] "Falcon-40B requires ~90GB of GPU memory — that’s a lot, but still less than LLaMA-65B, which Falcon outperforms. " [09:58:24] ahahahahh [09:59:07] ah yes, 90GB of VRAM. Like everyone has that lying around :D [09:59:23] the nvidia a100 has 80G, I am not sure what kind of GPUs one needs for falcon or llama [09:59:51] I'm not sure I have seen any GPU with more than what the A100 has. [10:00:14] I figure maybe places like the Goo have internal/proprietary accelerators that do. [10:01:32] IIUC one could load some models in multiple gpus [10:01:52] Ah, so sharding. Still, that's a lot of $$$ just for VRAM [10:02:14] Are there Falcon variants that would fit in 16G? [10:02:37] (03PS2) 10DCausse: events: propagate the event time with the dt field [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929735 (https://phabricator.wikimedia.org/T267648) [10:02:54] in theory the 7B should fit on a 16G card [10:02:57] I mean, even with the biggest consumer GPU (3090Ti, 24GB), you'd need four of them to run this. [10:03:05] (the 40B model) [10:04:09] So this is still using pytorch, right? [10:04:23] in theory yes [10:05:12] One thing I've seen people mention is the number of workers can make a big difference (e.g. 4 workers is fine, 12 is not, on an 8GB GPU) [10:05:27] workers? [10:06:05] I don't know enough about how we use pytorch to even know if that is a parameter we have access to, or if it makes sense at all [10:06:42] https://github.com/pytorch/pytorch/issues/16417#issuecomment-599137646 This is where I came across it, though it's a different model, etc [10:07:27] https://pytorch.org/docs/stable/data.html This being the API call in question [10:08:51] this is different, as it is referring to training where you need to load a big chunk of data to the processing unit where the model lives (in this case GPU memory) [10:08:57] But reading the docs, we probably have num_workers=1, and that should use the smallest amount of memory, so probably a red herring.
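The sharding idea mentioned above (splitting a model that doesn't fit on one card across several GPUs, or spilling part of it to CPU RAM) is what the Hugging Face transformers/accelerate stack exposes via `device_map`. A minimal sketch, assuming that stack is in use (the `revision` warning quoted later in the log comes from it); the model id, dtype and memory budgets below are illustrative assumptions, not Lift Wing settings:

```python
# Minimal sketch (not the Lift Wing code): load a large causal LM with its
# weights sharded across the available GPU(s) and, if needed, CPU RAM.
# Model id, dtype and the memory budgets are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"  # the 7B variant discussed above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # half precision roughly halves the VRAM footprint
    device_map="auto",                         # let accelerate place layers on GPU(s)/CPU
    max_memory={0: "15GiB", "cpu": "60GiB"},   # hypothetical per-device budgets for a 16G card
    trust_remote_code=True,                    # Falcon shipped custom modelling code at the time
    revision="main",                           # pin a revision, as the warning in the log suggests
)
```

The `num_workers` knob from the pytorch DataLoader docs linked above only controls how many subprocesses feed batches during data loading; it does not change how much VRAM the model weights themselves need, which matches the "red herring" conclusion.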
[10:09:07] ah, ok, TIL [10:14:40] (03CR) 10DCausse: Unify the meta subfield in events (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929690 (owner: 10DCausse) [10:14:59] (03PS2) 10AikoChou: revert-risk: change output schema and add model version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929175 [10:18:31] (03CR) 10AikoChou: [C: 03+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929175 (owner: 10AikoChou) [10:24:27] (03Merged) 10jenkins-bot: revert-risk: change output schema and add model version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929175 (owner: 10AikoChou) [10:27:03] elukey: I think in the context of "did this load correctly into the GPU", it might also be useful to query the current memory usage of a GPU, as an indicator that something else is already using it. (unless of course we have a k8s-side method of ensuring that it isn't). [10:27:31] klausman: in theory we get a 1:1 relationship between gpu and a pod, nothing else uses it [10:27:32] With the "don't have a GPU to run on" problems we've seen, I think the latter should be the case [10:27:45] ack [10:28:34] Wonder if a GPU could ever end up in a state where VRAM is counted as allocated/used, but k8s thinks the GPU is free to use and then loading the model fails for a "lack" of VRAM [10:29:05] (in case you haven't noticed, I have very little knowledge of how memory management of VRAM is done these days :)) [10:32:21] me too, it is good to discuss :) [10:41:43] so I found something interesting while talking with SRE [10:41:58] with docker inspect etc.. I can see the following [10:41:59] /var/lib/kubelet/pods/2d466283-e123-4210-bce8-02393ccc14ba/volumes/kubernetes.io~empty-dir/kserve-provision-location [10:42:19] this is an example of emptyDir mount to /mnt/models, that is used by the storage initializer [10:42:27] /dev/mapper/vg0-kubelet 28G 13G 14G 48% /var/lib/kubelet [10:42:32] * elukey cries in a corner [10:42:51] I was convinced that emptyDirs would end up under /var/lib/docker [10:42:58] so this is probably why falcon is failing [10:43:14] you mean disk space is the issue? not (V)RAM? [10:43:21] yeah [10:43:26] there are some indications of it [10:43:30] File "/usr/local/lib/python3.9/dist-packages/s3transfer/utils.py", line 375, in write [10:43:33] self._fileobj.write(data) [10:43:36] OSError: [Errno 28] No space left on device [10:43:58] yeah. 
ENOSPC doesn't trigger on RAM issues (doubt it would for VRAM, but not sure) [10:44:10] if it was host RAM, it'd be ENOMEM [10:44:38] yes that one I know [10:44:58] but the current status of pods is weird, a lot of different errors, this one is in one of them [10:45:15] falcon-7b-instruct-gpu-predictor-default-00001-deployment-hhbx5 0/3 ContainerStatusUnknown 3 9m10s [10:45:18] falcon-7b-instruct-gpu-predictor-default-00001-deployment-hng4g 0/3 Init:ContainerStatusUnknown 4 (44m ago) 50m [10:45:21] falcon-7b-instruct-gpu-predictor-default-00001-deployment-zmqkn 0/3 ContainerStatusUnknown 8 (12m ago) 29m [10:45:25] and it varies over time :D [10:45:39] but it may all be due to the kubelet being under disk pressure [10:46:10] we have around 90G free to expand /var/lib/kubelet in theory [10:46:11] yeah, I was about to say, with parallel disk accesses and no space, all manner of weird states can happen [10:46:26] no idea if SRE wanted to keep it to 40G for some reason [10:46:33] probably not, we have new use cases [10:46:49] klausman: ok if I try to expand ml-serve1001's kubelet partition? [10:47:01] No objections [10:47:20] I wonder if we should unify the backing store for /var/lib/docker and /var/lib/kubelet [10:49:01] elukey: o/ https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/930000 and https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/930002 are patches to add the outlink stream to changeprop prod [10:49:05] We also have the 2x1.8T spinning rust in those machines that we're currently not using [10:49:28] I'm not sure if we can only autoscale the outlink transformer, or we need to configure both the transformer and predictor. outlink prediction only takes a little time and most of the time is spent on preprocessing [10:50:19] klausman: do you have time to check -^ [10:50:23] on it [10:51:49] both LGTM [10:52:24] aiko_: with the recent change, you should be able to +2 them yourself, and Jerkins will merge it [10:52:27] klausman: Aiko asked another question about scaling the transformer vs predictor, can you follow up to see if we can scale them up separately? [10:52:33] before merging etc.. [10:52:38] oh, sorry, missed that [10:52:47] Looking into it [10:56:23] super thanks [10:56:57] At first glance, I don't think there should be a problem there (running fewer predictors than transformers). But I'll keep digging a bit [11:01:44] aiko_: I think we can just try it on staging, see what happens. I see no obvious reason why it shouldn't work. [11:02:24] (03CR) 10Ilias Sarantopoulos: "For more information on Models check the docs https://fastapi.tiangolo.com/tutorial/response-model/" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929743 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [11:08:19] klausman: we can't test it on staging as we don't have a continuous stream of events to test in changeprop staging, so it's likely not to use autoscaling [11:08:49] Hm, good point [11:10:01] One thing we could do is to keep around a copy of the old deployment chart files, deploy the new state, see if there are errors and quickly roll back using the old files (copying them back into the /srv tree) if there are problems (and then do a proper Gerrit rollback PR). [11:19:44] klausman: ok, I'll go ahead to give +2 then. when it's merged, can you help deploy to changeprop production? thanks!
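The disk-pressure finding above suggests the kind of pre-flight check floated earlier: query the GPU's current memory use, and also the free space on the volume behind /mnt/models, before pulling model binaries. A rough sketch under those assumptions; the path and threshold are examples, not what the kserve storage initializer actually does:

```python
# Rough sketch of a pre-flight check along the lines discussed above; the path
# and threshold are examples, not what the kserve storage initializer does.
import shutil
import torch

def check_disk(path: str = "/mnt/models", needed_gib: float = 20) -> None:
    free_gib = shutil.disk_usage(path).free / 2**30
    print(f"{path}: {free_gib:.1f} GiB free")
    if free_gib < needed_gib:
        # This is the condition that surfaced above as OSError: [Errno 28] (ENOSPC).
        raise OSError(f"not enough free space on {path} for the model download")

def check_gpu() -> None:
    if not torch.cuda.is_available():
        print("no GPU visible to this pod")
        return
    free_b, total_b = torch.cuda.mem_get_info()  # free/total VRAM on the current device
    print(f"GPU VRAM: {free_b / 2**30:.1f} GiB free of {total_b / 2**30:.1f} GiB")

check_disk()
check_gpu()
```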
[11:20:08] sure, can do [11:27:33] <- lunch (we'll deploy afterwards) [12:08:11] 10Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (10kevinbazira) The documentation ([[ https://meta.wikimedia.org/wiki/Recommendation_API | 1 ]], [[ https://github.com/wikimedia/research-recommendation-api/blob/master/README.md | 2 ]]) doesn... [12:16:27] 10Machine-Learning-Team, 10Patch-For-Review: Create ORES migration endpoint (ORES/Liftwing translation) - https://phabricator.wikimedia.org/T330414 (10isarantopoulos) We'll need to check which of the following errors we need to support (if not all of them) https://github.com/wikimedia/revscoring/blob/master/re... [12:16:39] (03PS1) 10Ilias Sarantopoulos: ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) [12:17:50] (03CR) 10CI reject: [V: 04-1] ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [12:23:02] (03PS2) 10Ilias Sarantopoulos: ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) [12:25:20] (03PS3) 10Ilias Sarantopoulos: ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) [12:25:59] (03CR) 10Ilias Sarantopoulos: ores-legacy: Change message in RevisionNotFound error (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [12:26:54] * isaranto lunchtime [12:55:32] (03CR) 10Ottomata: Unify the meta subfield in events (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929690 (owner: 10DCausse) [12:56:08] (03CR) 10Ottomata: Unify the meta subfield in events (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929690 (owner: 10DCausse) [13:01:11] expanded the kubelet partition with 80G more [13:14:06] ok way better now, falcon is still failing but not like the last time [13:14:50] it seems that it fails to bootstrap in time, the health checks are possibly a little aggressive [13:15:17] Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision. [13:15:20] Loading checkpoint shards: 0%| | 0/2 [00:00 and then it fails [13:19:48] elukey: unrelatedly, the first bit of the articletopic update (replicacount and versions) is done [13:20:26] klausman: nice, so the transformer and predictor can scale up differently? [13:20:46] Yes [13:20:51] very nice [13:21:10] Whether that works well under heavy load, we'll have to see. Startup time of new replicas etc [13:32:02] 10Machine-Learning-Team, 10serviceops: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (10akosiaris) The other thing that I just noticed is that this service [consumes 0.4% of the resources it is allocated](https://grafana.wikimedia.org/d/Y5wk80oG... [13:40:57] elukey: are you still working on ml-s1001?
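On the "fails to bootstrap in time" point: startup here is two phases, loading the checkpoint shards into host memory and then moving the weights to the GPU (as described later in the log), and together they can take long enough to trip an aggressive probe. A hedged sketch of that two-step pattern; the model id and flags are illustrative, not the predictor's actual code:

```python
# Hedged sketch of the two-phase startup that the readiness probe has to wait for;
# the model id is illustrative and this is not the actual predictor code.
import time
import torch
from transformers import AutoModelForCausalLM

def load_model(model_id: str = "tiiuae/falcon-7b-instruct"):
    t0 = time.monotonic()
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
    )
    t1 = time.monotonic()
    print(f"loaded checkpoint shards into CPU memory in {t1 - t0:.0f}s")

    model = model.to("cuda")   # the to(device) step mentioned below
    torch.cuda.synchronize()   # make sure the transfer has actually finished
    t2 = time.monotonic()
    print(f"moved weights to the GPU in {t2 - t1:.0f}s")
    return model
```

If the two phases take longer than the probes allow, the pod never turns ready (and a failing liveness probe would restart the container), which would look much like the ContainerStatusUnknown churn seen earlier.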
[13:41:20] elukey: AM shows a high latency alert for exec_sync: https://alerts.wikimedia.org/?q=alertname%3DKubeletOperationalLatency&q=team%3Dsre&q=%40receiver%3Ddefault [13:48:49] and it stopped firing just now [13:49:04] er, no, it didn't, I am just fatfingering Alertmanager :) [15:00:48] elukey: it's weird, the alert is firing (mentioning ~1s latency), but the Grafana dashboards don't support that at all [15:30:29] elukey: sry just saw your messages above regarding falcon [15:30:54] I also checked logs...the loading thing that gets stuck is the time that the model is being loaded on GPU [15:31:01] hmm [15:31:18] actually no.. [15:31:47] isaranto: the main issue IIUC is that the readiness probe is too aggressive [15:31:50] we do it in 2 steps, so we first load the model into cpu memory and then load it to the GPU via the to(device) call [15:31:57] I am trying to add more time but it doesn't seem to be working [15:32:09] ok, if it is that then fine [15:32:55] (03PS7) 10Ilias Sarantopoulos: feat: add Response Models in ores-legacy API [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929743 (https://phabricator.wikimedia.org/T330414) [15:33:41] btw folks I added some more things in the above patch for ores-legacy. Specifically some example responses defined in a json file that are used in Swagger [15:37:45] 10Machine-Learning-Team, 10Patch-For-Review: Create ORES migration endpoint (ORES/Liftwing translation) - https://phabricator.wikimedia.org/T330414 (10isarantopoulos) I have also added some example responses in a json file. These are defined as examples in FastAPI endpoints and are used to show sample response... [16:09:58] created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/930209 that should allow us to set the readiness probe [16:16:57] nice! [16:26:13] (03CR) 10Elukey: [C: 03+1] "If we change anything on the Lift Wing side and forget to fix this it will not work, so maybe as follow up let's add a comment in the Lift" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [16:28:36] going afk for the evening folks!
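For the ores-legacy response models and Swagger examples mentioned above, the FastAPI pattern from the linked docs looks roughly like the sketch below; the route, field names and example payload are invented for illustration and are not the real ores-legacy schema:

```python
# Illustrative FastAPI sketch of a typed response model plus a canned example
# that Swagger/OpenAPI will display; route, fields and payload are invented
# and are not the real ores-legacy schema.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ScoreResponse(BaseModel):
    model_version: str
    probability: float

# In the patch above the examples live in a JSON file; here one is inlined.
EXAMPLE_RESPONSES = {
    200: {
        "description": "Successful score",
        "content": {
            "application/json": {
                "example": {"model_version": "3", "probability": 0.97}
            }
        },
    }
}

@app.get("/v3/scores/{rev_id}", response_model=ScoreResponse, responses=EXAMPLE_RESPONSES)
def score(rev_id: int) -> ScoreResponse:
    return ScoreResponse(model_version="3", probability=0.97)
```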
[16:28:37] o/ [16:53:11] 10Machine-Learning-Team, 10Research-Backlog, 10Section-Level-Image-Suggestions, 10Structured-Data-Backlog: [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (10MarkTraceur) [16:55:43] o/ [16:58:35] (03PS4) 10Ilias Sarantopoulos: ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) [16:59:19] (03CR) 10Ilias Sarantopoulos: ores-legacy: Change message in RevisionNotFound error (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [20:02:20] (03PS3) 10DCausse: Unify the meta subfield in events [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929690 [20:02:22] (03PS3) 10DCausse: events: propagate the event time with the dt field [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929735 (https://phabricator.wikimedia.org/T267648) [20:03:00] (03CR) 10DCausse: Unify the meta subfield in events (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929690 (owner: 10DCausse) [20:09:17] (03CR) 10CI reject: [V: 04-1] events: propagate the event time with the dt field [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929735 (https://phabricator.wikimedia.org/T267648) (owner: 10DCausse)