[07:11:26] (03CR) 10Elukey: Unify the meta subfield in events (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929690 (owner: 10DCausse) [08:04:16] elukey: o/ is there another repo than ^ that can generate events with the schema /mediawiki/revision/score? [08:06:36] ah yes: https://gerrit.wikimedia.org/g/mediawiki/services/change-propagation/+/b9c6ed342402c5befec24b961b008b8b2d9d4aa5/sys/ores_updates.js [08:07:08] dcausse: o/ yeah change prop, we are about to decom the stream, but we'll need to wait for mediawiki enterprise to migrate over to lift wing first [08:08:03] sure, should I bother keeping it up-to-date if I create a new version of /mediawiki/revision/score? [08:18:51] dcausse: I'd say no, eventgate will accept the old schema version right? I'd keep it as it is, we don't want to mess with that stream [08:21:00] elukey: yes event-gate should be fine, makes sense, I'll leave it as is then, thanks! [09:51:48] trying to re-deploy falcon, I'd like to get why it fails [09:52:07] my suspicion is that the current pytorch rocm hip code fails in some weird way [09:58:20] "Falcon-40B requires ~90GB of GPU memory — that’s a lot, but still less than LLaMA-65B, which Falcon outperforms. " [09:58:24] ahahahahh [09:59:07] ah yes, 90GB of VRAM. Like everyone has that lying around :D [09:59:23] the nvidia a100 has 80G, I am not sure what kind of GPUs one needs for falcon or llama [09:59:51] I'm not sure I have seen any GPU with more than what the A100 has. [10:00:14] I figure maybe places like the Goo have internal/proprietary accelerators that do. [10:01:32] IIUC one could load some models in multiple gpus [10:01:52] Ah, so sharding. Still, that's a lot of $$$ just for VRAM [10:02:14] Are there Falcon variants that would fit in 16G? [10:02:37] (03PS2) 10DCausse: events: propagate the event time with the dt field [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929735 (https://phabricator.wikimedia.org/T267648) [10:02:54] in theory the 7B should fit on a 16G card [10:02:57] I mean, even with the biggest consumer GPU (3090Ti, 24GB), you'd need four of them to run this. [10:03:05] (the 40B model) [10:04:09] So this is still using pytorch, right? [10:04:23] in theory yes [10:05:12] One thing I've seen people mention is the number of workers can make a big difference (e.g. 4 workers is fine, 12 is not, on an 8GB GPU) [10:05:27] workers? [10:06:05] I don't know enough about how we use pytorch to even know if that is a parameter we have access to, or if it makes sense at all [10:06:42] https://github.com/pytorch/pytorch/issues/16417#issuecomment-599137646 This is where I came across it, though it's a different model, etc [10:07:27] https://pytorch.org/docs/stable/data.html This being the API call in question [10:08:51] this is different, as it is referring to training where you need to load a big chunk of data to the processing unit where the model lives (in this case GPU memory) [10:08:57] But reading the docs, we probably have num_workers=1, and that should use the smallest amount of memory, so probably a red herring.
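The sharding idea mentioned above (splitting a model that doesn't fit on one card across several GPUs, or spilling part of it to CPU RAM) is what the Hugging Face transformers/accelerate stack exposes via `device_map`. A minimal sketch, assuming that stack is in use (the `revision` warning quoted later in the log comes from it); the model id, dtype and memory budgets below are illustrative assumptions, not Lift Wing settings:

```python
# Minimal sketch (not the Lift Wing code): load a large causal LM with its
# weights sharded across the available GPU(s) and, if needed, CPU RAM.
# Model id, dtype and the memory budgets are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"  # the 7B variant discussed above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # half precision roughly halves the VRAM footprint
    device_map="auto",                         # let accelerate place layers on GPU(s)/CPU
    max_memory={0: "15GiB", "cpu": "60GiB"},   # hypothetical per-device budgets for a 16G card
    trust_remote_code=True,                    # Falcon shipped custom modelling code at the time
    revision="main",                           # pin a revision, as the warning in the log suggests
)
```

The `num_workers` knob from the pytorch DataLoader docs linked above only controls how many subprocesses feed batches during data loading; it does not change how much VRAM the model weights themselves need, which matches the "red herring" conclusion.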
[10:09:07] ah, ok, TIL [10:14:40] (03CR) 10DCausse: Unify the meta subfield in events (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929690 (owner: 10DCausse) [10:14:59] (03PS2) 10AikoChou: revert-risk: change output schema and add model version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929175 [10:18:31] (03CR) 10AikoChou: [C: 03+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929175 (owner: 10AikoChou) [10:24:27] (03Merged) 10jenkins-bot: revert-risk: change output schema and add model version [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929175 (owner: 10AikoChou) [10:27:03] elukey: I think in the context of "did this load correctly into the GPU", it might also be useful to query the current memory usage of a GPU, as an indicator that something else is already using it. (unless of course we have a k8s-side method of ensuring that it isn't). [10:27:31] klausman: in theory we get a 1:1 relationship between gpu and a pod, nothing else uses it [10:27:32] With the "don't have a GPU to run on" problems we've seen, I think the latter should be the case [10:27:45] ack [10:28:34] Wonder if a GPU could ever end up in a state where VRAM is counted as allocated/used, but k8s thinks the GPU is free to use and then loading the model fails for a "lack" of VRAM [10:29:05] (in case you haven't noticed, I have very little knowledge of how memory management of VRAM is done these days :)) [10:32:21] me too, it is good to discuss :) [10:41:43] so I found something interesting while talking with SRE [10:41:58] with docker inspect etc.. I can see the following [10:41:59] /var/lib/kubelet/pods/2d466283-e123-4210-bce8-02393ccc14ba/volumes/kubernetes.io~empty-dir/kserve-provision-location [10:42:19] this is an example of emptyDir mount to /mnt/models, that is used by the storage initializer [10:42:27] /dev/mapper/vg0-kubelet 28G 13G 14G 48% /var/lib/kubelet [10:42:32] * elukey cries in a corner [10:42:51] I was convinced that emptyDirs would end up under /var/lib/docker [10:42:58] so this is probably why falcon is failing [10:43:14] you mean disk space is the issue? not (V)RAM? [10:43:21] yeah [10:43:26] there are some indications of it [10:43:30] File "/usr/local/lib/python3.9/dist-packages/s3transfer/utils.py", line 375, in write [10:43:33] self._fileobj.write(data) [10:43:36] OSError: [Errno 28] No space left on device [10:43:58] yeah. 
ENOSPC doesn't trigger on RAM issues (doubt it would for VRAM, but not sure) [10:44:10] if it was host RAM, it'd be ENOMEM [10:44:38] yes that one I know [10:44:58] but the current status of pods is weird, a lot of different errors, this one is in one of them [10:45:15] falcon-7b-instruct-gpu-predictor-default-00001-deployment-hhbx5 0/3 ContainerStatusUnknown 3 9m10s [10:45:18] falcon-7b-instruct-gpu-predictor-default-00001-deployment-hng4g 0/3 Init:ContainerStatusUnknown 4 (44m ago) 50m [10:45:21] falcon-7b-instruct-gpu-predictor-default-00001-deployment-zmqkn 0/3 ContainerStatusUnknown 8 (12m ago) 29m [10:45:25] and it varies over time :D [10:45:39] but it may all be due to the kubelet being under disk pressure [10:46:10] we have around 90G free to expand /var/lib/kubelet in theory [10:46:11] yeah, I was about to say, with parallel disk accesses and no space, all manner of weird states can happen [10:46:26] no idea if SRE wanted to keep it to 40G for some reason [10:46:33] probably not, we have new use cases [10:46:49] klausman: ok if I try to expand ml-serve1001's kubelet partition? [10:47:01] No objections [10:47:20] I wonder if we should unify the backing store for /var/lib/docker and /var/lib/kubelet [10:49:01] elukey: o/ https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/930000 and https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/930002 are patches to add the outlink stream to changeprop prod [10:49:05] We also have the 2x1.8T spinning rust in those machines that we're currently not using [10:49:28] I'm not sure if we can only autoscale the outlink transformer, or we need to configure both the transformer and predictor. outlink prediction only takes a little time and most of the time is spent on preprocessing [10:50:19] klausman: do you have time to check -^ [10:50:23] on it [10:51:49] both LGTM [10:52:24] aiko_: with the recent change, you should be able to +2 them yourself, and Jerkins will merge it [10:52:27] klausman: Aiko asked another question about scaling the transformer vs predictor, can you follow up to see if we can scale them up separately? [10:52:33] before merging etc.. [10:52:38] oh, sorry, missed that [10:52:47] Looking into it [10:56:23] super thanks [10:56:57] At first glance, I don't think there should be a problem there (running fewer predictors than transformers). But I'll keep digging a bit [11:01:44] aiko_: I think we can just try it on staging, see what happens. I see no obvious reason why it shouldn't work. [11:02:24] (03CR) 10Ilias Sarantopoulos: "For more information on Models check the docs https://fastapi.tiangolo.com/tutorial/response-model/" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929743 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [11:08:19] klausman: we can't test it on staging as we don't have a continuous stream of events to test in changeprop staging, so it's likely not to use autoscaling [11:08:49] Hm, good point [11:10:01] One thing we could do is to keep around a copy of the old deployment chart files, deploy the new state, see if there are errors and quickly roll back using the old files (copying them back into the /srv tree) if there are problems (and then do a proper Gerrit rollback PR). [11:19:44] klausman: ok, I'll go ahead to give +2 then. when it's merged, can you help deploy to changeprop production? thanks!
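The disk-pressure finding above suggests the kind of pre-flight check floated earlier: query the GPU's current memory use, and also the free space on the volume behind /mnt/models, before pulling model binaries. A rough sketch under those assumptions; the path and threshold are examples, not what the kserve storage initializer actually does:

```python
# Rough sketch of a pre-flight check along the lines discussed above; the path
# and threshold are examples, not what the kserve storage initializer does.
import shutil
import torch

def check_disk(path: str = "/mnt/models", needed_gib: float = 20) -> None:
    free_gib = shutil.disk_usage(path).free / 2**30
    print(f"{path}: {free_gib:.1f} GiB free")
    if free_gib < needed_gib:
        # This is the condition that surfaced above as OSError: [Errno 28] (ENOSPC).
        raise OSError(f"not enough free space on {path} for the model download")

def check_gpu() -> None:
    if not torch.cuda.is_available():
        print("no GPU visible to this pod")
        return
    free_b, total_b = torch.cuda.mem_get_info()  # free/total VRAM on the current device
    print(f"GPU VRAM: {free_b / 2**30:.1f} GiB free of {total_b / 2**30:.1f} GiB")

check_disk()
check_gpu()
```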
[11:20:08] sure, can do [11:27:33] <- lunch (we'll deploy afterwards) [12:08:11] 10Machine-Learning-Team: Containerize Content Translation Recommendation API - https://phabricator.wikimedia.org/T338805 (10kevinbazira) The documentation ([[ https://meta.wikimedia.org/wiki/Recommendation_API | 1 ]], [[ https://github.com/wikimedia/research-recommendation-api/blob/master/README.md | 2 ]]) doesn... [12:16:27] 10Machine-Learning-Team, 10Patch-For-Review: Create ORES migration endpoint (ORES/Liftwing translation) - https://phabricator.wikimedia.org/T330414 (10isarantopoulos) We'll need to check which of the following errors we need to support (if not all of them) https://github.com/wikimedia/revscoring/blob/master/re... [12:16:39] (03PS1) 10Ilias Sarantopoulos: ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) [12:17:50] (03CR) 10CI reject: [V: 04-1] ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [12:23:02] (03PS2) 10Ilias Sarantopoulos: ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) [12:25:20] (03PS3) 10Ilias Sarantopoulos: ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) [12:25:59] (03CR) 10Ilias Sarantopoulos: ores-legacy: Change message in RevisionNotFound error (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [12:26:54] * isaranto lunchtime [12:55:32] (03CR) 10Ottomata: Unify the meta subfield in events (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929690 (owner: 10DCausse) [12:56:08] (03CR) 10Ottomata: Unify the meta subfield in events (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929690 (owner: 10DCausse) [13:01:11] expanded the kubelet partition with 80G more [13:14:06] ok way better now, falcon is still failing but not like the last time [13:14:50] it seems that it fails to bootstrap in time, the health checks are possibly a little aggressive [13:15:17] Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision. [13:15:20] Loading checkpoint shards: 0%| | 0/2 [00:00 and then it fails [13:19:48] elukey: unrelatedly, the first bit of the articletopic update (replicacount and versions) is done [13:20:26] klausman: nice, so the transformer and predictor can scale up differently? [13:20:46] Yes [13:20:51] very nice [13:21:10] Whether that works well under heavy load, we'll have to see. Startup time of new replicas etc [13:32:02] 10Machine-Learning-Team, 10serviceops: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (10akosiaris) The other thing that I just noticed is that this service [consumes 0.4% of the resources it is allocated](https://grafana.wikimedia.org/d/Y5wk80oG... [13:40:57] elukey: are you still working on ml-s1001?
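On the "fails to bootstrap in time" point: startup here is two phases, loading the checkpoint shards into host memory and then moving the weights to the GPU (as described later in the log), and together they can take long enough to trip an aggressive probe. A hedged sketch of that two-step pattern; the model id and flags are illustrative, not the predictor's actual code:

```python
# Hedged sketch of the two-phase startup that the readiness probe has to wait for;
# the model id is illustrative and this is not the actual predictor code.
import time
import torch
from transformers import AutoModelForCausalLM

def load_model(model_id: str = "tiiuae/falcon-7b-instruct"):
    t0 = time.monotonic()
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
    )
    t1 = time.monotonic()
    print(f"loaded checkpoint shards into CPU memory in {t1 - t0:.0f}s")

    model = model.to("cuda")   # the to(device) step mentioned below
    torch.cuda.synchronize()   # make sure the transfer has actually finished
    t2 = time.monotonic()
    print(f"moved weights to the GPU in {t2 - t1:.0f}s")
    return model
```

If the two phases take longer than the probes allow, the pod never turns ready (and a failing liveness probe would restart the container), which would look much like the ContainerStatusUnknown churn seen earlier.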
[13:41:20] elukey: AM shows a high latency alert for exec_sync: https://alerts.wikimedia.org/?q=alertname%3DKubeletOperationalLatency&q=team%3Dsre&q=%40receiver%3Ddefault [13:48:49] and it stopped firing just now [13:49:04] er, no, it didn't, I am just fatfingering Alertmanager :) [15:00:48] elukey: it's weird, the alert is firing (mentioning ~1s latency), but the Grafana dashboards don't support that at all [15:30:29] elukey: sry just saw your messages above regarding falcon [15:30:54] I also checked logs...the loading thing that gets stuck is the time that the model is being loaded on GPU [15:31:01] hmm [15:31:18] actually no.. [15:31:47] isaranto: the main issue IIUC is that the readiness probe is too aggressive [15:31:50] we do it in 2 steps, so we first load the model into cpu memory and then load it to the GPU via the to(device) call [15:31:57] I am trying to add more time but it doesn't seem to be working [15:32:09] ok, if it is that then fine [15:32:55] (03PS7) 10Ilias Sarantopoulos: feat: add Response Models in ores-legacy API [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929743 (https://phabricator.wikimedia.org/T330414) [15:33:41] btw folks I added some more things in the above patch for ores-legacy. Specifically some example responses defined in a json file that are used in Swagger [15:37:45] 10Machine-Learning-Team, 10Patch-For-Review: Create ORES migration endpoint (ORES/Liftwing translation) - https://phabricator.wikimedia.org/T330414 (10isarantopoulos) I have also added some example responses in a json file. These are defined as examples in FastAPI endpoints and are used to show sample response... [16:09:58] created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/930209 that should allow us to set the readiness probe [16:16:57] nice! [16:26:13] (03CR) 10Elukey: [C: 03+1] "If we change anything on the Lift Wing side and forget to fix this it will not work, so maybe as follow up let's add a comment in the Lift" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [16:28:36] going afk for the evening folks!
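For the ores-legacy response models and Swagger examples mentioned above, the FastAPI pattern from the linked docs looks roughly like the sketch below; the route, field names and example payload are invented for illustration and are not the real ores-legacy schema:

```python
# Illustrative FastAPI sketch of a typed response model plus a canned example
# that Swagger/OpenAPI will display; route, fields and payload are invented
# and are not the real ores-legacy schema.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ScoreResponse(BaseModel):
    model_version: str
    probability: float

# In the patch above the examples live in a JSON file; here one is inlined.
EXAMPLE_RESPONSES = {
    200: {
        "description": "Successful score",
        "content": {
            "application/json": {
                "example": {"model_version": "3", "probability": 0.97}
            }
        },
    }
}

@app.get("/v3/scores/{rev_id}", response_model=ScoreResponse, responses=EXAMPLE_RESPONSES)
def score(rev_id: int) -> ScoreResponse:
    return ScoreResponse(model_version="3", probability=0.97)
```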
[16:28:37] o/ [16:53:11] 10Machine-Learning-Team, 10Research-Backlog, 10Section-Level-Image-Suggestions, 10Structured-Data-Backlog: [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (10MarkTraceur) [16:55:43] o/ [16:58:35] (03PS4) 10Ilias Sarantopoulos: ores-legacy: Change message in RevisionNotFound error [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) [16:59:19] (03CR) 10Ilias Sarantopoulos: ores-legacy: Change message in RevisionNotFound error (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/930166 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [20:02:20] (03PS3) 10DCausse: Unify the meta subfield in events [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929690 [20:02:22] (03PS3) 10DCausse: events: propagate the event time with the dt field [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929735 (https://phabricator.wikimedia.org/T267648) [20:03:00] (03CR) 10DCausse: Unify the meta subfield in events (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929690 (owner: 10DCausse) [20:09:17] (03CR) 10CI reject: [V: 04-1] events: propagate the event time with the dt field [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/929735 (https://phabricator.wikimedia.org/T267648) (owner: 10DCausse)