[06:47:59] Good morning! 🌞
[08:06:39] (CR) Kevin Bazira: [C: +1] llm: update transformers module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989831 (https://phabricator.wikimedia.org/T354870) (owner: Ilias Sarantopoulos)
[08:11:55] (CR) Ilias Sarantopoulos: [C: +2] llm: update transformers module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989831 (https://phabricator.wikimedia.org/T354870) (owner: Ilias Sarantopoulos)
[08:20:47] (Merged) jenkins-bot: llm: update transformers module [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989831 (https://phabricator.wikimedia.org/T354870) (owner: Ilias Sarantopoulos)
[09:49:58] Morning :)
[09:51:02] o/ Tobias
[10:29:21] browser crashed. lost all my tabs. disturbing and liberating at the same time :)
[10:32:46] :D
[10:46:22] klausman: are you ok with this? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/989913
[10:47:00] vaguely. let me check something re: GPU reservations. I'll add a +1 or more info in a few mins
[10:47:31] we can discuss here if u want
[10:48:12] I am unsure what happens if NLLB "accidentally" is scheduled on a machine with an idle GPU
[10:48:37] Will it try and succeed to use it? Or will it fail? Or will it just not see it?
[10:50:14] it will just use the CPU. Where we may have an issue is when a pod requests a GPU and there is no cpu/memory capacity left on the GPU node
[10:50:42] Ack.
[10:52:39] in the future we could do the following: label nodes that have a GPU, e.g. "gpu-node" or whatever, and then add a nodeSelector to our deployments
[10:52:55] in the same context we can have a "cpu-node" label
[10:52:56] https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
[10:53:16] ty!
[10:56:54] there could be many ways to do it. I would advocate for the simplest one that works
[11:04:08] yeah, agreed
[11:37:26] (PS1) Kevin Bazira: test: refactor langid and ores-legacy load tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/989863 (https://phabricator.wikimedia.org/T354722)
[11:39:45] I was getting OOM errors and figured out our limitranges were too low https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/990044
[11:45:29] * klausman lunch
[12:41:19] * isaranto lunch!
[13:51:20] (PS1) Jforrester: extension.json: Drop RL targets definitions, no longer honoured [extensions/ORES] - https://gerrit.wikimedia.org/r/990097 (https://phabricator.wikimedia.org/T328497)
[14:36:45] isaranto: I can plus-2 the limitrange change and push it if you want. Or should I wait?
[14:49:01] Good morning
[14:54:53] Morning Chris, I see you're slowly returning to more normal waking times :)
[14:57:30] Hey Chris!
[15:06:39] klausman: if it is ok please deploy it :)
[15:06:45] on it
[15:06:49] ty!
[15:08:07] hmmm. the diff gives me pause
[15:10:05] https://phabricator.wikimedia.org/P54717
[15:11:27] I am not sure I understand limitRanges correctly
[15:11:30] what do u mean by "pause"?
[15:12:17] Are we sure we're not now requesting 20/22Gi for all those services?
[15:13:19] I think it's ok, but I want to be sure.
[15:14:10] Mh, never mind, I think I grok'd it now
[15:15:57] it is ok. it sets the upper bound for resources
[15:16:07] and pushed.
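As a rough sketch of the node-labelling idea linked above (not the actual Lift Wing configuration): GPU nodes get a label, and GPU-hungry workloads select for it via nodeSelector, while CPU-only services either target a "cpu-node" label or set no selector at all. The node name, label key/value, pod name, and image below are placeholders, and the GPU resource name depends on the device plugin in use.

```yaml
# Hypothetical example of the nodeSelector approach (placeholder names throughout).
# First, a node that has a GPU is labelled once, e.g.:
#   kubectl label node ml-serve-example-node node-type=gpu-node
# Then a GPU workload opts into those nodes via nodeSelector.
apiVersion: v1
kind: Pod
metadata:
  name: falcon-7b-instruct-gpu-example     # placeholder name
spec:
  nodeSelector:
    node-type: gpu-node                    # schedule only onto nodes carrying this label
  containers:
    - name: kserve-container
      image: example/llm-server:latest     # placeholder image
      resources:
        limits:
          amd.com/gpu: 1                   # GPU request; resource name depends on the device plugin
```

For a Deployment or a KServe predictor rather than a bare Pod, the same nodeSelector block sits in the pod template (spec.template.spec).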
[15:16:18] it actually means that even if you request 40Gi and they are available, you won't be able to schedule your pod
[15:16:25] I also wonder why the LLM NS is not mentioned
[15:16:42] oh it is, I fatfingered the search %-)
[15:17:02] I think I may have to delete the pod, since it's still crashlooping
[15:18:43] I think it is ok
[15:19:45] it is starting, downloading the model (which takes some time). We'll probably have failures due to the low readiness probe but it is ok, we'll figure it out
[15:20:08] 2/3 Running
[15:20:13] aaaand OOMKilled
[15:22:07] On Grafana, the limit is still listed as 2Gi for the kserve container
[15:23:04] It is indeed 2Gi, I checked through describe. Can you please delete it manually?
[15:23:15] will do
[15:23:33] and done
[15:23:37] I'm wondering if we need to do sth else to allow it
[15:24:57] it's still req'ing 2Gi
[15:26:47] aaa let me bring some tea first. I'm sure it will help
[15:29:51] Ok, so I bumped the mem to 21Gi in the file as checked out on the deployment server, but the diff was still empty. So the YAML as-is does not apply
[15:31:13] I think I found it
[15:31:41] yep, found it, sending a patch
[15:32:31] nice!
[15:33:38] So the Container: line was missing, and I also normalized the indentation to match the rest of the services
[15:34:06] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/990139
[15:34:41] That version shows a diff when using kubectl -n .... diff
[15:35:41] ouch! my apologies, such a bad copy-paste from my side..
[15:35:56] Nah, it's fine. YAML is a harsh beast in these matters
[15:36:10] Would I prefer JSON? Also no :)
[15:37:39] applying
[15:39:50] oh...kay? The pod is gone entirely?
[15:41:41] yes... 🤷 I'm looking..
[15:42:18] `fails to reconcile predictor: fails to update knative service: Operation cannot be fulfilled on services.serving.knative.dev "falcon-7b-instruct-gpu-predictor-default": the object has been modified; please apply your changes to the latest version and try again`
[15:43:25] yes, found it
[15:43:42] if you check the events the reason is a bit higher up
[15:43:56] ```Error creating: pods "falcon-7b-instruct-gpu-predictor-default-00002-deployment-jh28g" is forbidden: maximum cpu usage per Pod is 10, but limit is 11```
[15:44:30] But we're only requesting 8?
[15:45:10] Ah, maybe the other containers in the pod request enough CPU to add to those 8
[15:45:20] yes. there are 3 containers
[15:45:28] Yeah, 8+3=11
[15:46:04] Do you want to try with 7 for the kserve container or bump the global limit?
[15:46:10] can you manually change cpu to 6? and I'll send a patch for that. I think we can increase the limitranges for this
[15:46:34] ack, will check locally if 6 makes it work
[15:47:03] we wrote at the same time. I don't mind either way. I'd bump the global limit since we're going to be experimenting
[15:47:28] Yeah, sgtm
[15:51:01] Another OOMKill
[15:52:04] But looking at Grafana, I think it may be the storage-initializer that is running out of memory
[15:53:15] storage-initializer seems to succeed. it just downloads the model
[15:53:35] The other containers don't go anywhere near their memlimits, though
[15:53:45] we get an OOM on model loading, which is weird given that we just got the 20Gi of memory on the pod
[15:54:22] So it's definitely the ks container that ooms, not the s-i one?
[15:56:35] Ok, now I see the ks container hitting the limit
[15:57:12] I vaguely remember we used to have the problem of using twice as much memory as actually necessary in the past, could this be that?
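To make the admission error above concrete, here is a hedged sketch of how a namespace LimitRange with a Pod-level maximum interacts with per-container limits. The namespace, object names, images, and sidecar figures are placeholders chosen only to reproduce the 8 + 3 = 11 > 10 situation discussed in the log; they are not the real deployment-charts values.

```yaml
# Hedged sketch of a namespace LimitRange (placeholder names and values).
# The admission controller sums the limits of ALL containers in a pod and
# rejects the pod if the total exceeds the Pod-level max, which is exactly the
# "maximum cpu usage per Pod is 10, but limit is 11" error seen above.
apiVersion: v1
kind: LimitRange
metadata:
  name: example-limits         # placeholder name
  namespace: llm               # placeholder namespace
spec:
  limits:
    - type: Pod                # cap on the whole pod (sum over containers)
      max:
        cpu: "10"
        memory: 22Gi
    - type: Container          # cap on any single container
      max:
        cpu: "8"
        memory: 20Gi
---
# A pod like this is rejected: 8 (kserve) + 2 + 1 (sidecars, placeholder
# figures) = 11 CPUs, which exceeds the Pod max of 10. Lowering the kserve
# limit to 6-7 or raising the Pod max in the LimitRange both resolve it.
apiVersion: v1
kind: Pod
metadata:
  name: falcon-7b-instruct-gpu-example
  namespace: llm
spec:
  containers:
    - name: kserve-container
      image: example/llm-server:latest     # placeholder image
      resources:
        limits: { cpu: "8", memory: 20Gi }
    - name: queue-proxy                    # sidecar, placeholder figures
      image: example/queue-proxy:latest
      resources:
        limits: { cpu: "2", memory: 1Gi }
    - name: istio-proxy                    # sidecar, placeholder figures
      image: example/istio-proxy:latest
      resources:
        limits: { cpu: "1", memory: 1Gi }
```

The sidecar names and figures are only assumptions about where the other 3 CPUs come from; the point is the per-pod sum that the LimitRange enforces.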
[15:59:02] that's true, but that is the `low_cpu_mem_usage` argument, which we have set to True
[16:01:41] Ok, then I'm unsure what's going on there
[16:06:49] me2. I'll dig a bit more. thanks for all the help Tobias!
[16:07:20] np, if anything is needed with admin-ng (or otherwise), just ping
[16:38:06] as time passes it starts to sound like a Monday-morning problem for Ilias :)
[16:57:19] Sometimes a call like that must be made
[17:00:58] I'm making that call and heading out
[17:01:10] Alright, enjoy the weekend! Talk to you on Monday
[17:02:30] Cu! Have a nice weekend folks!
[17:13:26] (CR) DannyS712: extension.json: Drop RL targets definitions, no longer honoured (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/990097 (https://phabricator.wikimedia.org/T328497) (owner: Jforrester)
[17:24:48] Heading out now as well
[17:59:04] Machine-Learning-Team, ORES, All-and-every-Wikisource, ArticlePlaceholder, and 54 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Iniquity)
[19:12:43] (PS1) Jforrester: build: Update MediaWiki requirement to 1.42 [extensions/ORES] - https://gerrit.wikimedia.org/r/990237
[19:20:04] (CR) DannyS712: [C: +2] build: Update MediaWiki requirement to 1.42 [extensions/ORES] - https://gerrit.wikimedia.org/r/990237 (owner: Jforrester)
[19:31:15] (CR) CI reject: [V: -1] build: Update MediaWiki requirement to 1.42 [extensions/ORES] - https://gerrit.wikimedia.org/r/990237 (owner: Jforrester)
[20:36:48] (Merged) jenkins-bot: build: Update MediaWiki requirement to 1.42 [extensions/ORES] - https://gerrit.wikimedia.org/r/990237 (owner: Jforrester)
[20:37:15] artificial-intelligence, Research-Freezer, Research-management: Review recommendations in the Toronto Declaration on human rights and artificial intelligence - https://phabricator.wikimedia.org/T197683 (leila) The original scope of this task was beyond the work of the Research team. I'm going to decl...
[20:37:23] artificial-intelligence, Research-Freezer, Research-management: Review recommendations in the Toronto Declaration on human rights and artificial intelligence - https://phabricator.wikimedia.org/T197683 (leila) Open→Declined
[22:32:26] (CR) Umherirrender: extension.json: Drop RL targets definitions, no longer honoured (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/990097 (https://phabricator.wikimedia.org/T328497) (owner: Jforrester)
[22:32:29] (PS2) Umherirrender: extension.json: Drop RL targets definitions, no longer honoured [extensions/ORES] - https://gerrit.wikimedia.org/r/990097 (https://phabricator.wikimedia.org/T328497) (owner: Jforrester)
[22:32:52] (CR) Umherirrender: [C: +2] extension.json: Drop RL targets definitions, no longer honoured [extensions/ORES] - https://gerrit.wikimedia.org/r/990097 (https://phabricator.wikimedia.org/T328497) (owner: Jforrester)
[22:52:45] (Merged) jenkins-bot: extension.json: Drop RL targets definitions, no longer honoured [extensions/ORES] - https://gerrit.wikimedia.org/r/990097 (https://phabricator.wikimedia.org/T328497) (owner: Jforrester)