[07:08:30] hello folks!
[07:08:51] Serobot seems to be in the process of migrating to Lift Wing and Revert Risk! https://github.com/dennistobar/serobot/issues/5
[07:08:57] first bot on Lift Wing :)
[07:09:07] seems a low traffic one, but really great milestone
[07:27:17] \o
[07:27:40] Yes, nice milestone. Wonder if we should give Dennis a symbolic reward of some kind :)
[07:56:43] o/
[08:09:11] (CR) Ilias Sarantopoulos: [C: +2] fix: rename bloom to llm [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928591 (owner: Ilias Sarantopoulos)
[08:15:52] (CR) CI reject: [V: -1] fix: rename bloom to llm [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928591 (owner: Ilias Sarantopoulos)
[08:26:34] ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device:
[08:26:48] this is what hashar tried to fix last time :(
[08:28:05] ouch
[08:28:27] klausman: very good point, I think that we can ask Chris to send a wiki gadget
[08:34:26] I issued a rebuild but doubt it would work. how can we address this? (we are referring to failing CI above)
[08:35:10] I think that hashar needs to clean up the node, let's see how this run works
[08:35:22] the long term plan IIUC is to upgrade the instances with more disk
[08:35:25] but it will take some time
[08:49:50] (CR) Elukey: [C: +1] "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/928591 (owner: Ilias Sarantopoulos)
[09:34:56] hmm, it failed to publish the image
[09:40:37] weird, the image was just published though
[09:41:57] however I don't see it in the registry https://docker-registry.wikimedia.org/
[09:42:28] isaranto: so https://integration.wikimedia.org/ci/job/trigger-inference-services-pipeline-llm-publish/1/console failed due to a timeout, but if you check the logs it completed
[09:42:56] I think there was a race condition.. our docker registry updates the images periodically in the UI, so it may not appear straight away
[09:43:10] ok then
[09:43:37] 'publishedImage':'docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-llm:2023-06-09-090444-publish'
[09:45:43] isaranto: wow falcon??
[09:45:50] ack! I just updated the patch -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/927611
[09:46:12] no? worth a try
[09:46:17] yes yes :)
[09:47:00] the issue with these "small" tests is that everything ends up breaking so they interrupt our work :)
[09:47:25] isaranto: directly with the gpu, without the "-gpu" suffix etc..?
[09:47:31] not against it, asking if it was the intent
[09:48:01] u mean the -gpu in the suffix name?
[09:48:06] yep
[09:48:32] I'll add it for consistency
[09:48:36] klausman: o/ after Ilias deploys the above we won't need the old bloom images anymore in the registry, do you have time to clean them up?
[09:48:43] can't recall if we did it together in the past or not
[09:49:22] I can do that. I'll have to chase some docs, but that shouldn't be too hard. What is the specific criterion for removing them?
[09:49:53] regarding the specific ones: they are not used and they are too big
[09:50:52] klausman: https://wikitech.wikimedia.org/wiki/Docker-registry#Deleting_images - in this case I think we could just drop bloom*
[09:51:12] alrighty, will do
[09:51:19] IIRC we also changed the name to ores-legacy images, the old ones were called ores-migration or similar?
[09:51:22] we should drop those too
[09:53:39] isaranto: merged
[09:54:31] yes the old ones were ores-migration so all these can be removed as well -> https://docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-ores-migration/tags/
[09:54:41] So basically these: https://docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-bloom/tags/ and https://docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-ores-migration/tags/
[09:54:51] exactly!
[09:56:40] Alright, let me know when I can proceed
[10:18:49] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (kostajh)
[10:24:30] the falcon one managed to get scheduled but there are issues with the bloom ones
[10:25:08] actually only with the bloom-3b-gpu for now
[10:25:51] checking
[10:29:52] now falcon is failing as well in the kserve-container while loading the model (perhaps it doesn't fit in GPU) but can't see any relevant error
[10:31:03] weird
[10:34:25] isaranto: cleaned up a little, it seems that bloom-3b-gpu still holds the gpu
[10:34:28] did we remove it?
[10:34:46] ah no snap
[10:34:57] ok so now falcon is waiting for the gpu
[10:34:58] no we didn't
[10:35:14] but falcon was scheduled and started so it got the gpu right?
[10:35:59] it is currently downloading the model in the storage-initializer container (again)
[10:36:12] hopefully yes now, maybe it was blocked by a rogue 3b version
[10:36:35] but I will remove it since it is failing again and again
[10:37:00] it wasn't blocked, it has already done this 2-3 times
[10:39:34] 0/10 nodes are available: 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 7 Insufficient amd.com/gpu.
[10:40:10] is disk pressure I/O or disk space?
[10:40:58] I know, but previously it started and failed while loading the model to gpu
[10:41:24] yes yes but then it goes into disk-pressure when it fails, not sure why
[10:41:30] ack
[10:41:44] is there a grafana dashboard where we can check pod resources? I don't have access to kubectl top..
[10:42:25] not sure if we have anything for disk pressure
[10:43:27] so far we saw this behavior when the pod had troubles with the GPU
[10:43:42] how big is the model? Does it fit in the GPU's ram?
[10:46:08] it is 14.4 GB so it may not fit
[10:46:21] ahh okok, then it may make sense
[10:46:30] I suggest to just remove the deployment for now
[10:46:45] at this point yes, sadly we have probably reached a limit
[10:46:57] good info for when we'll buy gpus
[10:47:14] going out for lunch!
[10:50:11] klausman: all the aforementioned docker images can be deleted now
[10:53:10] alright, will do
[10:53:44] Well, that didn't work: requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://docker-registry.wikimedia.org/v2/wikimedia/machinelearning-liftwing-inference-services-ores-migration/manifests/sha256:2761776161eed83147e2c162db0153037c28176abe52f831697b9d7dcd395474
[10:55:00] Same for the bloom ones
[10:55:59] Gotta head to an appointment now, ttyl
[10:59:11] o/
[10:59:36] I have one final idea before we kill the falcon deployment.
[11:04:45] try to load a distilled version which will use 8bit integers rather than 16bit floats. will hurt the performance but if this transformation is done before it loads the whole model it would work
[11:17:22] klausman: did you sudo -i? I think that the client needs auth creds to work
[11:50:06] unfortunately it seems that my idea won't work cause.... cuda.. https://pypi.org/project/bitsandbytes/#description
[12:09:04] * isaranto lunch
[12:33:13] Machine-Learning-Team, Moderator-Tools-Team: Retrain revert risk models on a regular basis via moderator false positive reports - https://phabricator.wikimedia.org/T337501 (Samwalton9)
[12:34:21] Machine-Learning-Team, Moderator-Tools-Team: Retrain revert risk models on a regular basis via moderator false positive reports - https://phabricator.wikimedia.org/T337501 (Samwalton9)
[12:35:08] Machine-Learning-Team, Moderator-Tools-Team: Retrain revert risk models on a regular basis via moderator false positive reports - https://phabricator.wikimedia.org/T337501 (Samwalton9) Tagging #machine-learning-team so that we can explore what it would take to set this up.
[13:20:55] tried to drop the images, errors as well..
[13:20:58] asked SRE
[13:20:59] weird
[14:04:55] elukey: yeah, used sudo
[14:07:23] serviceops is seeing the same issue, weird
[14:48:48] Who is the SRE team running the docker infra?
[14:53:51] service ops
[14:54:01] docker registry infra I mean
[14:54:59] ah, ack. Are they looking into it?
[14:56:57] I don't think so, maybe we need a task
[14:57:07] and we could investigate ourselves to help them
[14:58:04] I'll investigate some
[15:00:06] Hm, the FAQ on the runbook for d-r reads:
[15:00:14] I need to delete an image from the registry
[15:00:16] You need to ssh in any registry instance and delete the objects that belongs to the image from the swift container, this should not be done unless there are good reasons to do it (security incident for instance).
[15:00:32] Was the script I tried added after this? I.e. is the FAQ out of date?
[15:01:50] what FAQ are you referring to?
[15:04:31] https://wikitech.wikimedia.org/wiki/Docker-registry/Runbook
[15:04:58] I looked at it in the hopes of finding a log location to look at, but found nothing
[15:06:00] well there are two different things
[15:06:14] The wild thing is that the URL mentioned in the error 401ing works just fine in the browser on my workstation
[15:06:15] 1) docker registry ctl
[15:06:19] 2) the docker registry itself
[15:06:58] IIUC with your runbook there is an indication about what to do when dealing with the docker registry, deleting things from the swift container seems something to be done only for maintenance purposes
[15:07:04] I also wondered if I was running on the wrong machine, but AFAICS, build2001 is the only build hist
[15:07:15] yes but it runs the ctl tool
[15:07:16] host*
[15:07:20] not the docker registry
[15:07:29] yeah, ack
[15:09:10] I'll open a task
[15:09:27] the registry* nodes run the docker registry IIRC
[15:09:31] you should find logs on them
[15:10:00] are the registry* hosts the right ones?
[15:10:16] yes I think so, you can check in puppet's site.pp for confirmation
[15:12:20] Machine-Learning-Team, serviceops: Can't delete images from docker registry (from build2001 using docker-registryctl) - https://phabricator.wikimedia.org/T338623 (klausman)
[15:15:53] So nginx logs the 401 in the access log, but no errors in error.log. So it's not that nginx or its config are broken
[15:19:13] Ah, found it.
[15:19:25] elukey: another case of me not reading the docs properly
[15:21:21] Machine-Learning-Team, serviceops: Can't delete images from docker registry (from build2001 using docker-registryctl) - https://phabricator.wikimedia.org/T338623 (klausman) Open→Invalid This was caused by me using the wrong host. What I _should_ have used: `docker-registryctl delete-tags docke...
[15:23:13] elukey: used wikimedia.org domain on the registry name, instead of discovery.wmnet
[15:23:22] with the right domain it works fine
[15:23:42] ah snap wow
[15:23:49] can you update the docs with a big warning??
[15:23:53] good finding!
[15:24:38] on it already
[15:24:46] <3
[15:26:06] and done.
[15:26:45] elukey: I already closed the task, who did you talk to in serviceops? (or I can tell them about what was going on)
[15:27:05] already pinged cgoubert on the k8s chan, mentioning that you fixed it
[15:27:14] roger!
[15:39:16] going afk for the weekend folks! o/
[15:39:50] \o
[16:01:29] o/
[16:01:50] Hopefully only some config changes are left for the ores extension so we can deploy it per wiki
[16:02:02] Going afk as well o/
[16:33:25] Machine-Learning-Team, artificial-intelligence, Research: [Epic] Article importance prediction model - https://phabricator.wikimedia.org/T155541 (Isaac) Open→Declined I'm going to set the status of this to Declined but other folks should feel free to take it on if desired. After working in th...
[17:40:32] Machine-Learning-Team, Moderator-Tools-Team: Retrain revert risk models on a regular basis via moderator false positive reports - https://phabricator.wikimedia.org/T337501 (diego) The easiest (probably not the best) solution I can imagine is to have an app to report errors in a structured format (eg. Revisi...
[18:31:42] (CR) Ladsgroup: "This is almost ready. nice!" [extensions/ORES] - https://gerrit.wikimedia.org/r/915541 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[20:40:27] Machine-Learning-Team, ORES, Patch-For-Review, Platform Team Initiatives (New Hook System): Update ORES to use the new HookContainer/HookRunner system - https://phabricator.wikimedia.org/T338444 (Umherirrender)
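Note on the 10:43-11:04 exchange about GPU memory: the reasoning behind "it is 14.4 GB so it may not fit" and the 8-bit idea can be sketched as rough arithmetic. This is only an illustration, not something run in the channel. The 14.4 GB size and the 16-bit/8-bit widths come from the log; the GPU memory size (`GPU_MEMORY_GB`) and the headroom fraction are hypothetical placeholders, since the log never states them, and the function names are made up for this sketch.

```python
def estimate_params(size_gb: float, bits_per_param: int) -> float:
    """Rough parameter count implied by a checkpoint size at a given width."""
    return size_gb * 1e9 / (bits_per_param / 8)


def weight_footprint_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (activations not included)."""
    return n_params * (bits_per_param / 8) / 1e9


def fits(footprint_gb: float, gpu_memory_gb: float, headroom: float = 0.2) -> bool:
    """True if the weights fit while leaving a fraction of memory free.

    The 20% default headroom is an assumption standing in for activations,
    runtime buffers and allocator overhead.
    """
    return footprint_gb <= gpu_memory_gb * (1.0 - headroom)


MODEL_SIZE_GB = 14.4   # from the log
GPU_MEMORY_GB = 16.0   # hypothetical: not stated anywhere in the log

n_params = estimate_params(MODEL_SIZE_GB, bits_per_param=16)
fp16_gb = weight_footprint_gb(n_params, bits_per_param=16)
int8_gb = weight_footprint_gb(n_params, bits_per_param=8)

print(f"approx. parameters: {n_params:.2e}")
print(f"16-bit weights: {fp16_gb:.1f} GB, fits: {fits(fp16_gb, GPU_MEMORY_GB)}")
print(f"8-bit weights:  {int8_gb:.1f} GB, fits: {fits(int8_gb, GPU_MEMORY_GB)}")
```

Halving the width halves the weight footprint, which is why the 8-bit route looked promising. The estimate says nothing about whether a given quantization library runs on the hardware, which is the obstacle raised at 11:50.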