[00:04:25] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), Patch-For-Review, User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (Etonkovidova) Open→Resolved Checked `tumwiki`, `elwiki`, and `dawiki` - "Add a link" feature seem to be...
[00:56:32] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (Jdforrester-WMF)
[07:01:22] Machine-Learning-Team, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Marostegui)
[07:01:55] Machine-Learning-Team, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Marostegui) @jcrespo kindly check what is needed for backup involved hosts, thanks!
[07:29:28] Machine-Learning-Team, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (jcrespo)
[07:30:35] Machine-Learning-Team, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (jcrespo) >>! In T335042#8795210, @Marostegui wrote: > @jcrespo kindly check what is needed for backup involved hosts, thanks! Done.
[08:20:47] Machine-Learning-Team, Add-Link, Growth-Team, Thai-Sites, User-notice: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142 (kevinbazira) a:kevinbazira
[08:23:44] isaranto: o/ green light to test the ores-legacy deploy to staging whenever you want :)
[08:36:30] \o/ thank you
[09:00:15] I am getting an error when I try to diff/sync https://phabricator.wikimedia.org/P47264
[09:13:01] ah interesting!
[09:13:34] this is the first service-ops-like service that we deploy, we are missing some RBAC rules probably
[09:15:40] yeah I got what the issue is
[09:17:57] (checking a couple of things)
[09:26:15] isaranto: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/910429/ should fix it
[09:26:23] I missed this bit yesterday
[09:26:51] so the cluster role to use should be 'deploy' not 'deploy-kserve'
[09:28:57] aha thanks for the info
[09:29:20] every day I understand something more
[09:29:27] hopefully I'll get there :)
[09:31:34] isaranto: don't worry it is the same for all of us, I clearly forgot this bit :D
[09:31:56] but I learned that usually when I see any kind of permission issue it is due to a pebcak in RBAC rules
[09:35:59] isaranto: green light to re-test the deployment
[09:36:08] ah no wait sorry
[09:36:17] Error: UPGRADE FAILED: release namespaces failed, and has been rolled back due to atomic being set: cannot patch "deploy" with kind RoleBinding: RoleBinding.rbac.authorization.k8s.io "deploy" is invalid: roleRef: Invalid value: rbac.RoleRef{APIGroup:"rbac.authorization.k8s.io", Kind:"ClusterRole", Name:"deploy"}: cannot change roleRef
[09:36:25] whatttt
[09:43:14] ok I had to delete the rolebinding first
[09:43:15] weird
[09:43:19] isaranto: green light :)
[09:43:39] did u change anything else?
[09:43:46] now it works 🎉
[09:44:25] nono just deleted the rolebinding and re-deployed the namespace settings
[09:44:46] and the pod is up!
[09:45:25] yep!
[09:45:56] isaranto: I think that we forgot to set the mesh values, I don't see the tls-proxy container
[09:46:14] (so in theory now it cannot call lift wing)
[09:46:49] do we have to setup LVS ?
[09:46:56] or can we access the api?
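The `cannot change roleRef` failure at 09:36 comes from Kubernetes treating a RoleBinding's `roleRef` as immutable, which is why deleting the binding and re-deploying worked. A minimal sketch of that decision follows; the `RoleRef` class and `plan_rolebinding_update` function are hypothetical names for illustration, not part of helm or of the deployment-charts repo.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RoleRef:
    """The three fields of a RoleBinding's roleRef (hypothetical helper type)."""
    api_group: str
    kind: str
    name: str


def plan_rolebinding_update(current: Optional[RoleRef], desired: RoleRef) -> List[str]:
    """Return the ordered actions needed to reach the desired binding."""
    if current is None:
        # Nothing deployed yet: a plain create is enough.
        return ["create"]
    if current != desired:
        # roleRef is immutable, so a patch would be rejected with
        # "cannot change roleRef"; the binding has to be recreated.
        return ["delete", "create"]
    # Same roleRef: other fields (e.g. subjects) can be patched in place.
    return ["patch"]


old = RoleRef("rbac.authorization.k8s.io", "ClusterRole", "deploy-kserve")
new = RoleRef("rbac.authorization.k8s.io", "ClusterRole", "deploy")
print(plan_rolebinding_update(old, new))
```

With the two role names from the log, the plan is delete then create, matching what was done by hand at 09:43.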
[09:47:06] nono the tls-proxy is a sidecar to call other apis
[09:47:42] I didn't phrase it correctly, my question was irrelevant to the liftwing thing
[09:48:41] we should be able to test it without lvs
[09:49:05] (PS1) Ilias Sarantopoulos: Delete some files [extensions/ORES] - https://gerrit.wikimedia.org/r/910436
[09:49:14] but one thing that I don't recall is if we discussed how to call lift wing
[09:49:28] the tls-proxy sidecar will expose a localhost:6031/etc.. endpoint
[09:49:37] but the ores-legacy code needs to be aware of it
[09:49:49] so not calling inference.discovery.wmnet directly
[09:49:58] (CR) CI reject: [V: -1] Delete some files [extensions/ORES] - https://gerrit.wikimedia.org/r/910436 (owner: Ilias Sarantopoulos)
[09:50:03] I don't recall if it is already configurable
[09:51:35] I mean how can I access the ores-legacy endpoint?
[09:53:28] I am not 100% sure yet, while I look for it let's fix the mesh, this is my point :)
[09:55:05] ok ok
[09:55:17] sry was trying to understand
[09:55:21] thanks for all the help
[09:55:40] nono no need sorry I wasn't clear, it is new for me too so I was trying to reason out loud :)
[09:58:01] Morning!
[09:58:05] (just barely)
[09:58:19] hello :)
[09:58:41] Protip: when your phone is at 2% battery and you hook it up for charging before going to sleep, check that it's actually charging, otherwise you may not get an alarm in the morning :D
[10:01:52] (PS1) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[10:02:25] klausman: thanks for the tip, I do this all the time!
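The point raised at 09:49 (ores-legacy has to know about the sidecar's local endpoint instead of calling inference.discovery.wmnet directly) could be handled by making the Lift Wing base URL configurable. A minimal sketch under stated assumptions: the `LIFTWING_URL` variable name, the `https`/`http` schemes and the `build_request` helper are all hypothetical, and the request path is left to the caller because the log elides it.

```python
import os
from typing import Dict, Mapping, Optional, Tuple

# Hypothetical setting name; the log only says the endpoint should be configurable.
LIFTWING_URL_ENV = "LIFTWING_URL"
# Assumed default (scheme is a guess): call Lift Wing directly.
DEFAULT_LIFTWING_URL = "https://inference.discovery.wmnet"


def liftwing_base_url(env: Mapping[str, str] = os.environ) -> str:
    """Return the configured base URL without a trailing slash."""
    return env.get(LIFTWING_URL_ENV, DEFAULT_LIFTWING_URL).rstrip("/")


def build_request(
    path: str,
    host_header: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Dict[str, str]]:
    """Build the (url, headers) pair for a call to Lift Wing."""
    url = liftwing_base_url(env) + "/" + path.lstrip("/")
    headers: Dict[str, str] = {}
    if host_header:
        # Optional: set when routing depends on the Host header
        # rather than on the address being dialled.
        headers["Host"] = host_header
    return url, headers


# With the sidecar in place, only the setting changes, not the code:
print(build_request("/some/path", env={"LIFTWING_URL": "http://localhost:6031"}))
```

Keeping the base URL in configuration means the same code runs with or without the tls-proxy container in the pod.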
[10:02:26] haha
[10:02:28] I envy a lot the super power of sleeping after the alarm clock, cannot do it :(
[10:02:40] (I mean the time of the alarm clock)
[10:02:47] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953) (owner: Ilias Sarantopoulos)
[10:02:52] I wake up before the alarm clock
[10:02:59] so I can verify that it works :P
[10:03:22] plz nevermind the multiple patches above --^
[10:03:27] everything is WIP
[10:03:46] elukey: for me it depends, if I'm in a good workday rhythm, I would wake 30m later at the latest. But my body is still in "wild mode" due to the two weeks of PTO :)
[10:41:28] isaranto: sorry took me a bit but https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/910442
[10:41:47] in theory we should be able to use the inference-staging lvs endpoint, since it points to the ingress pods
[10:41:52] but using a different Host header
[10:42:01] not 100% great but we may have to do it
[10:42:16] I didn't think about it before
[10:43:56] Ack
[10:45:04] I am testing the minimum set of rocm packages to make tf working, the docker image size is 14G
[10:45:07] sigh
[10:45:12] christ.
[10:45:31] Can you share it somehow? I'd like to take a look (and try an insane experiment :))
[10:45:58] I am building it now, it will be published to the docker registry
[10:46:05] but you can build it locally
[10:46:11] with docker-pkg
[10:50:15] Is the dockerfile already in the integration/config repo?
[10:52:00] I added it to production-images, so it doesn't go through blubber etc..
[10:52:10] ah, right
[11:02:09] ok cert for staging created, but istio still doesn't show me the new route for ores-legacy
[11:02:12] mmmm
[11:04:53] ok so it seems there, but I don't know how to use it (yet)
[11:05:00] going out for lunch, will check later :)
[12:41:27] Going afk for 1-2 hrs.
Lunch and return home since there is a power outage in the co-working. lol. Will work in the afternoon
[12:58:57] elukey: is it normal that the d/l from the wmf mirrors is super slow? I get like 5mbps tops
[12:59:36] specifically on rocm-llvm
[12:59:59] klausman: there are surely some rate limits in place to limit the bew
[13:00:00] *bw
[13:01:53] Machine-Learning-Team, Add-Link, Growth-Team, Thai-Sites, User-notice: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142 (kevinbazira)
[13:02:01] for the internal nodes it should be way less restrictive
[13:02:11] (and it kinda makes sense to limit external bw)
[13:02:47] yeah, especially for apt
[13:05:11] Machine-Learning-Team, Add-Link, Growth-Team, Thai-Sites, User-notice: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142 (kevinbazira) 19/19 models were trained successfully in the 16th round of wikis.
[13:22:38] Machine-Learning-Team: Investigate procuring and installing GPUs on Lift Wing - https://phabricator.wikimedia.org/T327923 (elukey)
[14:03:09] elukey: one thing re: rocm et al I wonder about is: there is a whole llvm toolchain in there, and I don't think all of it is needed (or if it is at startup, the artifacts should be cacheable, no?)
[14:03:55] I'm also currently upx'ing every elf binary I can find, to see how much difference in image size that'll make. It's a bit crazy, but if the gain is large, might be worth it.
[14:12:23] Hm. 14G -> 12G. Maybe not worth it
[14:43:01] Back!
[14:46:10] isaranto: you have more experience with GPU-accelerated models than I do. Would serving a model using a GPU really need a full llvm/clang pipeline?
[15:01:55] klausman: I have mostly worked with GPU in training where this type of optimization is not needed
[15:02:45] it would be faster (significantly ?) but I doubt if it is needed to start with.
[15:03:00] it could be the next step though
[15:03:04] ack.
[15:05:41] klausman: sorry I was in a meeting - the toolchain is brought in by deb dependencies basically, I tried to trim down what's needed for tensorflow
[15:05:59] not sure what is the rev dep of llvm, but probably something brings it in
[15:06:08] 14G->12G is great, surely worth it
[15:06:43] it takes a while to process, and most of the gain is in the llvm binaries. So if we don't need all of them, that would be a better approach, I think
[15:06:44] never tried to trim elf binaries before, so not sure how we should translate that into a Docker file or similar
[15:07:22] basically you run upx against the binaries and it makes them smaller in-place. It would be a (close to) last step in the build process of the image
[15:07:43] for the llvm binaries, typical reduction was to ~1/3 of original size.
[15:08:03] it strips e.g. debug info and so on. Something we don't need in prod
[15:10:19] yeah but this should be something that the AMD folks should do when releasing packages, probably
[15:10:27] we could open a github issue
[15:11:13] Machine-Learning-Team: Investigate procuring and installing GPUs on Lift Wing - https://phabricator.wikimedia.org/T327923 (elukey) Some info gathered while reading docs :) We have essentially two options: * AMD (GPUs compatible with ROCm) * Nvidia As it came up from T333009, the Nvidia plugin seems superio...
[15:11:39] I wonder if a pure "runtime" package set could be built, only containing what a typical model needs to run (i.e. not training)
[15:13:54] Machine-Learning-Team: Investigate procuring and installing GPUs on Lift Wing - https://phabricator.wikimedia.org/T327923 (elukey) I found some cards to evaluate, a good compromise (in my opinion, please chime in if you have more options!) between price and performances: * https://www.dell.com/en-us/shop/am...
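The "run upx against the binaries as a (close to) last build step" idea from 15:07 could be scripted roughly as follows. This is only a sketch, not the experiment that was actually run: it finds ELF files by their 4-byte magic number and hands each one to the `upx` command, which is assumed to be on the PATH. The helper names and the `/opt/rocm` path in the usage comment are hypothetical.

```python
import os
import subprocess
from pathlib import Path
from typing import Iterable, List

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file


def is_elf(path: Path) -> bool:
    """True if the file starts with the ELF magic number."""
    try:
        with path.open("rb") as fh:
            return fh.read(4) == ELF_MAGIC
    except OSError:
        return False  # unreadable files are skipped


def find_elf_files(root: str) -> List[Path]:
    """Walk root and collect regular, non-symlink ELF files."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            candidate = Path(dirpath) / name
            if candidate.is_file() and not candidate.is_symlink() and is_elf(candidate):
                found.append(candidate)
    return sorted(found)


def compress_in_place(paths: Iterable[Path], upx: str = "upx") -> List[int]:
    """Run upx on each file and return the exit codes."""
    codes = []
    for path in paths:
        # check=False: upx may refuse some files, which should not abort the build.
        result = subprocess.run([upx, "-q", str(path)], check=False)
        codes.append(result.returncode)
    return codes


# Usage (hypothetical path inside the image):
# compress_in_place(find_elf_files("/opt/rocm"))
```

Whether packed shared libraries still load correctly under ROCm would need testing, so trimming unneeded llvm packages first may be the safer gain, as noted at 15:06.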
[15:20:51] klausman: I added a proposal for CapEx on slack, lemme know what you think
[15:20:54] also others :)
[15:21:44] I also made a summary of gpu sharing options in https://phabricator.wikimedia.org/T327923#8796598
[15:41:58] elukey: replied
[16:01:59] thanks!
[16:02:14] going afk, have a nice long weekend folks!
[16:02:31] ciao Luca!
[16:07:03] isaranto: I didn't forget about ores legacy, there is some mess that I don't understand yet, will restart working on it next week :)
[16:10:43] thanks, as I said it is not urgent
[16:11:04] i am stuck in some weird out of memory errors with mediawiki extension
[16:11:06] sigh
[17:18:30] (PS2) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[17:20:47] (PS3) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[17:21:35] (PS4) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[17:22:10] (PS5) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[17:37:58] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953) (owner: Ilias Sarantopoulos)
[18:10:38] Mediawiki extension update: managed to reduce the changes from the duplication hell of 4k lines to 300 buut it now doesn't work :)
[18:10:40] to be continued
[18:10:51] going afk, cu all <3