[09:39:00] https://www.kubeflow.org/docs/components/pipelines/v1/sdk/manipulate-resources/#persistent-volume-claims-pvcs
[09:39:23] I didn't dig deep but it seems that kubeflow offers a way to integrate with PVs
[09:39:35] not sure if optional or not, we'll see
[09:39:44] (to follow up from yesterday's discussion)
[09:41:13] "Request the creation of PVC instances simple and fast" sounds so easy :)
[09:41:47] exactly :D
[09:42:01] I am really scared about Kubeflow to be honest
[09:42:21] I haven't yet checked how many kms of yaml and docker images we'll need to import
[10:01:02] XioNoX: o/
[10:01:10] I have 4 nodes in row E/F to reimage, https://netbox.wikimedia.org/search/?q=ml-serve1&obj_type=
[10:01:15] hello, about to join a meeting
[10:01:19] please leave a message
[10:01:26] shall I proceed or do I need to clear etc.. on the switches?
[10:04:54] I started the reimage of ml-serve1005, let's see how it goes
[10:27:21] I'm sorry to have missed the discussion yesterday about Kubeflow and PVCs. We're really not far from getting this new Ceph cluster ready for testing PVC support.
[10:29:11] elukey: Any chance we could schedule a catch-up to discuss Kubeflow, with whoever is interested?
[10:29:54] btullis: o/ so yesterday we were just wondering what requirements/use cases we have for PVs, and I mentioned that Kubeflow *may* need them, but we are still very far from having a complete picture
[10:30:02] we were more interested in your team's use cases :)
[10:36:46] btullis: you were indeed missed yesterday. It turns out that the other teams currently have either a) no use cases, b) wishlist-type use cases, or c) possible future use cases
[10:36:55] and we were missing DE's input
[10:38:34] OK, cool. Ours are pretty much b's and c's too :-)
[10:42:20] If we take Datahub as an example, it's working today on minikube, so you could say that it's not a current requirement. But in order to deploy it we needed: a MariaDB database on a shared DB cluster, three Ganeti VMs for an OpenSearch cluster, and a standalone VM for running Karapace. None of these were huge, and the upstream chart is/was already optimized for using PVCs for all of these elements.
[11:05:23] elukey: how is it going?
[11:06:38] XioNoX: the first host worked nicely, kicked off the other 3 reimages
[11:24:53] btullis: so, we briefly touched on that subject (datastores vs k8s persistent storage, aka PVCs) too. In broad strokes, datastores (whether one means MariaDB/Cassandra/Elasticsearch or whatever else provides some way of running distributed, e.g. via replication), given some k8s operator, don't necessarily need persistent storage. A PVC would make the operations of managing them faster and more efficient ofc, but it's not a requirement.
[11:25:29] What's more interesting in those cases is who manages those datastores, how they do it, and whether the level of service provided is adequate for the use cases
[11:27:10] Other use cases we have are downloading huge blobs on init and writing them to a local ephemeral disk. The typical workflow would be some ML model, stored in swift/s3, downloaded by an initContainer and unzipped in an emptyDir.
[11:27:38] That's suboptimal of course, and a PVC would make it easier and faster to start up those pods
[11:27:52] and would also decouple the initialization phase of the pod from the deployment of the model
[11:29:35] btw, what's the ownership model for the dependencies of Datahub? I guess if MariaDB misbehaves DE goes to the DBAs for help, but OpenSearch/Karapace? Entirely on DE's shoulders?
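A minimal sketch of the initContainer + emptyDir model-fetch pattern just described, expressed with the kubernetes Python client; the image names, object path, and mount points are hypothetical placeholders rather than real manifests:

```python
# A minimal sketch of the pattern discussed above: an initContainer fetches
# a model archive from object storage into an emptyDir, and the serving
# container reads it from the shared mount. All names here are hypothetical.
from kubernetes import client

# Ephemeral volume shared between the init and serving containers;
# with PVC support this would become a persistentVolumeClaim source instead.
model_volume = client.V1Volume(
    name="model-storage",
    empty_dir=client.V1EmptyDirVolumeSource(),
)

# initContainer: download and unpack the model before the server starts
fetch_model = client.V1Container(
    name="fetch-model",
    image="example/s3-fetcher:latest",  # hypothetical image
    command=["sh", "-c",
             "s3cmd get s3://models/my-model.zip /tmp/model.zip && "
             "unzip /tmp/model.zip -d /mnt/models"],
    volume_mounts=[client.V1VolumeMount(name="model-storage",
                                        mount_path="/mnt/models")],
)

# Main container: serves the model from the pre-populated mount
serve_model = client.V1Container(
    name="model-server",
    image="example/model-server:latest",  # hypothetical image
    volume_mounts=[client.V1VolumeMount(name="model-storage",
                                        mount_path="/mnt/models",
                                        read_only=True)],
)

pod_spec = client.V1PodSpec(
    init_containers=[fetch_model],
    containers=[serve_model],
    volumes=[model_volume],
)
```

With a PVC-backed volume, the fetch step could be skipped whenever the model is already present, which is the startup-time and decoupling win mentioned above.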
[11:29:50] akosiaris: Yes, agreed. Nothing we have in the DE team, even on the horizon, has any hard requirement for a PVC. Also yes, running a reliable datastore in pods is not without its challenges either. Fully agree.
[11:32:51] akosiaris: Ownership model, yes, similarly you've hit the nail on the head. The ownership is pretty much down to DE only. We've had a similar issue with Airflow recently.
[11:33:54] And that's added to all the things DE already owns. Ouch.
[11:34:03] The current version of Airflow (2.1) is compatible with MariaDB. The next version of Airflow (2.5) is compatible only with MySQL 8+ or PostgreSQL.
[11:34:31] wow
[11:34:42] 😮
[11:34:49] I did not see that coming
[11:34:49] Specifically incompatible with MariaDB, so our only option is to expand our estate with some PostgreSQL servers, which again fall to us to own and manage.
[11:35:10] ah, that explains what I've seen recently. I've been meaning to ask about that
[11:35:51] thanks for sharing that context, I wasn't aware
[11:36:57] We currently manage 5 single-tenant Airflow instances for different teams, so they are sharing a physical PostgreSQL server + 1 replica. We have backups, but the Data Persistence team (quite reasonably) doesn't want to provide any guarantees on backup recovery objectives, so we look after that.
[11:38:26] Admittedly, running distributed PostgreSQL clusters under Kubernetes with PVCs isn't going to be 'a doddle', but using Ceph for this kind of requirement/wishlist item is part of a plan to try to reduce the count of physical servers and VMs that we have to manage as a team.
[11:45:40] Correction: SQLite is another option for Airflow 2.5, but you get the idea. That wasn't going to work for us right now.
[11:50:43] I think I get the idea. I am not sure managing an entire Ceph cluster vs managing VMs will end up being less work-intensive for DE, but that remains to be seen. I am definitely interested in the effort and wanna try out failure scenarios for some workloads of our own.
[12:07:06] We're also looking at making use of the S3 gateway feature, if anything even more than the PVC support. The hope is that over time dse-k8s + Ceph might completely replace Hadoop, which would bring our workload down. Here are some slides that I blithered over yesterday on the Ceph progress: https://docs.google.com/presentation/d/1Lbsn2fEDH0eJqlgWAM_gyOkKXsaE84789DsT_MOJwQo/edit
[15:14:02] XioNoX: the funny thing with ml-serve100[5-8] is that ml-serve1005 reimaged fine, while 1006/7/8 show a weird behavior - the hosts seem to be able to PXE (so they don't ever get into d-i)
[15:14:06] all of them in E/F
[15:14:26] elukey: could it be a NIC upgrade issue?
[15:14:36] as in, do they need to have their NIC upgraded?
[15:14:54] ah no idea, when we reimaged them months ago nothing weird happened
[15:15:04] do you mean running the cookbook to upgrade the NIC's firmware?
[15:18:01] sometimes that's the issue
[15:18:14] but can't say for sure for now
[15:18:40] and you mean DON'T seem to be able to PXE?
[15:22:28] switch dhcp-relay config seems fine, so the next step is to tcpdump around to see if the install server gets the queries and replies
[15:22:43] do you get anything on the console?
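A sketch of what that watching for DHCP queries on the install server could look like, written with scapy rather than raw tcpdump; the interface name is a placeholder, and this is an illustration under those assumptions, not the actual command used:

```python
# A rough scapy equivalent of the tcpdump mentioned above: watch the
# install server's interface for incoming DHCP/BOOTP requests and print
# the client MAC of each one. The interface name is a placeholder.
from scapy.all import sniff
from scapy.layers.dhcp import BOOTP

def show_request(pkt):
    if BOOTP in pkt:
        # chaddr is a 16-byte field; the first 6 bytes hold the client MAC
        mac = pkt[BOOTP].chaddr[:6].hex(":")
        print(f"BOOTP/DHCP from {mac}: {pkt.summary()}")

# UDP ports 67/68 carry DHCP/BOOTP traffic; sniffing requires root
sniff(filter="udp and (port 67 or port 68)", iface="eth0", prn=show_request)
```

If the relayed requests never show up here, the problem sits on the host/switch side rather than on the install server.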
[15:23:16] I have to step away for a bit
[15:27:22] I am in a meeting too, we can pair later if you have a min
[15:56:15] ok I tried to manually force PXE while being in `console com2` and I saw a "Media failure - check cable" error
[16:07:22] XioNoX: lemme know if you have a min to check this issue with me
[16:07:56] I don't see any DHCP logs on install1004 in syslog related to the IP/MAC of ml-serve1006
[16:08:22] could it be another example of the clear_dhcp() kind of issue?
[16:10:54] elukey: alright, let's try it
[16:13:07] elukey: I see that it has multiple NICs, is it trying to boot on the correct one?
[16:19:07] XioNoX: good point, I can check in the BIOS config, but it worked before so I'd be puzzled if something changed in the past months
[16:19:21] what do you prefer? Shall I retry to force PXE?
[16:20:01] (checking the BIOS settings in the meantime)
[16:23:01] yeah check the BIOS, the check-cable stuff seems like it's not even trying to pxeboot
[16:44:57] XioNoX: I changed some settings, and now I can see the MAC address of the NIC with physical link speed trying to contact DHCP while doing PXE (no more Media error etc..)
[16:45:04] but it doesn't get any IP
[16:47:40] what's the mac?
[16:48:38] e4:3d:1a:ad:d7:a0
[16:50:47] I am tcpdumping on install1004
[16:56:00] thx
[16:56:12] got sidetracked with the esams DDoS stuff
[16:57:53] ah sorry, we can do it another time
[17:03:36] elukey: I'm back now :)
[17:04:51] <3
[17:07:23] elukey: I'm running a tcpdump, you can start the re-image
[17:07:47] I'll force a PXE run if it is the same
[17:07:51] (quicker)
[17:09:01] yep
[17:10:16] XioNoX: it is trying PXE
[17:10:34] the switch port is not learning any MAC
[17:10:41] so I guess it's trying on the wrong interface?
[17:11:19] wait, now it worked
[17:11:34] ah yep, it learned it
[17:11:39] and I see it
[17:11:43] the dhcp
[17:11:47] whyyyyyyyyyyyyyy
[17:11:52] I tried like 4 times
[17:11:55] I guess it just needed to be watched closely
[17:12:01] ahahhaha
[17:12:02] 4 times since changing the NIC?
[17:12:13] a couple since changing the nic
[17:12:27] (still not in d-i though)
[17:13:55] ok it is in d-i
[17:14:23] rows E and F don't like me these days
[17:18:17] I have two other nodes to reimage, they showed the same behavior
[17:18:25] I'll check the BIOS on them and try PXE multiple times
[17:19:38] you can try the nic upgrade too
[17:19:46] always a good opportunity
[17:22:12] ack, thanks for the support!
[17:22:17] I'll probably come crying for more tomorrow
[17:22:24] in case, please be patient
[17:24:42] no pb, I'll be there
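On the S3 gateway feature mentioned earlier in the log: radosgw exposes the standard S3 API, so existing clients such as boto3 only need to be pointed at the gateway endpoint. A minimal sketch, with a hypothetical endpoint URL and placeholder credentials and names:

```python
# A minimal sketch of talking to a Ceph S3 gateway (radosgw) with boto3.
# The endpoint URL, credentials, and bucket/object names are hypothetical.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.org",  # hypothetical radosgw endpoint
    aws_access_key_id="ACCESS_KEY",          # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

# Buckets and objects behave as they would against AWS S3
s3.create_bucket(Bucket="ml-models")
s3.upload_file("my-model.zip", "ml-models", "my-model.zip")
print(s3.list_objects_v2(Bucket="ml-models")["KeyCount"])
```

The same endpoint-swap approach is what would let object-storage-based workflows (like the model downloads discussed this morning) move from swift/Hadoop toward Ceph without client rewrites.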