[09:39:00] https://www.kubeflow.org/docs/components/pipelines/v1/sdk/manipulate-resources/#persistent-volume-claims-pvcs
[09:39:23] I didn't dig deep but it seems that kubeflow offers a way to integrate with PVs
[09:39:35] not sure if optional or not, we'll see
[09:39:44] (to follow up from yesterday's discussion)
[09:41:13] "Request the creation of PVC instances simple and fast" sounds so easy :)
[09:41:47] exactly :D
[09:42:01] I am really scared about Kubeflow to be honest
[09:42:21] I haven't yet checked how many kms of yaml and docker images we'll need to import
[10:01:02] XioNoX: o/
[10:01:10] I have 4 nodes in row E/F to reimage, https://netbox.wikimedia.org/search/?q=ml-serve1&obj_type=
[10:01:15] hello, about to join a meeting
[10:01:19] please leave a message
[10:01:26] shall I proceed or do I need to clear etc.. on the switches?
[10:04:54] I started the reimage of ml-serve1005, let's see how it goes
[10:27:21] I'm sorry to have missed the discussion yesterday about Kubeflow and PVCs. We're really not far from getting this new Ceph cluster ready for testing PVC support.
[10:29:11] elukey: Any chance we could schedule a catch-up to discuss Kubeflow, with whoever is interested?
[10:29:54] btullis: o/ so yesterday we were just wondering what requirements/use cases we have for PVs, and I mentioned that Kubeflow *may* need them, but we are still very far from having a complete picture
[10:30:02] we were more interested in your team's use cases :)
[10:36:46] btullis: you were indeed missed yesterday. It turns out that the other teams currently have either a) no use cases, b) wishlist-type use cases, or c) possible future use cases
[10:36:55] and we were missing DE's input
[10:38:34] OK, cool. Ours are pretty much b's and c's too :-)
[10:42:20] If we take Datahub as an example, it's working today on minikube, so you could say that it's not a current requirement. But in order to deploy it we needed: a MariaDB database on a shared DB cluster, three Ganeti VMs for an OpenSearch cluster, and a standalone VM for running Karapace. None of these were huge, and the upstream chart is/was already optimized for using PVCs for all of these elements.
[11:05:23] elukey: how is it going?
[11:06:38] XioNoX: the first host worked nicely, kicked off the other 3 reimages
[11:24:53] btullis: so, we briefly touched on that subject (datastores vs k8s persistent storage, aka PVCs) too. In broad strokes, datastores (whether one means MariaDB/Cassandra/Elasticsearch or whatever else provides some way of running distributed, e.g. via replication), given some k8s operator, don't necessarily need persistent storage. A PVC would make the operations of managing them faster and more efficient ofc, but it's not a requirement.
[11:25:29] What's more interesting in those cases is who manages those datastores, how they do it, and whether the level of service provided is adequate for the use cases
[11:27:10] Other use cases we have are downloading huge blobs on init and writing them to a local ephemeral disk. The typical workflow would be some ML model, stored in swift/s3, downloaded by an initContainer and unzipped in an emptyDir.
[11:27:38] That's suboptimal of course, and a PVC would make it easier and faster to start up those pods
[11:27:52] and would also decouple the initialization phase of the pod from the deployment of the model
[11:29:35] btw, what's the ownership model for the dependencies of Datahub? I guess if MariaDB misbehaves DE goes to the DBAs for help, but OpenSearch/Karapace? Entirely on DE's shoulders?
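A minimal sketch of the initContainer + emptyDir model-fetch pattern just described, expressed with the kubernetes Python client; the image names, object path, and mount points are hypothetical placeholders rather than real manifests:

```python
# A minimal sketch of the pattern discussed above: an initContainer fetches
# a model archive from object storage into an emptyDir, and the serving
# container reads it from the shared mount. All names here are hypothetical.
from kubernetes import client

# Ephemeral volume shared between the init and serving containers;
# with PVC support this would become a persistentVolumeClaim source instead.
model_volume = client.V1Volume(
    name="model-storage",
    empty_dir=client.V1EmptyDirVolumeSource(),
)

# initContainer: download and unpack the model before the server starts
fetch_model = client.V1Container(
    name="fetch-model",
    image="example/s3-fetcher:latest",  # hypothetical image
    command=["sh", "-c",
             "s3cmd get s3://models/my-model.zip /tmp/model.zip && "
             "unzip /tmp/model.zip -d /mnt/models"],
    volume_mounts=[client.V1VolumeMount(name="model-storage",
                                        mount_path="/mnt/models")],
)

# Main container: serves the model from the pre-populated mount
serve_model = client.V1Container(
    name="model-server",
    image="example/model-server:latest",  # hypothetical image
    volume_mounts=[client.V1VolumeMount(name="model-storage",
                                        mount_path="/mnt/models",
                                        read_only=True)],
)

pod_spec = client.V1PodSpec(
    init_containers=[fetch_model],
    containers=[serve_model],
    volumes=[model_volume],
)
```

With a PVC-backed volume, the fetch step could be skipped whenever the model is already present, which is the startup-time and decoupling win mentioned above.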
[11:29:50] akosiaris: Yes, agreed. Nothing we have in the DE team, even on the horizon, has any hard requirement for a PVC. Also yes, running a reliable datastore in pods is not without its challenges either. Fully agree.
[11:32:51] akosiaris: Ownership model, yes, similarly you've hit the nail on the head. The ownership is pretty much down to DE only. We've had a similar issue with Airflow recently.
[11:33:54] And that's added to all the things DE already owns. Ouch.
[11:34:03] The current version of Airflow (2.1) is compatible with MariaDB. The next version of Airflow (2.5) is compatible only with MySQL 8+ or PostgreSQL.
[11:34:31] wow
[11:34:42] 😮
[11:34:49] I did not see that coming
[11:34:49] Specifically incompatible with MariaDB, so our only option is to expand our estate with some PostgreSQL servers, which again fall to us to own and manage.
[11:35:10] ah, that explains what I've seen recently. I've been meaning to ask about that
[11:35:51] thanks for sharing that context, I wasn't aware
[11:36:57] We currently manage 5 single-tenant Airflow instances for different teams, so they are sharing a physical PostgreSQL server + 1 replica. We have backups, but the Data Persistence team (quite reasonably) doesn't want to provide any guarantees on backup recovery objectives, so we look after that.
[11:38:26] Admittedly, running distributed PostgreSQL clusters under Kubernetes with PVCs isn't going to be 'a doddle', but using Ceph for this kind of requirement/wishlist item is part of a plan to try to reduce the count of physical servers and VMs that we have to manage as a team.
[11:45:40] Correction: SQLite is another option for Airflow 2.5, but you get the idea. That wasn't going to work for us right now.
[11:50:43] I think I get the idea. I am not sure managing an entire Ceph cluster vs managing VMs will end up being less work-intensive for DE, but that remains to be seen. I am definitely interested in the effort and wanna try out failure scenarios for some workloads of our own.
[12:07:06] We're also looking at making use of the S3 gateway feature, if anything even more than the PVC support. The hope is that over time dse-k8s + Ceph might completely replace Hadoop, which would bring our workload down. Here are some slides that I blithered over yesterday on the Ceph progress: https://docs.google.com/presentation/d/1Lbsn2fEDH0eJqlgWAM_gyOkKXsaE84789DsT_MOJwQo/edit
[15:14:02] XioNoX: the funny thing with ml-serve100[5-8] is that ml-serve1005 reimaged fine, while 1006/7/8 show a weird behavior - the hosts seem to be able to PXE (so they don't ever get into d-i)
[15:14:06] all of them in E/F
[15:14:26] elukey: could it be a NIC upgrade issue?
[15:14:36] as in, do they need to have their NIC upgraded?
[15:14:54] ah no idea, when we reimaged them months ago nothing weird happened
[15:15:04] do you mean running the cookbook to upgrade the NIC's firmware?
[15:18:01] sometimes that's the issue
[15:18:14] but can't say for sure for now
[15:18:40] and you mean DON'T seem to be able to PXE?
[15:22:28] switch dhcp-relay config seems fine, so the next step is to tcpdump around to see if the install server gets the queries and replies
[15:22:43] do you get anything on the console?
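A sketch of what that watching for DHCP queries on the install server could look like, written with scapy rather than raw tcpdump; the interface name is a placeholder, and this is an illustration under those assumptions, not the actual command used:

```python
# A rough scapy equivalent of the tcpdump mentioned above: watch the
# install server's interface for incoming DHCP/BOOTP requests and print
# the client MAC of each one. The interface name is a placeholder.
from scapy.all import sniff
from scapy.layers.dhcp import BOOTP

def show_request(pkt):
    if BOOTP in pkt:
        # chaddr is a 16-byte field; the first 6 bytes hold the client MAC
        mac = pkt[BOOTP].chaddr[:6].hex(":")
        print(f"BOOTP/DHCP from {mac}: {pkt.summary()}")

# UDP ports 67/68 carry DHCP/BOOTP traffic; sniffing requires root
sniff(filter="udp and (port 67 or port 68)", iface="eth0", prn=show_request)
```

If the relayed requests never show up here, the problem sits on the host/switch side rather than on the install server.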
[15:23:16] I have to step away for a bit
[15:27:22] I am in a meeting too, we can pair later if you have a min
[15:56:15] ok I tried to manually force PXE while being in `console com2` and I saw a "Media failure - check cable" error
[16:07:22] XioNoX: lemme know if you have a min to check this issue with me
[16:07:56] I don't see any DHCP logs on install1004 in syslog related to the IP/MAC of ml-serve1006
[16:08:22] could it be another example of the clear_dhcp() kind of issue?
[16:10:54] elukey: alright, let's try it
[16:13:07] elukey: I see that it has multiple NICs, is it trying to boot on the correct one?
[16:19:07] XioNoX: good point, I can check in the BIOS config, but it worked before so I'd be puzzled if something changed in the past months
[16:19:21] what do you prefer? Shall I retry to force PXE?
[16:20:01] (checking the BIOS settings in the meantime)
[16:23:01] yeah check the BIOS, the check-cable stuff seems like it's not even trying to pxeboot
[16:44:57] XioNoX: I changed some settings, and now I can see the MAC address of the NIC with physical link speed trying to contact DHCP while doing PXE (no more Media error etc..)
[16:45:04] but it doesn't get any IP
[16:47:40] what's the mac?
[16:48:38] e4:3d:1a:ad:d7:a0
[16:50:47] I am tcpdumping on install1004
[16:56:00] thx
[16:56:12] got sidetracked with the esams DDoS stuff
[16:57:53] ah sorry, we can do it another time
[17:03:36] elukey: I'm back now :)
[17:04:51] <3
[17:07:23] elukey: I'm running a tcpdump, you can start the re-image
[17:07:47] I'll force a PXE run if it is the same
[17:07:51] (quicker)
[17:09:01] yep
[17:10:16] XioNoX: it is trying PXE
[17:10:34] the switch port is not learning any MAC
[17:10:41] so I guess it's trying on the wrong interface?
[17:11:19] wait, now it worked
[17:11:34] ah yep, it learned it
[17:11:39] and I see it
[17:11:43] the dhcp
[17:11:47] whyyyyyyyyyyyyyy
[17:11:52] I tried like 4 times
[17:11:55] I guess it just needed to be watched closely
[17:12:01] ahahhaha
[17:12:02] 4 times since changing the NIC?
[17:12:13] a couple since changing the nic
[17:12:27] (still not in d-i though)
[17:13:55] ok it is in d-i
[17:14:23] rows E and F don't like me these days
[17:18:17] I have two other nodes to reimage, they showed the same behavior
[17:18:25] I'll check the BIOS on them and try PXE multiple times
[17:19:38] you can try the nic upgrade too
[17:19:46] always a good opportunity
[17:22:12] ack, thanks for the support!
[17:22:17] I'll probably come crying for more tomorrow
[17:22:24] in case, please be patient
[17:24:42] no pb, I'll be there
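On the S3 gateway feature mentioned earlier in the log: radosgw exposes the standard S3 API, so existing clients such as boto3 only need to be pointed at the gateway endpoint. A minimal sketch, with a hypothetical endpoint URL and placeholder credentials and names:

```python
# A minimal sketch of talking to a Ceph S3 gateway (radosgw) with boto3.
# The endpoint URL, credentials, and bucket/object names are hypothetical.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.org",  # hypothetical radosgw endpoint
    aws_access_key_id="ACCESS_KEY",          # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

# Buckets and objects behave as they would against AWS S3
s3.create_bucket(Bucket="ml-models")
s3.upload_file("my-model.zip", "ml-models", "my-model.zip")
print(s3.list_objects_v2(Bucket="ml-models")["KeyCount"])
```

The same endpoint-swap approach is what would let object-storage-based workflows (like the model downloads discussed this morning) move from swift/Hadoop toward Ceph without client rewrites.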