[12:52:26] jayme: o/ I was reading T353233, do you think that we should review ml-serve's control plane too?
[12:54:31] elukey: you probably have way fewer nodes - but it won't hurt to check I'd say
[13:02:27] jayme: ack, anything relevant that I have to pay attention to? I'll try to review and possibly expand, we'll get 8 more nodes on each cluster next Q
[13:02:38] Don't let typha break :p
[13:02:51] that is the goal :D
[13:03:11] I mean besides memory/CPU usage etc.
[13:55:45] elukey: I was only looking at CPU / Mem usage on the control planes really and saw big CPU spikes during mw deployments for example
[13:56:04] plus an elevated base CPU usage with the cluster growing
[13:56:40] we'll probably switch from ganeti to hardware nodes soon-ish - also because of the IOPS requirements for etcd
[14:12:35] ack thanks!
[15:59:29] * inflatador wonders if it would be possible to create a "hi iops" ganeti tier...at least, something that's not RAID-5 ;)
[16:27:50] jayme: I have a question about change 983191. While the kubeControllers explicitly have their CPU limit set to ~, the typha ones inherit the limit from the main.yaml file (which is also ~). Is there a reason why the limit is explicitly set for one but not the other?
[16:28:37] FYI: for https://phabricator.wikimedia.org/T352906 I have bumped the global envoy image version to 1.23.10-2-s4-20231203 (it includes latest patches + WMF CAs). In a merely tangential change, I have bumped all charts to utilize the newer patch levels (x.y.Z) of mesh.configuration in order to utilize the CA bundle that includes public CAs and not
[16:28:37] just WMF CAs
[16:28:56] I've upgraded all of the wikikube services, don't be alarmed if you notice those 2 changes in your own services
[16:30:05] inflatador: the RAID5 thing there is just so there is some redundancy and we don't end up with a major and confusing-to-debug issue if 1 disk fails. It can be any RAID level we want, including JBOD, RAID6, RAID10, etc.
[16:31:08] we do have redundancy on the VM level, we can always just fail it over to the secondary node and restore the service ofc. RAID5 isn't there for service redundancy.
[16:31:24] akosiaris: for my own ignorance - ca-certificates.crt contains public and WMF-internal CA certs?
[16:31:42] akosiaris ACK, do we have ganeti tiers already? Or is that just a possibility?
[16:35:22] inflatador: yes. The PoP ganeti clusters, IIRC, have RAID1 (and just 2 disks) 'cause they are supposed to have just a couple of core services, essential to a PoP.
[16:35:35] elukey: yes
[16:36:06] super