[07:52:56] morning folks!
[07:55:03] elukey: thanks for creating the task :) np, I'd like to schedule the deploy this week
[07:56:37] morning :
[07:56:37] :)
[07:57:45] elukey: do you have time for a short meeting?
[07:58:38] elukey: I want to discuss something related to Diego's project
[07:59:05] aiko: I have a meeting with Joseph in a minute, maybe this afternoon?
[07:59:07] too late?
[08:00:38] elukey: does 2pm work for you?
[08:01:30] sure
[08:02:31] oki :D
[09:41:00] klausman: o/ I was checking one thing in the staging cluster, and I started wondering about its future usage
[09:41:12] since it is not listed in puppet as a child of ml-serve, but as a standalone cluster
[09:41:22] do we think that we'll also test the full Kubeflow on it?
[09:41:35] meaning something different from the ml-serve use cases
[09:41:39] Good idea
[09:41:43] it is an open question, I am still debating with myself
[09:42:12] We mostly made it independent so we could choose versions etc. more freely without accidentally knocking over ml-serve
[09:42:21] because if we decide that it will support serving-only, it may be good to just have it as a child of ml-serve
[09:42:52] sure sure, but we can keep things configured separately if we want
[09:43:10] even if it is a child of ml-serve I mean
[09:43:18] Hrm, let me have a think for a moment
[09:43:51] yes yes no rush, I am going to buy some cherries now, we can discuss when I come back after lunch etc.
[09:43:54] only food for thought
[09:44:08] (going afk, ttyl!)
[09:44:22] ttyl!
[10:38:30] <- lunch and errands
[11:45:13] the more I think about it, the more I'd be inclined to use the staging cluster to test/support the ml-serve use case
[11:45:37] trainwing on the DSE cluster will have a completely different config, and we'll likely not need any staging for it
[11:45:43] anyway, this is a proposal :)
[11:45:43] https://gerrit.wikimedia.org/r/c/operations/puppet/+/801662
[11:45:45] lemme know
[11:57:21] kevinbazira: o/ you can deploy anytime
[11:58:05] thanks for the merge, elukey. starting the deployment now ...
[12:01:22] both eqiad and codfw deployments have been completed successfully
[12:01:28] elukey: Looking. And I think you're right. No k8s cluster we build will be really useful to test things in both the serving and the training cases without a full re-image, at which point there's no use in trying to straddle the two
[12:01:35] checking pods now ...
[12:03:21] NAME READY STATUS RESTARTS AGE
[12:03:21] glwiki-articlequality-predictor-default-6vx6n-deployment-6xgbf2 3/3 Running 0 2m23s
[12:03:21] nlwiki-articlequality-predictor-default-l7rrs-deployment-5mlsjs 3/3 Running 0 2m22s
[12:03:21] all new pods are up and running. \o/
[12:03:39] Very nice
[12:14:21] Morning all!
[12:37:38] klausman: ack thanks!
[12:41:05] klausman: I am trying to understand why we have higher than usual latencies registered by the ml-serve-ctrl nodes
[12:41:24] and afaics the problem is in the time that etcd takes to answer
[12:41:40] so I'd like to try to turn off DRBD again for our etcd clusters
[12:41:40] https://wikitech.wikimedia.org/wiki/Ganeti#Change_disk_template_for_a_VM_(aka_drop_DBRD)
[12:41:45] just to remove a variable
[12:41:50] (the other clusters don't have it)
[12:42:03] we'd be a little less resilient to VM failures/migrations
[12:42:15] but I suspect also more efficient in latencies
[12:42:22] (if not I'll re-add it)
[12:47:21] (bbiab)
[12:48:38] Sounds good!
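(editor's note: a minimal sketch of the drbd -> plain disk-template change described above, following the linked wikitech procedure; the fully qualified host name, the use of sudo, and running this from the Ganeti master are assumptions, not a record of the commands actually run)
    # on the Ganeti master of the cluster hosting the VM, one instance at a time
    sudo gnt-instance shutdown ml-etcd1001.eqiad.wmnet
    # convert the instance's disks from drbd to plain (drops the DRBD mirror)
    sudo gnt-instance modify -t plain ml-etcd1001.eqiad.wmnet
    sudo gnt-instance startup ml-etcd1001.eqiad.wmnet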
[12:48:54] Will we have to re-image the VMs?
[13:31:07] klausman: nope, just a shutdown + start
[13:31:15] so very quick and painless
[13:36:52] ah lovely, this apparently caused the interface "ens5" to be renamed to "ens13"
[13:37:01] so I had to modify /etc/network/interfaces
[13:37:02] sigh
[13:38:05] (ml-etcd1001)
[13:43:39] perfect, ml-etcd100[1-3] migrated, let's see if the latencies are better in the long term
[13:46:42] Very nice.
[13:46:51] The network interface thing is weird
[13:47:23] I suspect the drbd setup is visible as a different PCI device (or changes the device order), thus moving the NIC from slot 5 to 13 or something.
[13:50:55] yeah I agree
[13:54:20] 10ORES, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: revscoring feature extraction error for wikitext pages in Wikidata - https://phabricator.wikimedia.org/T302851 (10elukey) Last step is to deploy ORES, T309536
[13:57:45] klausman: I just noticed that one of the side effects of having ml-staging-codfw under the ml-serve umbrella is that it forces it to use the ml-serve infra users' tokens
[13:57:53] I think that service ops does the same
[13:58:01] the tokens are shared between staging and prod
[13:58:18] Arguments could be made either way for separate vs. shared tokens.
[13:58:42] But I think for the sake of simplicity we might as well keep them the same until we find a reason not to
[13:58:57] ack perfect, otherwise it is puppet hell I am afraid :(
[14:48:14] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Test async predict on kserve - https://phabricator.wikimedia.org/T309623 (10elukey)
[14:49:43] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Test Ray worker in Kserve - https://phabricator.wikimedia.org/T309624 (10elukey)
[15:45:45] We are all deploying ORES tomorrow?
[15:46:02] ORES party?
[15:48:28] chrisalbon: whoever wants to join :D
[15:48:33] Aiko will drive
[15:49:49] Let's gooooooo!
[16:30:01] have a nice rest of the day folks!
[16:33:11] bye Luca! :)
[21:35:53] 10Machine-Learning-Team, 10Discovery-Search, 10Epic, 10Growth-Team (Current Sprint): [EPIC] Growth: Newcomer tasks 1.1.1 (ORES topics) - https://phabricator.wikimedia.org/T240517 (10Tgr)
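(editor's note: the ens5 -> ens13 rename discussed above amounts to updating every reference in /etc/network/interfaces and bringing the renamed interface back up; this is a hedged sketch of such a fix, not the exact commands used on ml-etcd1001)
    sudo sed -i 's/ens5/ens13/g' /etc/network/interfaces   # point the config at the renamed NIC
    sudo ifup ens13                                         # bring the renamed interface up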