[05:30:55] (CR) DannyS712: [C: +2] build: Updating dependencies [extensions/ORES] - https://gerrit.wikimedia.org/r/707083 (owner: Libraryupgrader)
[05:31:20] (CR) Legoktm: [C: +2] build: Updating dependencies [extensions/ORES] - https://gerrit.wikimedia.org/r/707083 (owner: Libraryupgrader)
[08:25:08] hello folks
[08:25:42] I have just moved the ml-serve-ctrl1002 (ganeti instance) from drbd to plain (disk template) to see what changes in latencies and resource usage
[08:25:52] (we have done it on etcd nodes as well)
[08:27:19] the rationale is that having docker + dedicated virtual disk synced via drbd may be a little overkill for our use case
[10:34:58] Machine-Learning-Team: ML Serve controller vms show a slowly increasing resource usage leak over time - https://phabricator.wikimedia.org/T287238 (elukey) p:Triage→High
[10:35:17] tried to summarize what I have done in --^
[10:37:49] Machine-Learning-Team: ML Serve controller vms show a slowly increasing resource usage leak over time - https://phabricator.wikimedia.org/T287238 (elukey) The other possibility is some sort of thread / process leak of some sort, even if metrics suggests otherwise: https://grafana.wikimedia.org/d/000000377/h...
[10:41:59] * elukey lunch!
[13:44:08] missing metrics in https://grafana.wikimedia.org/d/G8zPL7-Wz/kubernetes-node for kubelets/calico/etc.. related to ml-serve-ctrl are starting to appear
[14:28:40] preliminary metrics from ml-serve-ctrl1002 look good
[14:28:45] (for the perf regression)
[14:29:21] the drbd disk template is needed to have a way to failover to another vm/ganeti-host in case of a reboot
[14:29:38] but it seems to require some amount of sync between nodes of course
[14:29:52] if we keep the 'plain' setting we'll not be able to have this feature for ml-serve-ctrl node
[14:30:31] but they are in HA set up, so as long as we have the two vms on separate ganeti row we should be fine (namely very low risk of two ganeti nodes running the two vms going down at the same time)
[14:30:38] lemme know your thoughts
[14:31:06] (with the assumption that this is the root cause of the perf regression, not entirely sure yet, will be after the weekend)
[14:54:14] good news! the knative-serving chart should be ready for testing
[15:57:09] err: no releases found that matches specified selector(name=knative-serving-crds) and environment(ml-serve-eqiad), in any helmfile
[15:57:12] buuuu
[16:03:23] it is missing https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/707408
[16:03:39] but I am not sure if it is the right placement in the helmfile base list
[16:24:28] have a good weekend folks :)
[18:20:55] Machine-Learning-Team, artificial-intelligence, Research: [Epic] Article importance prediction model - https://phabricator.wikimedia.org/T155541 (Isaac)
[23:48:46] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Configure tox tests for editquality inference service pipelines - https://phabricator.wikimedia.org/T287053 (ACraze)
[23:49:10] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Configure tox tests for editquality inference service pipelines - https://phabricator.wikimedia.org/T287053 (ACraze) a:ACraze
[23:52:56] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Configure tox tests for editquality inference service pipelines - https://phabricator.wikimedia.org/T287053 (ACraze) Just pushed up a patch that contains a simple tox ini that runs flake8 and black on the editquality model-server. Thi...