[06:47:23] 10Machine-Learning-Team: ML Serve controller vms show a slowly increasing resource usage leak over time - https://phabricator.wikimedia.org/T287238 (10elukey) @cmooney awesome analysis, thanks a lot! The issue seems to be https://github.com/kubernetes/kubernetes/issues/82361. It is hitting only Debian Buster no...
[06:47:31] morning!
[06:59:30] 10artificial-intelligence, 10revscoring, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Move CJK segmentation features to a branch and revert revscoring - https://phabricator.wikimedia.org/T287021 (10elukey) 05Resolved→03Open If possible let's keep this task open to track the deployment to...
[08:21:14] 10Machine-Learning-Team: ML Serve controller vms show a slowly increasing resource usage leak over time - https://phabricator.wikimedia.org/T287238 (10elukey) The issue is also present on worker nodes: ` elukey@ml-serve2001:~$ sudo iptables -L KUBE-FIREWALL -v -n | wc -l 38866 ` I didn't notice any weird cpu p...
[08:23:33] elukey: we run kubelet (via kubeadm) on buster on cloud too, but we've been running iptables-legacy there to workaround a calico bug which might explain why we didn't hit that kubelet bug
[08:24:07] majavah: I was wondering the same thing on -sre :D
[08:24:23] majavah: do you use 1.8.2?
[08:24:39] 1.8.2 of what? iptables?
[08:24:46] yes sorry
[08:24:56] probably yes, I don't think we pull it from backports
[08:25:09] ah interesting so using -legacy may be enough
[08:25:15] I was reading kubeadm::calico_workaround
[08:25:21] I think the original calico bug has been fixed in the meantime, but we haven't migrated back yet
[08:25:41] good :D
[08:26:52] "Thanks to @danwinship, the next kubernetes version (1.17) will contain a fix regarding the particular KUBE-FIREWALL spamming bug" we're on k8s 1.19, so hopefully that wouldn't even cause issues?
[08:29:22] ahhhhh
[08:29:27] we are on .16, okok
[08:29:32] you are so far in the future :D
[08:29:55] perfect then you are fine!
[10:28:45] * elukey lunch!
[13:33:00] good news! It seems that the weird regression found in https://phabricator.wikimedia.org/T287238 has a fix
[13:33:10] we need to upgrade the iptables packages across all k8s buster nodes
[13:33:17] that for the moment are only.. ML nodes :)
[13:33:45] very sneaky issue, like we didn't have enough with kubeflow's stack :D
[14:29:29] sheeeeesh
[14:30:08] Thanks majavah and elukey for working on this
[14:32:41] morning :)
[15:11:19] I am doing a roll reboot of all codfw nodes to start from a clean status
[15:11:40] bare metal hosts were affected as well but didn't show the cpu regression (or at least, not as visible as on vms)
[16:21:21] going to roll-reboot the eqiad nodes
[16:21:34] after this, we should start from a clean state in both places
[17:03:33] ok all servers rebooted
[17:03:48] one thing that I don't get is why only in eqiad kubectl doesn't work on master nodes anymore
[17:04:01] elukey@ml-serve-ctrl1001:~$ sudo kubectl get pods
[17:04:01] The connection to the server localhost:8080 was refused - did you specify the right host or port?
[17:06:00] but from say deploy1002 works
[17:08:40] ahhh wait it might be istio
[17:10:04] yep, a nodeport was configured on port 8080
[17:10:08] perfect, all works :)
[17:11:15] 10Machine-Learning-Team: ML Serve controller vms show a slowly increasing resource usage leak over time - https://phabricator.wikimedia.org/T287238 (10elukey) Deployed the new iptables to all ML buster clusters, preliminary results look really good. Will wait for a day before calling this a victory.
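For reference, the KUBE-FIREWALL leak discussed above can be checked and worked around from the shell. This is a minimal sketch, assuming Debian Buster's stock iptables 1.8.2 with the nf_tables backend (the combination hit by the kubelet bug); the fix actually rolled out here was an upgraded iptables package, while switching to the legacy backend is the interim workaround mentioned for the cloud kubeadm clusters.

    # Count the rules in the KUBE-FIREWALL chain; the leak shows up as this
    # number growing steadily over time (38866 on ml-serve2001 above).
    sudo iptables -L KUBE-FIREWALL -v -n | wc -l

    # Check which iptables version/backend the node runs; Buster ships 1.8.2
    # with the nf_tables backend by default.
    iptables --version

    # Interim workaround (the approach used on the cloud kubeadm clusters):
    # point the iptables alternatives at the legacy backend.
    sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
    sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy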
[17:12:41] going afk for today, ttl!
[17:16:24] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10ACraze)
[17:19:31] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10ACraze) We are one step away from being able to publish our editquality image to the wmf docker registry via the deployment pipeline! I pushed u...
[17:53:03] accraze: https://gerrit.wikimedia.org/r/c/integration/config/+/708175 - niceeeee
[18:01:17] sooo close to publishing the first image!
[18:08:08] lets gooooooo
[21:37:28] ahh it turns out that i had a name-mismatch error in the pipeline config for editquality, just made a patch: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/708352
[21:39:24] once that gets merged, we can merge the integration/config patch and finally publish the editquality image
[22:07:29] 10Lift-Wing, 10ML-Governance, 10Machine-Learning-Team (Active Tasks): Outlinks model card - https://phabricator.wikimedia.org/T287527 (10ACraze)
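The roll reboots mentioned at 15:11 and 16:21 follow the usual drain/reboot/uncordon pattern for Kubernetes workers. A rough sketch of that pattern for a single node, assuming direct kubectl and ssh access; the actual procedure here was presumably driven by the standard SRE reboot tooling rather than ad-hoc commands, and the node name below is only an example.

    NODE=ml-serve2001.codfw.wmnet   # example node name, not a prescribed target

    # Evict workloads and mark the node unschedulable before rebooting.
    kubectl drain "$NODE" --ignore-daemonsets

    # Reboot the host, then watch for it to come back and report Ready.
    ssh "$NODE" sudo reboot
    kubectl get node "$NODE" -w

    # Let the scheduler place pods on it again.
    kubectl uncordon "$NODE"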
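On the kubectl error seen on ml-serve-ctrl1001: "The connection to the server localhost:8080 was refused" is the generic message kubectl prints when it finds no kubeconfig and falls back to its built-in default of localhost:8080, which is also why an Istio NodePort bound to 8080 can make the symptom confusing. A minimal sketch of ruling that out by being explicit about the kubeconfig; the /etc/kubernetes/admin.conf path is the usual kubeadm location and is only an assumption for these hosts.

    # Show what kubectl actually loaded; an essentially empty output means no
    # kubeconfig was found and kubectl is falling back to http://localhost:8080.
    kubectl config view

    # Point kubectl at an explicit kubeconfig instead of relying on the default
    # lookup (the path is an assumption, not necessarily the ML setup).
    sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods

    # Or for the whole shell session:
    export KUBECONFIG=/etc/kubernetes/admin.conf
    kubectl get pods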