[06:47:23] 10Machine-Learning-Team: ML Serve controller vms show a slowly increasing resource usage leak over time - https://phabricator.wikimedia.org/T287238 (10elukey) @cmooney awesome analysis, thanks a lot! The issue seems to be https://github.com/kubernetes/kubernetes/issues/82361. It is hitting only Debian Buster no...
[06:47:31] morning!
[06:59:30] 10artificial-intelligence, 10revscoring, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Move CJK segmentation features to a branch and revert revscoring - https://phabricator.wikimedia.org/T287021 (10elukey) 05Resolved→03Open If possible let's keep this task open to track the deployment to...
[08:21:14] 10Machine-Learning-Team: ML Serve controller vms show a slowly increasing resource usage leak over time - https://phabricator.wikimedia.org/T287238 (10elukey) The issue is also present on worker nodes: ` elukey@ml-serve2001:~$ sudo iptables -L KUBE-FIREWALL -v -n | wc -l 38866 ` I didn't notice any weird cpu p...
[08:23:33] elukey: we run kubelet (via kubeadm) on buster on cloud too, but we've been running iptables-legacy there to workaround a calico bug which might explain why we didn't hit that kubelet bug
[08:24:07] majavah: I was wondering the same thing on -sre :D
[08:24:23] majavah: do you use 1.8.2?
[08:24:39] 1.8.2 of what? iptables?
[08:24:46] yes sorry
[08:24:56] probably yes, I don't think we pull it from backports
[08:25:09] ah interesting so using -legacy may be enough
[08:25:15] I was reading kubeadm::calico_workaround
[08:25:21] I think the original calico bug has been fixed in the meantime, but we haven't migrated back yet
[08:25:41] good :D
[08:26:52] "Thanks to @danwinship, the next kubernetes version (1.17) will contain a fix regarding the particular KUBE-FIREWALL spamming bug" we're on k8s 1.19, so hopefully that wouldn't even cause issues?
[08:29:22] ahhhhh
[08:29:27] we are on .16, okok
[08:29:32] you are so far in the future :D
[08:29:55] perfect then you are fine!
[10:28:45] * elukey lunch!
[13:33:00] good news! It seems that the weird regression found in https://phabricator.wikimedia.org/T287238 has a fix
[13:33:10] we need to upgrade the iptables packages across all k8s buster nodes
[13:33:17] that for the moment are only.. ML nodes :)
[13:33:45] very sneaky issue, like we didn't have enough with kubeflow's stack :D
[14:29:29] sheeeeesh
[14:30:08] Thanks majavah and elukey for working on this
[14:32:41] morning :)
[15:11:19] I am doing a roll reboot of all codfw nodes to start from a clean status
[15:11:40] bare metal hosts were affected as well but didn't show the cpu regression (or at least, not as visible as on vms)
[16:21:21] going to roll-reboot the eqiad nodes
[16:21:34] after this, we should start from a clean state in both places
[17:03:33] ok all servers rebooted
[17:03:48] one thing that I don't get is why only in eqiad kubectl doesn't work on master nodes anymore
[17:04:01] elukey@ml-serve-ctrl1001:~$ sudo kubectl get pods
[17:04:01] The connection to the server localhost:8080 was refused - did you specify the right host or port?
[17:06:00] but from say deploy1002 works
[17:08:40] ahhh wait it might be istio
[17:10:04] yep, a nodeport was configured on port 8080
[17:10:08] perfect, all works :)
[17:11:15] 10Machine-Learning-Team: ML Serve controller vms show a slowly increasing resource usage leak over time - https://phabricator.wikimedia.org/T287238 (10elukey) Deployed the new iptables to all ML buster clusters, preliminary results look really good. Will wait for a day before calling this a victory.
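For reference, the KUBE-FIREWALL leak discussed above can be checked and worked around from the shell. This is a minimal sketch, assuming Debian Buster's stock iptables 1.8.2 with the nf_tables backend (the combination hit by the kubelet bug); the fix actually rolled out here was an upgraded iptables package, while switching to the legacy backend is the interim workaround mentioned for the cloud kubeadm clusters.

    # Count the rules in the KUBE-FIREWALL chain; the leak shows up as this
    # number growing steadily over time (38866 on ml-serve2001 above).
    sudo iptables -L KUBE-FIREWALL -v -n | wc -l

    # Check which iptables version/backend the node runs; Buster ships 1.8.2
    # with the nf_tables backend by default.
    iptables --version

    # Interim workaround (the approach used on the cloud kubeadm clusters):
    # point the iptables alternatives at the legacy backend.
    sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
    sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy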
[17:12:41] going afk for today, ttl!
[17:16:24] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10ACraze)
[17:19:31] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10ACraze) We are one step away from being able to publish our editquality image to the wmf docker registry via the deployment pipeline! I pushed u...
[17:53:03] accraze: https://gerrit.wikimedia.org/r/c/integration/config/+/708175 - niceeeee
[18:01:17] sooo close to publishing the first image!
[18:08:08] lets gooooooo
[21:37:28] ahh it turns out that i had a name-mismatch error in the pipeline config for editquality, just made a patch: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/708352
[21:39:24] once that gets merged, we can merge the integration/config patch and finally publish the editquality image
[22:07:29] 10Lift-Wing, 10ML-Governance, 10Machine-Learning-Team (Active Tasks): Outlinks model card - https://phabricator.wikimedia.org/T287527 (10ACraze)
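The roll reboots mentioned at 15:11 and 16:21 follow the usual drain/reboot/uncordon pattern for Kubernetes workers. A rough sketch of that pattern for a single node, assuming direct kubectl and ssh access; the actual procedure here was presumably driven by the standard SRE reboot tooling rather than ad-hoc commands, and the node name below is only an example.

    NODE=ml-serve2001.codfw.wmnet   # example node name, not a prescribed target

    # Evict workloads and mark the node unschedulable before rebooting.
    kubectl drain "$NODE" --ignore-daemonsets

    # Reboot the host, then watch for it to come back and report Ready.
    ssh "$NODE" sudo reboot
    kubectl get node "$NODE" -w

    # Let the scheduler place pods on it again.
    kubectl uncordon "$NODE"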
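On the kubectl error seen on ml-serve-ctrl1001: "The connection to the server localhost:8080 was refused" is the generic message kubectl prints when it finds no kubeconfig and falls back to its built-in default of localhost:8080, which is also why an Istio NodePort bound to 8080 can make the symptom confusing. A minimal sketch of ruling that out by being explicit about the kubeconfig; the /etc/kubernetes/admin.conf path is the usual kubeadm location and is only an assumption for these hosts.

    # Show what kubectl actually loaded; an essentially empty output means no
    # kubeconfig was found and kubectl is falling back to http://localhost:8080.
    kubectl config view

    # Point kubectl at an explicit kubeconfig instead of relying on the default
    # lookup (the path is an assumption, not necessarily the ML setup).
    sudo kubectl --kubeconfig /etc/kubernetes/admin.conf get pods

    # Or for the whole shell session:
    export KUBECONFIG=/etc/kubernetes/admin.conf
    kubectl get pods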