[07:22:34] hello folks
[08:33:33] Lift-Wing, Machine-Learning-Team: Deploy revert-risk multilingual model to production - https://phabricator.wikimedia.org/T325218 (achou) Resolved→In progress
[08:39:33] Lift-Wing, Machine-Learning-Team: Deploy revert-risk multilingual model to production - https://phabricator.wikimedia.org/T325218 (achou) Current status: - the latest multilingual model was deployed in ml-staging-codfw - working on a separate blubberfile and pipeline for the model, so it no longer share...
[08:50:27] (PS1) Ilias Sarantopoulos: outlink: upgrade python to 3.9 and debian image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/890341 (https://phabricator.wikimedia.org/T328438)
[08:55:37] (CR) CI reject: [V: -1] outlink: upgrade python to 3.9 and debian image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/890341 (https://phabricator.wikimedia.org/T328438) (owner: Ilias Sarantopoulos)
[09:00:53] o/
[09:01:31] great start of week with dns problem and git not working due to mac upgrade
[09:01:32] lol
[09:01:36] all good now
[09:01:58] aiko: how do u want to proceed with the patches about revertrisk?
[09:02:46] I also have outlink ready but it will be based on the changes I did in revertrisk
[09:04:28] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 (elukey) The knative webhook pod keeps erroring out for: ` {"severity":"WARNING","timestamp":"2023-02-20T09:03:18.328573361Z","logger":"webhook","caller":"webhook/webhook...
[09:05:50] (PS8) Ilias Sarantopoulos: revertrisk: upgrade python 3.9 and debian [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439)
[09:19:58] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Jelto)
[09:22:19] (CR) Elukey: [C: +1] revertrisk: upgrade python 3.9 and debian (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439) (owner: Ilias Sarantopoulos)
[09:24:33] (PS9) Ilias Sarantopoulos: revertrisk: upgrade python 3.9 and debian [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439)
[09:26:27] (CR) Elukey: revertrisk: create blubberfile and pipeline for each version (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[09:33:21] ok, I'm going to merge mine and tell you what u need to add in your patch
[09:33:49] (CR) Ilias Sarantopoulos: [C: +2] revertrisk: upgrade python 3.9 and debian [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439) (owner: Ilias Sarantopoulos)
[09:39:28] (Merged) jenkins-bot: revertrisk: upgrade python 3.9 and debian [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/889502 (https://phabricator.wikimedia.org/T328439) (owner: Ilias Sarantopoulos)
[09:42:27] Machine-Learning-Team, Add-Link, Growth-Team: Fix Armenian sentence tokenization bug in the link recommendation algorithm - https://phabricator.wikimedia.org/T327371 (kevinbazira) @MGerlach, thank you for the recommendations. I have tested the fix locally but the hywiki training pipeline still got st...
[09:45:58] (PS2) Ilias Sarantopoulos: outlink: upgrade python to 3.9 and debian image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/890341 (https://phabricator.wikimedia.org/T328438)
[09:55:47] and a nice Sunday/monday morning read https://journal.arrikto.com/gpu-virtualization-in-k8s-challenges-and-state-of-the-art-a1cafbcdd12b
[09:55:57] elukey: ☝️
[10:03:34] (PS3) Ilias Sarantopoulos: outlink: upgrade python to 3.9 and debian image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/890341 (https://phabricator.wikimedia.org/T328438)
[10:04:42] (CR) Ilias Sarantopoulos: "All changes have been tested locally and work fine using transformer and predictor images." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/890341 (https://phabricator.wikimedia.org/T328438) (owner: Ilias Sarantopoulos)
[10:09:13] isaranto: nice thanks!
[10:09:19] are there any hopes?
[10:16:29] ¯\_(ツ)_/¯
[10:17:14] from other stuff I've read as well using it for 1-3 models could be done
[10:18:58] I think for now NVIDIA cuda time-slicing seems to do the trick
[10:22:33] I can imagine that it is a game changer for gpu vendors
[10:23:52] However there are no memory guarantees and u need to handle this stuff on the application side
[10:28:20] so I don't know how that would play out in a prod environment at the moment
[10:32:20] \o
[10:32:58] Yay: less sore arms because during the weekend, I did not-computer-stuff. Boo: my knees and feet hurt :D
[10:37:07] :)
[10:39:24] nice to hear that klausman! (I mean the first part)
[10:39:59] The knees and feet are mainly for lack of exercise, so it's sortof a good kind of pain.
[11:06:27] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 (elukey) The autoscaler seems to fail as well (in turn it blocks the activator to come up, because it tries to contact the autoscaler as part of its bootstrap): ` root@de...
[11:08:26] I have updated some info about knative-serving, it doesn't currently bootstrap correctly on ml-staging --^
[11:08:38] helm at some point rollsback the pods and it clean the mup
[11:08:41] *them up
[11:15:04] (CR) Ilias Sarantopoulos: revertrisk: create blubberfile and pipeline for each version (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[11:15:31] aiko: I added a comment on what you need to do on your patch so that it plays well with the merged code
[11:32:20] isaranto: o/ ack
[11:32:56] lemme know if u need anything else
[11:35:55] * elukey lunch!
[11:53:57] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Clement_Goubert)
[12:01:06] (CR) Klausman: [C: +1] outlink: upgrade python to 3.9 and debian image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/890341 (https://phabricator.wikimedia.org/T328438) (owner: Ilias Sarantopoulos)
[12:05:24] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (Sgs) a:Sgs
[12:06:43] * klausman lunch
[12:06:56] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (Sgs) a:kevinbazira→Sgs
[12:07:03] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (Sgs) a:kevinbazira→Sgs
[12:07:51] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (Sgs)
[12:08:33] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (Sgs)
[12:08:44] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (Sgs)
[12:09:01] * isaranto lunch!
[12:54:13] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135 (kevinbazira) 18/19 models were trained successfully in the 10th round of wikis. The Kyrgyz Wikipedia (kywiki) pipeline did not complete successfully...
[12:56:46] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135 (kevinbazira)
[13:27:03] (CR) AikoChou: [C: +1] outlink: upgrade python to 3.9 and debian image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/890341 (https://phabricator.wikimedia.org/T328438) (owner: Ilias Sarantopoulos)
[13:38:49] (PS7) AikoChou: revertrisk: create blubberfile and pipeline for each version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936)
[13:41:13] (CR) Ilias Sarantopoulos: [C: +1] "👍" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[13:50:38] (CR) AikoChou: revertrisk: create blubberfile and pipeline for each version (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[14:11:58] (CR) Elukey: [C: +1] outlink: upgrade python to 3.9 and debian image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/890341 (https://phabricator.wikimedia.org/T328438) (owner: Ilias Sarantopoulos)
[14:22:40] (CR) Elukey: revertrisk: create blubberfile and pipeline for each version (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[14:26:34] klausman: o/
[14:26:38] when you have a moment
[14:26:39] https://gerrit.wikimedia.org/r/c/operations/dns/+/889661
[14:26:47] looking
[14:27:06] (basically the pre-requisite for https://gerrit.wikimedia.org/r/c/operations/puppet/+/889663)
[14:27:16] ack.
[14:27:44] do you have some time to merge/roll-them-out?
[14:28:07] Hm. Would it be worth it to have both hyphen and underscore entries at the same time, and only remove the underscore one after rollout/reimage?
[14:29:07] what do you mean?
[14:29:30] in theory this affects only the discovery records used by etcd, nothing more
[14:29:39] So if you submit the DNS one now, 5m later the DNS entry for _ will go away, so unless we also quickly push the other, there would be potential for breakage
[14:30:08] yes yes there is, but the prod clusters are not really doing anything atm :)
[14:30:24] Alrighty then :)
[14:31:50] I can do the authdns bits and run-puppet-agent once both are submitted. I presume that plus an etcd restart is enough?
[14:32:33] in theory puppet should do it for you, the first runs may not work if the dns record is still in cache etc..
[14:32:45] but yeah overall +1
[14:32:47] yeah, I figured.
[14:32:53] Ok, then ready when you are :)
[14:33:31] do it whenever you want, I am going to keep working on the knative stuff :) ping me if needed
[14:33:48] Ay aye cap'n
[14:34:16] to confirm: you want me to merge both those changes and do the pokey bits?
[14:34:33] elukey: ^^^
[14:34:39] yes yes as you see fit of course
[14:34:53] (CR) Ilias Sarantopoulos: [C: +1] revertrisk: create blubberfile and pipeline for each version (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[14:34:54] Ok, doing it right away
[14:35:39] (CR) Ilias Sarantopoulos: revertrisk: create blubberfile and pipeline for each version (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[14:36:06] (CR) Elukey: revertrisk: create blubberfile and pipeline for each version (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[15:04:40] Ok, all done. The servers still log "read-only range request ... took too long (...) to execute", but I suspect that will settle eventually
[15:08:45] nice! Etcd you mean?
[15:09:42] I saw some of them yesterday as well, so it shouldn't be related..
[15:13:57] yeah
[15:14:18] I'll keep an eye on it, but nothing is alerting and the etcd graphs on Grafana look ok
[15:17:54] (CR) AikoChou: revertrisk: create blubberfile and pipeline for each version (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[15:19:07] (CR) Ilias Sarantopoulos: [C: +2] outlink: upgrade python to 3.9 and debian image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/890341 (https://phabricator.wikimedia.org/T328438) (owner: Ilias Sarantopoulos)
[15:22:57] (Merged) jenkins-bot: outlink: upgrade python to 3.9 and debian image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/890341 (https://phabricator.wikimedia.org/T328438) (owner: Ilias Sarantopoulos)
[15:26:47] (PS8) AikoChou: revertrisk: create blubberfile and pipeline for each version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936)
[15:27:48] (PS9) AikoChou: revertrisk: create blubberfile and pipeline for each version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936)
[15:31:58] (CR) AikoChou: "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[15:35:12] Machine-Learning-Team: [nsfw] Upgrade python and debian in docker image - https://phabricator.wikimedia.org/T329612 (isarantopoulos) a:isarantopoulos
[15:36:08] (PS1) Ilias Sarantopoulos: nsfw: upgrade python to 3.9 and debian to bullseye [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/890473 (https://phabricator.wikimedia.org/T329612)
[15:36:12] (CR) Elukey: [C: +1] revertrisk: create blubberfile and pipeline for each version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[15:36:51] (CR) Ilias Sarantopoulos: revertrisk: create blubberfile and pipeline for each version (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[15:37:03] (CR) Ilias Sarantopoulos: [C: +1] revertrisk: create blubberfile and pipeline for each version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[15:38:22] M1 related error: I am getting an error when I try to run nsfw model locally ```The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
[15:38:22] qemu: uncaught target signal 6 (Aborted) - core dumped```
[15:39:34] according to https://github.com/tensorflow/tensorflow/issues/52845 it is related to apple silicon (M1). anyone encountered/solved it? haven't managed to do it so far
[15:41:48] nope :(
[15:43:19] cool, will figure it out!
[15:44:47] I would suspect that it either can be compiled without using AVX, or that they provide non-AVX packages somewhere
[16:00:15] I am still not able to figure out why the autoscaler pod of knative fails with
[16:00:29] Failed to get k8s version Get "https://10.194.62.1:443/version": dial tcp 10.194.62.1:443: i/o timeout
[16:00:44] the IP is the ClusterIP for the kubernetes API, it works on other pods
[16:01:06] I checked and it should have egress rules to contact the kubernetes IPs (the target ones basically)
[16:01:22] I checked with nsenter and indeed curl times out, but I am not sure why
[16:01:25] I mean for istio it works fine
[16:03:11] really weird
[16:15:03] isaranto: I tried to run it on my mac and got the same error. seems no solution right now. maybe you can test the model in ml-sandbox instead
[16:18:34] (CR) AikoChou: [C: +2] revertrisk: create blubberfile and pipeline for each version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[16:24:04] ok found the first issue :D
[16:24:05] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/890485/
[16:24:15] that is a Luca issue basically
[16:24:18] very sneaky :D
[16:24:46] (Merged) jenkins-bot: revertrisk: create blubberfile and pipeline for each version [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888824 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[16:29:55] nice work aiko and isaranto :)
[16:40:04] elukey: tbh, I would not be surprised if everybody at WMF would have missed that
[16:54:10] Aiko: ack. Thanks Luca
[16:54:36] This batch of python upgrades were easy. Nothing compared to revscoring
[16:54:45] Cya tomorrow folks!
[17:21:01] ok so the autoscaler problem is resolved, that in turn fixed also the activator
[17:21:04] now I am down to two
[17:22:22] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 (elukey) Remaing issues with knative: * the webhook pod doesn't come up ` 2023/02/20 17:19:53 Registering 2 clients 2023/02/20 17:19:53 Registering 2 informer factories...
[17:31:48] will restart the fight tomorrow :)
[17:31:53] have a nice rest of the day folks!
[17:31:56] heading out as well
[17:32:03] \o
[23:01:09] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135 (kevinbazira) Model evaluation has been completed and below are the backtesting results: | | Precision@0.5 | Recall@0.5 |kawiki | 0.82 | 0.34 |kaawiki...