[04:42:05] Machine-Learning-Team, Add-Link, Growth-Team: Fix Armenian sentence tokenization bug in the link recommendation algorithm - https://phabricator.wikimedia.org/T327371 (kevinbazira) The Armenian sentence tokenization bug has been fixed in T327371#8631149 hywiki has been added to wikis that will be dep...
[05:32:34] Machine-Learning-Team, Add-Link, Growth-Team: Kyrgyz Wikipedia model training pipeline failed - https://phabricator.wikimedia.org/T329817 (kevinbazira) kywiki has been added to wikis that will be deployed in the 11th round T308136 The sentence tokenization bug has been fixed in T327371#8631149
[05:32:50] Machine-Learning-Team, Add-Link, Growth-Team: Kyrgyz Wikipedia model training pipeline failed - https://phabricator.wikimedia.org/T329817 (kevinbazira) a:kevinbazira
[07:14:20] Machine-Learning-Team, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (ayounsi) Open→Resolved a:ayounsi
[08:15:01] morning folks :)
[08:15:03] isaranto: o/
[08:15:10] o/
[08:15:19] I should have fixed the issues that you found yesterday, lemme know if you still see anything weird in staging!
[08:15:39] yeah! outlink model worked, I can get predictions from transformer
[08:15:52] the issue was that the net-istio controller/webhook (basically the ones that are in charge of reconciling istio configs) were incorrectly using old docker images
[08:15:58] \o/
[08:16:30] ok so the last mystery is to figure out what changed with the default-predictor prefix stuff
[08:16:41] but it is not a big deal, we can easily adapt
[08:16:55] I'll create tasks to move ml-serve clusters to 1.23 as well
[08:17:01] either using `outlink-topic-model.articletopic-outlink.wikimedia.org` or `outlink-topic-model-transformer-default.articletopic-outlink.wikimedia.org` as host header
[08:17:10] great job guys!
[08:17:19] ah wait does the first one work?
[08:17:52] because I tried revscoring model servers, and without the -default thing I get a 404
[08:19:01] w8 I am testing everything
[08:19:23] for outlink it worked
[08:20:06] then I am confused
[08:20:22] can you also try the revscoring models to see if I am crazy?
[08:20:37] in the istio routes I only see
[08:20:38] outlink-topic-model-transformer-default.articletopic-outlink
[08:22:21] I tried with httpbb got a lot of timeouts
[08:23:58] ok it seems like it works with http not with HTTPS. the httpbb tests return a `ERROR: HTTPSConnectionPool(host='inference-staging.svc.codfw.wmnet', port=30443): Read timed out. (read timeout=10)`
[08:25:14] mmm nope it doesn't work with HTTP, as expected
[08:25:18] I am trying with curl
[08:26:06] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 (elukey) Cluster up and running with k8s 1.23! All pods are working, and I already noticed some improvements: no alert has been fired (latency-wise) when deploying all the...
[08:27:43] This is working for me
[08:27:43] ```curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/outlink-topic-model:predict" -X POST -d @input_outlink.json -i -H "Host: outlink-topic-model.articletopic-outlink.wikimedia.org"```
[08:28:08] isaranto: I copied the tests on deploy1002 to my local home, kept only one for articlequality and added "-predictor-default" to the Host header, httpbb worked
[08:31:39] yes so this seems to be what changed
[08:31:55] 1) if we have only a predictor, the host header wants "-predictor-default"
[08:32:02] 2) if we have a transformer, this is not needed
[08:32:11] applying the above I was able to make httpbb work
[08:35:16] weird..
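The two rules above (predictor-only services want the "-predictor-default" suffix, services with a transformer do not) can be sketched as a small helper. This is only an illustration of what was observed in staging, not anything from the httpbb tests or from kserve itself: the function name `fast_host_header`, its parameters and the default `domain` value are hypothetical.

```python
def fast_host_header(isvc_name, namespace, has_transformer, domain="wikimedia.org"):
    """Build the Host header that worked with httpbb in staging.

    Assumption taken from the chat above: a predictor-only service wants
    the "-predictor-default" suffix, a service with a transformer does not.
    All names here are hypothetical, for illustration only.
    """
    suffix = "" if has_transformer else "-predictor-default"
    return f"{isvc_name}{suffix}.{namespace}.{domain}"
```

For the outlink model (which has a transformer), `fast_host_header("outlink-topic-model", "articletopic-outlink", True)` gives the same Host header used in the curl command above.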
[08:35:27] I also don't like the timeouts without the correct header
[08:35:36] I expected a quick 404
[08:35:52] yeah they take too long
[08:36:42] if this is final lemme fix the httpbb tests
[08:36:59] if not I'll wait
[08:51:34] not 100% sure yet, lemme dig a bit more
[09:05:15] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (elukey)
[09:06:49] (PS1) AikoChou: revertrisk-multilingual: upgrade base image and trigger new pipeline [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/891235 (https://phabricator.wikimedia.org/T329936)
[09:14:03] (CR) Elukey: [C: +1] revertrisk-multilingual: upgrade base image and trigger new pipeline [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/891235 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[09:18:21] isaranto: ok this is weird.. so enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org does work, but it takes 11s to complete
[09:18:55] as opposed to enwiki-goodfaith-predictor-default.revscoring-editquality-goodfaith.wikimedia.org that takes 0.4s !!!
[09:22:41] wow huge difference
[09:22:54] and it doesn't seem on the kserve model server side
[09:23:12] so probably the knative/istio routing takes an extra 10s (maybe a timeout somewhere?)
[09:23:59] interesting..
[09:53:18] Machine-Learning-Team, Patch-For-Review: Upgrade the ml-staging-codfw cluster to k8s 1.23 - https://phabricator.wikimedia.org/T327767 (elukey) Correcting myself - the `enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org` Host header works, but it takes ~10 seconds to complete.
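One way to make the 11s vs 0.4s comparison repeatable is to time the same request once per Host header. A minimal sketch follows; `time_hosts`, `slow_hosts` and the 5-second threshold are hypothetical, and `send` stands for whatever callable actually performs the request (for example a wrapper around curl or httpbb), which is not shown here.

```python
import time


def time_hosts(send, hosts):
    """Call send(host) once per Host header and return {host: seconds}.

    `send` is a caller-supplied function (hypothetical here) that performs
    the prediction request with the given Host header.
    """
    timings = {}
    for host in hosts:
        start = time.perf_counter()
        send(host)
        timings[host] = time.perf_counter() - start
    return timings


def slow_hosts(timings, threshold=5.0):
    """Return, sorted, the hosts whose request took longer than `threshold` seconds."""
    return sorted(host for host, seconds in timings.items() if seconds > threshold)
```

Running `time_hosts` several times per header would also show whether the slow path is hit on every request or only on some of them.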
[09:55:25] so revert-risk does the same, it seems not a problem with kserve 0.10 vs 0.9
[10:01:10] (CR) Ilias Sarantopoulos: [C: +1] "👍" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/891235 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[10:11:24] (CR) AikoChou: [C: +2] revertrisk-multilingual: upgrade base image and trigger new pipeline [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/891235 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[10:12:30] (Merged) jenkins-bot: revertrisk-multilingual: upgrade base image and trigger new pipeline [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/891235 (https://phabricator.wikimedia.org/T329936) (owner: AikoChou)
[10:26:37] ok so the issue seems to be the knative-local-gateway
[10:27:00] in theory if we use the -predictor-default suffix, we hit the ingress gateway and then the pod directly
[10:27:12] but without it, we need to go through the knative-local-gateway
[10:27:21] that proxies it
[10:27:43] it is another istio gateway basically
[10:28:07] I am wondering if the issue with the net-istio pods may have caused config troubles
[10:28:49] You mean one shadowing the other? (also \o)
[10:32:52] o/
[10:33:14] not sure, maybe there is something misconfigured and the extra hop takes 10s
[10:33:21] it is consistent, so it seems a timeout of some sort
[10:33:40] 10s sounds like DNS, once more :D
[10:34:39] but how? I can't really think about it as problem
[10:45:02] You're right, it doesn't make sense.
[10:45:37] does a nsenter curl show anything useful? It might show whether the timeout is in resolving or connect()
[10:47:57] I am trying to figure out how to measure the issue, since now we have only one set of istio pods
[10:48:08] the old "cluster-local-gateway" has been deprecated
[10:48:21] now we have two knative Istio gateways only on the Gateway istio pod
[10:48:22] *pods
[10:48:33] I see logs on those mentioning the 10s
[10:48:36] but not really why
[10:49:53] also it is weird since sometimes (most of them) it takes 10s, others it is super quick
[10:57:19] 10s vs fast does sound like either something cached sometimes _or_ it has two ways to do it and picks at random, with one being slow
[11:02:31] I recycled the istio pods but didn't improve anything
[11:16:26] Going for lunch and a walk. Maybe I can come up with an idea.
[11:40:11] * elukey lunch!
[12:53:01] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (gmodena) Is there a known timeline / ETA for this upgrade?
[12:57:21] * isaranto lunch
[14:18:55] Machine-Learning-Team, Shared-Data-Infrastructure, Epic: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (elukey) @gmodena are you doing any experiments on it? If not I can try to do it tomorrow or Friday, I have to wipe everything so this is why I am asking :)
[14:29:35] haven't used minikube in a while. having enough RAM makes it much better :)
[14:33:49] previously I was using kind a lot, but thought I'd give minikube another try
[14:45:39] definitely yes
[14:49:22] aiko: o/
[14:49:51] if possible deploy only on ml-staging, in prod there is a complication with limits etc.. that will be solved when we upgrade
[14:49:54] is it ok??
[14:55:15] elukey: ok! no problem
[15:53:29] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (gmodena) @elukey we have some Flink ops tasks this sprint that will require re-deploying our PoC app on DSE. It's unlikely that we'll deploy this week, a...
[15:54:45] Lift-Wing, Machine-Learning-Team: Support the Revert-Review API/tool on Toolforge - https://phabricator.wikimedia.org/T330148 (MunizaA)
[15:57:46] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (Ottomata) @gmodena I think elukey can just do this anytime, no? We don't mind if our stuff is deleted, we can redeploy, right?
[15:59:20] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (gmodena) @Ottomata absolutely. Just wanted to sync so we avoid attempting deployments / experiments during a maintenance window.
[16:16:19] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (elukey)
[16:17:47] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (elukey) @gmodena we'll probably do it on Friday, there is some prep-work to be done and it will be done tomorrow :)
[16:18:12] Machine-Learning-Team, Shared-Data-Infrastructure, Epic, Patch-For-Review: Upgrade DSE to k8s 1.23 - https://phabricator.wikimedia.org/T330261 (Ottomata) Ah got it, sounds good!
[16:49:51] Lift-Wing, Machine-Learning-Team: Deploy revert-risk multilingual model to production - https://phabricator.wikimedia.org/T325218 (achou) CI pipeline for the revertrisk-multilingual has been added, the production images can be found in: https://docker-registry.wikimedia.org/wikimedia/machinelearning-lift...
[16:55:43] * isaranto deeple regrets upgrading macos to Ventura
[17:03:01] *deeply
[18:07:17] talk with you tomorrow folks!
[18:31:39] Machine-Learning-Team, Edit-Review-Improvements-Integrated-Filters, Growth-Team, Research: Integration of Revert Risk Scores to Recent Changes as a filter - https://phabricator.wikimedia.org/T329071 (Ottomata) I'm at 80% understanding, but let me try to summarize. There is a new ML model that is...
[18:39:34] (CR) Ottomata: events: support multiple source events (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/888190 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[18:42:37] Machine-Learning-Team, Patch-For-Review: Implement new mediawiki.revision-score streams with Lift Wing - https://phabricator.wikimedia.org/T328576 (Ottomata) I know this is more work, and maybe not worth it since we want to eventually deprecate these ORES models (right?), but ... The mediawiki/revision/...
[22:57:26] Machine-Learning-Team, Data-Engineering, Data-Persistence, Discovery-Search, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (colewhite)