[04:15:45] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (kevinbazira) a: kevinbazira
[08:11:25] Machine-Learning-Team, ORES, Documentation, User-AKlapper: Update docs that ORES will be replaced by Lift Wing - https://phabricator.wikimedia.org/T305963 (Aklapper) @calbon: Any news to share? Or ways that others can contribute? I just ended up on https://www.mediawiki.org/wiki/ORES and it does...
[09:02:57] Machine-Learning-Team, Foundational Technology Requests, Prod-Kubernetes, Kubernetes, Patch-For-Review: Import new knative-serving version (k8s 1.23 dependency for ML) - https://phabricator.wikimedia.org/T323793 (elukey) I was able to do a basic test of istio + knative + kserve on minikube:...
[09:03:31] Machine-Learning-Team, Foundational Technology Requests, Prod-Kubernetes, Kubernetes, Patch-For-Review: Import new knative-serving version (k8s 1.23 dependency for ML) - https://phabricator.wikimedia.org/T323793 (elukey)
[09:26:18] Machine-Learning-Team, Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (isarantopoulos) @elukey Thank you for the explanation. I haven't checked about ray workers but I think it is worth the effort as it seems the "standard" way to do parallel infe...
[09:36:39] Machine-Learning-Team: Upgrade ML clusters to Kubernetes 1.23 - https://phabricator.wikimedia.org/T324542 (elukey)
[11:19:53] * isaranto afk lunch
[11:32:02] Machine-Learning-Team, Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (elukey) Really nice set of tests! Did you notice any connection timeout or errors from Benthos? The high latencies with SP caused, when I tested it, timeouts and connection pil...
[12:38:31] \o
[12:38:55] I shared the output of a testing tool I made on Slack. Looks like all the revscoring models work via the API GW :D
[12:39:25] The outlink ones I haven't made code for yet (since their queries are different)
[12:41:28] (also, memo to self: don't go chasing a bug in the API GW if you use "arcticletopic" as a model name...)
[13:20:55] Do we have any idea why some (all?) of the special-wiki models fail? E.g. curl -s "https://inference.discovery.wmnet:30443/v1/models/enwiktionarywiki-reverted:predict" -d '{ "rev_id": 123456 }' -i -H "Host: enwiktionarywiki-reverted.revscoring-editquality-reverted.wikimedia.org" --http1.1 only gives me "{"error": "Could not decode as JSON:\n"}"
[13:41:00] weird, did you check kserve-container's logs?
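For context, a minimal sketch of the kind of smoke test mentioned above, reusing the curl pattern from the log; the wiki list and the single editquality-reverted namespace are illustrative assumptions, not the actual testing tool:

    # Sketch of a smoke test over a few revscoring-editquality-reverted services.
    # Assumptions: the wiki list is illustrative; endpoint and Host pattern are taken from the log above.
    for wiki in enwiki enwiktionarywiki frwikisource; do
      model="${wiki}-reverted"
      curl -s "https://inference.discovery.wmnet:30443/v1/models/${model}:predict" \
        -d '{ "rev_id": 123456 }' --http1.1 \
        -H "Host: ${model}.revscoring-editquality-reverted.wikimedia.org" \
        -o /dev/null -w "${model}: %{http_code} (%{time_total}s)\n"
    done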
[13:48:34] ok so this is what I checked
[13:48:45] root@deploy1002:~# kubectl logs enwiktionarywiki-reverted-predictor-default-4lctq-deploymebkr85 -n revscoring-editquality-reverted istio-proxy | grep api --color
[13:48:55] afaics this is the call being made to api-ro:
[13:49:05] curl -s "https://api-ro.discovery.wmnet/w/api.php?action=query&prop=revisions&revids=123555&rvslots=main&rvprop=contentmodel%7Csize%7Cids%7Ccomment%7Cuserid%7Ctimestamp%7Ccontent%7Cuser&format=json" --header "Host: en.wikitionary.org"
[13:49:15] this leads to a HTTP 404 in HTML, not json
[13:49:24] and hence the error above
[13:52:00] on one appserver I correctly see this vhost
[13:52:00] port 80 namevhost wiktionary (/etc/apache2/sites-available/wiktionary.org.conf:1)
[13:52:03] wild alias *.wiktionary.org
[13:52:06] mmmm
[13:52:35] ah no wait I am stupid, I mistyped the Host header
[13:52:57] it seems to return json though
[13:57:57] hmm, I don't see my requests causing anything to be logged on the container
[13:59:31] what do you mean?
[14:00:27] Well, if I run that curl command I shared, I'd expect _something_ to show up in the container logs
[14:00:46] I saw errors so something is logged, where are you checking?
[14:00:54] (I think I found the problem)
[14:01:13] from ml-serve-ctrl1001
[14:02:47] mmm but what specifically are you checking there? I usually go either on logstash or on deploy1002
[14:02:51] (for containers I mean)
[14:03:16] it's where all my curl history is :)
[14:03:22] Maybe I should transplant that
[14:03:49] So my curl runs are on 1001, the kubectl I run from deploy
[14:04:23] ah yes okok, I meant where are you checking container logs though (I saw the "I don't see my requests causing anything to be logged on the container
[14:04:35] " and I was trying to get where you checked)
[14:05:16] on deploy as root with kube_env set to serve-eqiad: `kubectl logs enwiktionarywiki-reverted-predictor-default-4lctq-deploymebkr85 -n revscoring-editquality-reverted istio-proxy -f | grep 10.64.16.202`
[14:05:29] the IP is the one of ml-serve-ctrl1001
[14:05:48] The curl now works, what did you change?
[14:06:09] sending a code review in a sec
[14:06:19] ah okok now it is more clear :)
[14:06:39] the istio proxy logs are logging outbound requests to api-ro IIRC
[14:09:33] klausman: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865078
[14:09:47] this is basically what I changed (atm only for wiktionary)
[14:09:58] basically it was missing the rule to proxy that to api-ro
[14:10:13] ah, I see
[14:10:40] LGTM'd
[14:12:22] ah there is also wikisource.org
[14:12:56] Yes, the failures I saw for frwikisource are probably related to that
[14:13:38] klausman: I am going to add all the domains to ml-serve-eqiad manually now, can you re-run your test and see if anything fails?
[14:14:04] sure, sec
[14:14:20] done
[14:14:24] it should be ready now
[14:17:29] frwikisource-articlequality still times out.
[14:17:58] translate-reverted fails with an upstream failure from MWAPI
[14:18:13] everything else looks fine
[14:22:05] so translate has translate.wikipedia.org, maybe the domain is wrong
[14:22:35] Oddly enough, a manual curl just hangs. I'll keep investigating
[14:22:38] yeah it doesn't exist
[14:23:21] what doesn't exist? translate.wikipedia.org?
[14:23:49] https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Current_Inference_Services I got that sub-model from here, maybe the docs are outdated or wrong?
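For reference, the same MW API call as in the log with the Host header spelled correctly (en.wiktionary.org rather than en.wikitionary.org); run directly against api-ro this returns JSON, matching the "it seems to return json" observation above, while the missing proxy rule was what kept the pod itself from reaching api-ro:

    # Same request as in the log, with only the Host header typo fixed.
    curl -s "https://api-ro.discovery.wmnet/w/api.php?action=query&prop=revisions&revids=123555&rvslots=main&rvprop=contentmodel%7Csize%7Cids%7Ccomment%7Cuserid%7Ctimestamp%7Ccontent%7Cuser&format=json" \
      --header "Host: en.wiktionary.org"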
[14:24:27] yeah if you check the domain it doesn't exist
[14:24:33] but it is an error on our side
[14:24:34] also
[14:24:35] extractor_utils:165] An error has occurred while fetching feature values from the MW API: Cannot connect to host incubator.wikimedia.org:443 ssl:default [Connection reset by peer]
[14:24:45] eeep
[14:24:54] this is for wikisource
[14:25:14] never seen incubator.wikimedia.org, we don't inject it in the mwapicache
[14:25:30] lemme open a task about the two
[14:25:33] It sounds like a staging/testing thing.
[14:27:36] Outlink is also currently broken, but that's a typo in the APIGW config that Hugh and I have addressed (still needs to be deployed)
[14:28:15] Just confirmed: all the special wikis besides frwikisource and translate now work fine when accessed through the API GW
[14:28:32] Machine-Learning-Team: Fix translatewiki-reverted and frwikisource-articlequality isvcs - https://phabricator.wikimedia.org/T324567 (elukey)
[14:28:47] created --^
[14:28:51] merci!
[14:29:29] also deploying the knative change so we are good
[14:29:39] :+1: thanks for your help!
[14:34:22] done!
[14:34:40] so the remaining two failing wikis need to be handled in the task
[14:35:49] * elukey back for the meeting
[14:36:56] ttyl
[14:37:39] aiko: for the outlink-topic model, what would be a good (sub)set of languages to check to see if it works?
[14:43:45] klausman: as I understand it, the above error is the same one described in this task, correct? https://phabricator.wikimedia.org/T322196
[14:44:07] I encountered some errors especially in articlequality iirc
[14:44:36] Machine-Learning-Team, Patch-For-Review: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (isarantopoulos) I didn't see any timeouts from benthos logs and I forgot to mention above that all these metrics are only for response code 200 as read from the kserve/pod log...
[14:45:40] different topic: I see in grafana that there is a pod limit set to 3.15GB while we have set requests and limits to 2GB. Is it something else that is being calculated over there, or did it end up allocating 3.15GB?
[14:45:40] https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=revscoring-drafttopic&var-pod=enwiki-drafttopic-predictor-default-nknc9-deployment-785b6gg8fr&from=1670243139234&to=1670244344075
[14:45:48] isaranto: It may be, hard to say
[14:46:12] we'll monitor it with this fix
[14:46:15] For one thing, I have not encountered this with the "normal" two-character wikis (enwiki etc)
[14:50:34] klausman: o/ It supports all languages in Wikipedia, so you can test with en, es, fr, de, ... any wiki. The input requires a "lang" and a "rev_id"
[14:53:13] and an article name :)
[14:53:30] 2022/12/06 15:52:40 200 OK q:{ "rev_id": 123456, "lang": "en", "page_title": "foo" } (resp: 189 bytes, 447.674904ms)
[14:53:31] ahh yes
[14:53:32] 2022/12/06 15:52:41 200 OK q:{ "rev_id": 123456, "lang": "de", "page_title": "kartoffel" } (resp: 193 bytes, 417.499669ms)
[14:53:34] 2022/12/06 15:52:41 200 OK q:{ "rev_id": 123456, "lang": "ru", "page_title": "Злотник" } (resp: 337 bytes, 393.545158ms)
[14:53:46] it's page_title
[14:53:49] not rev_id
[14:54:04] So rev_id is superfluous? good to know
[14:54:21] yeah rev_id is not needed
[14:54:46] confirmed :) thx!
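Putting the confirmed input shape together, a sketch of an outlink-topic request; the model name and namespace used in the Host header are assumptions, and only the payload (a "lang" plus a "page_title", no "rev_id" required) is confirmed by the log above:

    # Hypothetical service/namespace names; the payload shape is the part confirmed in the log.
    curl -s "https://inference.discovery.wmnet:30443/v1/models/outlink-topic-model:predict" \
      --http1.1 -H "Host: outlink-topic-model.articletopic-outlink.wikimedia.org" \
      -d '{ "lang": "en", "page_title": "Toni Morrison" }'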
[14:59:23] isaranto: in theory the pod's 3.15 value should be the sum of the resource limits of all the containers
[14:59:48] the 2G one that you mentioned is only for the kserve-container
[14:59:59] Ah yes you're right, thanks!
[16:28:02] hey, I created a patch to increase RAM in some model servers -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/865104. lemme know if it is ok!
[16:51:51] merged :)
[16:52:55] going afk for today folks!
[17:00:03] Night elukey!
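A quick way to check that per-container breakdown (a sketch, using the staging pod name from the Grafana link above; run with kubectl pointed at the right cluster):

    # List per-container memory limits; their sum is the pod-level ~3.15GB figure
    # (2G for the kserve-container plus the sidecars such as istio-proxy/queue-proxy).
    kubectl -n revscoring-drafttopic get pod \
      enwiki-drafttopic-predictor-default-nknc9-deployment-785b6gg8fr \
      -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources.limits.memory}{"\n"}{end}'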