[07:43:44] 10Machine-Learning-Team, 10Patch-For-Review: use wikiID in inference name on LW for revscoring models - https://phabricator.wikimedia.org/T342266 (10isarantopoulos) 05Open→03In progress
[07:44:06] 10Machine-Learning-Team, 10Patch-For-Review: use wikiID in inference name on LW for revscoring models - https://phabricator.wikimedia.org/T342266 (10isarantopoulos) a:03isarantopoulos
[09:28:48] o/
[09:29:31] I finally managed to render the correct helm template. The patch is ready for review https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/939744
[09:30:34] oops, just found a bug 🐛
[09:48:36] and fixed it!
[09:51:22] 10Machine-Learning-Team: Revert Risk multi-lingual model performance and reliability may need a review - https://phabricator.wikimedia.org/T340822 (10klausman) I have done some testing today: - Run 100 workers (Goroutines) in parallel - Fetch https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-l...
[09:53:42] can I do all the changes in the value files in the same patch, or do they have to be in a subsequent one?
[10:02:51] I am not sure, honestly
[10:09:57] 10Machine-Learning-Team: Revert Risk multi-lingual model performance and reliability may need a review - https://phabricator.wikimedia.org/T340822 (10elukey) @klausman nice analysis! The UnboundLocalError should be fixed by the patches in T341008, but IIUC Aiko didn't deploy it yet to production. Could you pleas...
[10:12:44] isaranto: we can keep it split, as long as we merge them together
[10:14:02] elukey: I've never built a prod image, it may take me some time :)
[10:14:24] klausman: what do you mean? :)
[10:15:09] Your comment re RR images. Or was the image already built by CI?
[10:15:40] ah yes, sorry, it is already built after the merge, it is just a matter of updating deployment-charts and deploying
[10:15:51] maybe to staging first to verify that the fix works etc..
[10:15:55] okay, that I know how to do :)
[10:15:58] I think Aiko didn't have time to do it
[10:16:00] thanks :)
[10:16:02] of course (re: staging)
[10:16:24] I write out all my thoughts so we're always on the same page; I've forgotten basic tests so many times
[10:20:12] Hm. There are three recent images that haven't been deployed: 07-05, 07-18 and 07-19 (deployed is 06-40). I'll just use the latest
[10:24:36] ack yes
[10:25:26] going afk for lunch!
[10:31:07] hmm, on second thought, if I do all the changes in the same patch we will see in CI only the differences that we are expecting, e.g. eswikibooks instead of eswikibookswiki. Otherwise we now see a lot of changes and the patch by itself is wrong.
[10:32:07] I was mostly curious whether CI would fail because it would use the old chart, but as I see it uses the new version
[10:32:37] I'll do the changes in the same patch a bit later. If anyone disagrees with my train of thought, let me know before then
[10:41:16] staging version of RR-LA looks good, now deploying to serve-codfw
[10:51:21] * isaranto lunch
[10:53:08] codfw looks good as well, including scaling up replicas
[10:55:58] Pushed to codfw, will then run a longer load test (25 parallel conns going as fast as they can, for 5 minutes, and maybe 10+m later). So far, no unexpected codes (400s for missing revisions, but that is expected/desired).
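For reference, a minimal Go sketch of a load test along the lines described above: many goroutines POSTing randomized rev-ids to the revertrisk endpoint and tallying status codes. The exact endpoint path, request body shape, and lack of auth handling are assumptions for illustration only, not the actual harness used; the worker count, rev-id range, and duration just echo numbers from the chat.

```go
// loadtest.go — hedged sketch of a concurrent load test against the
// revertrisk-language-agnostic endpoint. URL and payload are assumed.
package main

import (
	"bytes"
	"fmt"
	"math/rand"
	"net/http"
	"sync"
	"time"
)

const (
	// Assumed endpoint; the real model path is truncated in the log above.
	endpoint = "https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-language-agnostic:predict"
	workers  = 100
	duration = 5 * time.Minute
)

func main() {
	var (
		mu     sync.Mutex
		counts = map[int]int{}
		wg     sync.WaitGroup
	)
	deadline := time.Now().Add(duration)

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			client := &http.Client{Timeout: 30 * time.Second}
			for time.Now().Before(deadline) {
				// Randomized rev-ids in the 123000-123999 range, as in the chat.
				revID := 123000 + rand.Intn(1000)
				body := fmt.Sprintf(`{"rev_id": %d, "lang": "en"}`, revID) // assumed payload shape
				resp, err := client.Post(endpoint, "application/json", bytes.NewBufferString(body))
				if err != nil {
					continue
				}
				resp.Body.Close()
				mu.Lock()
				counts[resp.StatusCode]++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	// Print a per-code breakdown similar to the one pasted on T340822.
	total := 0
	for _, n := range counts {
		total += n
	}
	for code, n := range counts {
		fmt.Printf("%d: %d (%.2f%%)\n", code, n, 100*float64(n)/float64(total))
	}
	fmt.Printf("Tot: %d  Throughput: %.2f req/s\n", total, float64(total)/duration.Seconds())
}
```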
[11:11:32] 10Machine-Learning-Team: Revert Risk multi-lingual model performance and reliability may need a review - https://phabricator.wikimedia.org/T340822 (10klausman) With the fixed version deployed to all clusters, I ran a load test again. Note that the throughput would likely be higher with more workers (i.e. it's li...
[12:14:51] 10Machine-Learning-Team: Revert Risk multi-lingual model performance and reliability may need a review - https://phabricator.wikimedia.org/T340822 (10klausman) With 400(!) workers, for 5m: `Code # % ---------------------- 200: 218914 80.28 201: 53607 19.66 503: 166 0.06 Tot: 272687 Throughput: 908.96...
[12:31:36] klausman: o/ one qs - did you use multiple rev-ids in the load test?
[12:31:54] Yes, randomized
[12:32:04] 123000 - 123999
[12:35:16] elukey: do we have rate limits for network bandwidth?
[12:35:44] because the pods seem to flatline against ~1M/s
[12:35:52] https://grafana.wikimedia.org/goto/lW1zt9CVk?orgId=1
[12:36:49] I get access denied to the link
[12:37:12] anyway, no, we don't have rate limits for bandwidth (they would be very difficult to enforce)
[12:37:24] ack.
[12:37:34] why you'd get access denied is a bit wild, tho
[12:38:45] I get a 404 for https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1
[12:39:00] what dashboard are you using?
[12:46:32] hang on
[12:46:39] https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revertrisk&var-pod=revertrisk-language-agnostic-predictor-default-00008-deplo4nfqw&var-container=All&from=now-6h&to=now
[12:46:43] That is the full URL
[12:47:00] and the URL I pasted and you just pasted are not the same!?
[12:47:28] It may be that this dashboard can't be seen on grafana-ro, and so the shortened link breaks
[12:54:32] now it works
[12:56:15] the 1Mbit flatline may not be there, but it does look suspicious. But if we don't have any net rate limits, that's a red herring for the still-happening (though rare) 503s I see
[12:56:38] it is 1MBps though
[12:57:43] sorry, yes
[12:58:33] worth keeping in mind, not sure if it could affect the 503s though
[12:58:42] The rate of 500s is typically <1%, but that is still more than I'd like.
[12:59:02] The pod-side logs are not conclusive so far.
[13:00:19] isaranto: regarding testing before deploy, we first deploy to mwdebug hosts and you can test there
[13:01:38] Amir1: ack. I found out there is a fix we need to do before we deploy to eswikibooks and eswikiquote. Will let you know once this is done and I'll open a patch in mediawiki-config
[13:02:00] sure. Awesome
[13:02:50] elukey: for the 500s, I see "response_flags": "DC" (aka Downstream connection termination). Is that towards the client or the server in this case?
[13:03:00] AIUI, it should be towards the client.
[13:04:53] yes, it is a downstream close, towards the client, likely some timeout or similar
[13:05:19] Ok, so it's neither the kserve nor the istio-proxy container that drops the conn. Now checking queue-proxy
[13:05:51] in theory yes
[13:07:00] As for performance, I regularly see north of 850qps without errors, so scalability is definitely there
[13:07:19] with how many pods?
[13:07:25] 13, autoscaled
[13:07:34] okok
[13:07:55] we could also think about reviewing the concurrency range for revert risk agnostic
[13:08:01] scale up is pretty quick, less than 10s if the host is warm. Spindown takes minutes (as we discussed)
[13:09:18] The queue-proxy doesn't log anything at all, so far. But also no 500s in the last few runs. I'll keep hammering :)
[13:11:36] aha! stack trace from queue proxy.
[13:13:42] It's a timeout.
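Related to the 503 hunt above, a small hedged sketch of how the proxy JSON access logs could be tallied by status code and response_flags, so the rare DC entries stand out. The logs are assumed to be piped in on stdin (e.g. via kubectl logs on the istio-proxy or queue-proxy container), and the field names are assumptions based on the excerpts quoted in T340822; they may differ in the real log format.

```go
// tally.go — hedged sketch: count access-log entries by response code and
// response_flags to isolate the rare DC/503 cases discussed above.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

func main() {
	counts := map[string]int{}
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // some access-log lines are long
	for sc.Scan() {
		var entry map[string]interface{}
		if err := json.Unmarshal(sc.Bytes(), &entry); err != nil {
			continue // skip non-JSON lines
		}
		code := fmt.Sprintf("%v", entry["response_code"]) // assumed field name
		flags := fmt.Sprintf("%v", entry["response_flags"])
		counts[code+" "+flags]++
	}
	for k, n := range counts {
		fmt.Printf("%-12s %d\n", k, n)
	}
}
```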
[13:16:04] https://phabricator.wikimedia.org/P49608
[13:18:11] ack, keep going :)
[13:18:26] 10Machine-Learning-Team: Revert Risk multi-lingual model performance and reliability may need a review - https://phabricator.wikimedia.org/T340822 (10klausman) I am still seeing (rare) 503s, and in the queue proxy pod this is logged: `lang=json { "severity": "ERROR", "timestamp": "2023-07-20T13:10:08.076062...
[13:20:23] I have updated the patch with the chart changes. Downside: too many changes in the value files. Upside: minor changes in the end result https://integration.wikimedia.org/ci/job/helm-lint/11833/console . I don't know if you folks agree, but to me it seemed a bit safer this way
[13:20:52] sorry for the many changes though
[13:21:29] isaranto: I'll check in a bit!
[13:21:42] does the change trigger a new host header?
[13:22:09] yes, but only for eswikiquote and eswikibooks, where the -wiki suffix is removed
[13:22:51] take your time, it is not urgent, we're not going to deploy to these wikis until next week (since deployments don't happen on Fridays)
[13:26:38] ahh okok
[13:53:06] 10Machine-Learning-Team: Revert Risk multi-lingual model performance and reliability may need a review - https://phabricator.wikimedia.org/T340822 (10klausman) In the ingress gateway, these 503s look like this: `lang=json { "downstream_remote_address": "10.67.17.212:52618", "upstream_cluster": "outbound|80|...
[14:00:19] I think I unblocked the knative deployment issue, for some reason the kube api wasn't able to connect to the webhook in time
[14:00:31] after reading some docs I had to
[14:00:44] kubectl delete ValidatingWebhookConfiguration config.webhook.serving.knative.dev -n knative-serving
[14:00:51] and then it worked
[14:01:10] * elukey mumbles about cloud native horrors
[14:09:48] 10Machine-Learning-Team: Fix UnboundLocalError in revert risk models - https://phabricator.wikimedia.org/T341008 (10elukey) 05Open→03Resolved Tobias deployed the new docker images today and they worked, we can close :)
[15:03:34] man, the more I chase these 500s, the rarer they become :D
[15:04:10] There is an odd difference between HTTP/1.1 and HTTP/2 requests
[15:05:30] yeah, a small bit will always be there, but if we remove the big part it is ok
[15:34:32] I think I'll let the test run at a small qps for a few hours and then look at the istio logs with fresh eyes
[15:48:18] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team: Establish process for periodically refreshing link recommendation models - https://phabricator.wikimedia.org/T327212 (10KHernandez-WMF)
[15:59:36] * elukey afk! o/
[17:58:28] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team: Establish process for periodically refreshing link recommendation models - https://phabricator.wikimedia.org/T327212 (10KHernandez-WMF) Hello, we're removing the Research tag on this task due to an internal clean-up process. If you would like the Research t...
[19:32:57] 10Machine-Learning-Team, 10Research: Index out of range in revert risk multi-lingual - https://phabricator.wikimedia.org/T340811 (10Isaac) @MunizaA -- I managed to separate the headers from sections in the code so it's much cleaner now and seems to run fine for your list of revisions with timeout set to True....
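On the HTTP/1.1 vs HTTP/2 difference noted at 15:04:10, a generic Go sketch of how a test client can be pinned to one protocol or the other for a side-by-side comparison. This is plain net/http usage, not the actual test code, and the URL is a placeholder.

```go
// clients.go — hedged sketch: compare an HTTP/1.1-only client with one
// that attempts HTTP/2, printing the negotiated protocol for each.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

func main() {
	// A non-nil, empty TLSNextProto map disables the automatic HTTP/2
	// upgrade on this transport, forcing HTTP/1.1 over TLS.
	h1 := &http.Client{Transport: &http.Transport{
		TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
	}}

	// ForceAttemptHTTP2 makes the transport try HTTP/2 where the server supports it.
	h2 := &http.Client{Transport: &http.Transport{ForceAttemptHTTP2: true}}

	for name, c := range map[string]*http.Client{"h1": h1, "h2": h2} {
		resp, err := c.Get("https://api.wikimedia.org/") // placeholder URL
		if err != nil {
			fmt.Println(name, "error:", err)
			continue
		}
		fmt.Println(name, "->", resp.Proto, resp.Status)
		resp.Body.Close()
	}
}
```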