[03:22:42] (ErrorBudgetBurn) firing: - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:34:21] Machine-Learning-Team: Deploy ctranslate2 version of nllb-200 - https://phabricator.wikimedia.org/T351740 (santhosh) > this will allow the language team to use this model server Please coordinate with us on this. MinT has an evolving sophisticated layer of pre-processing/post-processing steps per language....
[06:39:22] Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (kevinbazira)
[06:39:30] Machine-Learning-Team: Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing - https://phabricator.wikimedia.org/T348607 (kevinbazira) Open→Resolved The rec-api-ng container is now able to access endpoints external to k8s/LiftWing even after the wikimedi...
[07:02:30] Machine-Learning-Team: Deploy the recommendation-api-ng on LiftWing - https://phabricator.wikimedia.org/T347015 (kevinbazira)
[07:02:46] Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (kevinbazira) Open→Resolved Thank you all for your help with this issue. The rec-api-ng internal endpoint failures were resolved. We will continue to monitor this issue in ca...
[07:22:42] (ErrorBudgetBurn) firing: - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[07:51:12] Machine-Learning-Team: Optimize response performance for the article-descriptions model-server - https://phabricator.wikimedia.org/T353127 (kevinbazira) Thank you for sharing these benchmarks, @Isaac. I have made the same requests in the ML sandbox with 8 CPUs and here are the results: ## Median (standard)...
[08:05:03] Moorning!
[09:03:13] morning!
[10:05:26] o/ back online!
[10:10:13] coffee and paracetamol do wonders :)
[10:10:37] yes but take it easy :)
[10:31:42] isaranto: is it ok if I restart messing up with RR-agnostic in staging?
[10:32:41] feel free to do whatever you want with it! I'm going to work on a local run first so I don't need it at the moment
[10:42:05] Machine-Learning-Team: Optimize response performance for the article-descriptions model-server - https://phabricator.wikimedia.org/T353127 (kevinbazira) I have run the same tests on LiftWing with 8 CPUs and no quantization. Here are the results: ## Median (standard) ` kevinbazira@deploy2002:~$ time curl "ht...
[10:43:37] morning!
[10:44:59] Machine-Learning-Team: Deploy ctranslate2 version of nllb-200 - https://phabricator.wikimedia.org/T351740 (isarantopoulos) Hey @santhosh, thank you for providing the background information on MinT. We can definitely coordinate if you have any requests for a specific model that you would like us to support....
[10:45:27] o/ AIko!
[10:46:15] hi Ilias!
[10:46:15] the capital I was not intentional 😛
[10:46:41] heheh
[10:54:32] Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, Wikipedia-Android-App-Backlog, and 2 others: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 (PatchDemoBot) Test wiki on [[ https://patchdemo.wmflabs.org | Patch demo ]] by ISaran...
[10:55:39] Machine-Learning-Team, MediaWiki-extensions-ORES, Growth-Team, Wikipedia-Android-App-Backlog, and 2 others: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 (PatchDemoBot) Test wiki **created** on [[ https://patchdemo.wmflabs.org | Patch demo ]...
[10:57:48] elukey: q - I didn't see the new test stream shown in kafka topics 🤔 when I use kafkacat -L .. did I miss sth?
[11:00:02] aiko: o/ I think that you need one message to be delivered to see the topic auto-created
[11:00:11] on what cluster did you check?
[11:00:48] elukey: ahhh kafka-main1001.eqiad.wmnet:9093
[11:01:01] on stat1007
[11:02:54] okok
[11:03:14] so ml-staging is in codfw and it will produce events to eventgate-main codfw, that will push the message to main-codfw
[11:03:24] but the message will be replicated to eqiad as well
[11:03:39] remember that the kafka topics will have the {eqiad,codfw} prefix etc..
[11:03:53] so if you check via kafka-main1001 you'll have to use codfw.etc.. as topic name
[11:06:26] okkk got it. I sent a test event and now I see the topic :)
[11:09:29] nice :)
[11:12:52] and q - what is the difference using port 9092 and 9093? I found using 9093 I need to specify wmf-ca-certificates but using 9092 I don't need to
[11:15:08] so 9092 is the plaintext one, 9093 is the TLS port
[11:15:55] isaranto: I found a way to use WIKI_URL=http://en.wikipedia.org (not https yet) in RR Agnostic
[11:16:08] so without the need to specify api-ro.discovery.wmnet
[11:17:02] it is progress but if a redirect comes with https:// we may need more rules in the virtual service
[11:19:37] we could add a specific bit that if https is used, then it is proxied to api-ro anyway
[11:19:42] but we'd lose metrics
[11:19:52] HTTP metrics I mean
[11:20:04] since the python code would originate a TLS/HTTPS conn itself
[11:20:18] and the istio proxy wouldn't be able to inspect headers etc.. at L7
[11:22:42] (ErrorBudgetBurn) firing: - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[11:27:01] nice!
[11:41:07] I'm still trying to do sth with the extension (change the order of the filters)
[11:43:17] Machine-Learning-Team, Patch-For-Review: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing - https://phabricator.wikimedia.org/T353622 (elukey) I was able to make Revert Risk agnostic to work with `WIKI_URL=http://en.wikipedia.org` instead of `WIKI_URL...
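A minimal sketch of the two Kafka points discussed earlier (datacenter-prefixed topic names, and the 9092/9093 port split). The helper names `prefixed_topic` and `broker_port` and the stream name `example.test_stream` are hypothetical, chosen only for illustration; they are not part of any existing tooling.

```python
# Illustrative only: helper names and the example stream name are hypothetical.
DATACENTERS = ("eqiad", "codfw")

PLAINTEXT_PORT = 9092  # plaintext listener, as described in the chat
TLS_PORT = 9093        # TLS listener; clients need the CA certificates


def prefixed_topic(datacenter: str, stream: str) -> str:
    """Return the topic name as seen on the brokers: '<datacenter>.<stream>'."""
    if datacenter not in DATACENTERS:
        raise ValueError(f"unknown datacenter: {datacenter!r}")
    return f"{datacenter}.{stream}"


def broker_port(use_tls: bool) -> int:
    """Pick the broker port depending on whether the client speaks TLS."""
    return TLS_PORT if use_tls else PLAINTEXT_PORT


if __name__ == "__main__":
    # An event produced in codfw is looked up under the codfw prefix,
    # even when querying a broker in eqiad.
    print(prefixed_topic("codfw", "example.test_stream"))
    print(broker_port(use_tls=True))
```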
[11:43:33] yep yep no hurry, I was just reporting back
[11:48:58] (PS1) Elukey: revert-risk: add the option to force HTTP traffic without WIKI_URL set [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/984144 (https://phabricator.wikimedia.org/T353622)
[11:50:39] * elukey lunch!
[11:53:18] I added a deployment to add revertrisk to testwiki this afternoon https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231219T1400 🤞
[12:01:03] * isaranto lunch
[13:08:37] Machine-Learning-Team: Apply common settings to publish events from Lift Wing staging to EventGate - https://phabricator.wikimedia.org/T349919 (achou) I tested the new testing stream for prediction-change events for outlink model server. 1. First, I collected an event from `mediawiki.page_change.v1` ` $ cat...
[13:14:03] * aiko lunch!
[13:30:57] back
[13:31:07] folks I am enabling ipv6/ipv4 dual stack in staging
[13:31:16] this is something that I will not push to prod of course
[13:31:30] it shouldn't change anything, but it is something that wikikube has
[13:31:42] and it interferes with the new settings for Istio
[13:31:59] so if the testing ends up ok in staging, as I think, we'll be able to deploy in January
[13:36:15] (CR) Ilias Sarantopoulos: [C: +1] revert-risk: add the option to force HTTP traffic without WIKI_URL set (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/984144 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[13:38:10] of course it didn't work, staging is impaired :D
[13:38:22] ack! will it affect the current services in any way or is it just that we'll be able to use ipv6?
[13:38:43] in theory the latter
[13:44:46] Morning all
[13:45:09] o/
[13:45:41] o/
[13:47:22] (ErrorBudgetBurn) resolved: - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[13:49:24] (PS2) Elukey: revert-risk: add the option to force HTTP traffic without WIKI_URL set [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/984144 (https://phabricator.wikimedia.org/T353622)
[13:49:34] (CR) Elukey: revert-risk: add the option to force HTTP traffic without WIKI_URL set (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/984144 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[13:50:26] (CR) Ilias Sarantopoulos: [C: +1] revert-risk: add the option to force HTTP traffic without WIKI_URL set (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/984144 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[13:58:09] (CR) Elukey: [C: +2] revert-risk: add the option to force HTTP traffic without WIKI_URL set [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/984144 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[13:58:13] thanks for the review :)
[14:02:14] (Merged) jenkins-bot: revert-risk: add the option to force HTTP traffic without WIKI_URL set [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/984144 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[14:03:03] Machine-Learning-Team, Data-Platform-SRE, Infrastructure-Foundations: Fix IPv6 service IP ranges for all Kubernetes clusters - https://phabricator.wikimedia.org/T353705 (elukey)
[14:03:12] and this was the issue --^
[14:28:56] Machine-Learning-Team, Data-Platform-SRE, Infrastructure-Foundations: Fix IPv6 service IP ranges for all Kubernetes clusters - https://phabricator.wikimedia.org/T353705 (elukey)
[14:33:22] revertrisk is on testwiki! https://test.wikipedia.org/wiki/Special:RecentChanges?hidebots=1&translations=filter&hidecategorization=1&hideWikibase=1&limit=50&days=7&urlversion=2
[14:33:50] wow!
[14:40:46] \o/
[14:41:33] still checking if everything is ok though
[14:44:45] Machine-Learning-Team, artificial-intelligence, revscoring: tag being wrongly counted as a ref tag - https://phabricator.wikimedia.org/T353661 (Gabinaluz) Thank you. Makes sense, I don't think this is high priority.
[14:48:58] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/984203
[14:49:19] outlink load test and publishing events to eventgate work with no issues. it should be ready for prod ---^
[14:49:25] Machine-Learning-Team, Data-Platform-SRE, Infrastructure-Foundations: Fix IPv6 service IP ranges for all Kubernetes clusters - https://phabricator.wikimedia.org/T353705 (elukey) New ML staging range: https://netbox.wikimedia.org/ipam/prefixes/887/ New ML Serve codfw range: https://netbox.wikimedia.o...
[15:24:13] Machine-Learning-Team, artificial-intelligence, revscoring: tag being wrongly counted as a ref tag - https://phabricator.wikimedia.org/T353661 (isarantopoulos) Open→Declined
[15:28:25] Machine-Learning-Team, Patch-For-Review: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing - https://phabricator.wikimedia.org/T353622 (calbon)
[15:29:02] Machine-Learning-Team: Investigate how to improve model card integration with existing user flows - https://phabricator.wikimedia.org/T353025 (calbon)
[15:30:59] Machine-Learning-Team: Investigate how to improve model card integration with existing user flows - https://phabricator.wikimedia.org/T353025 (calbon)
[15:45:12] Machine-Learning-Team, Project-Admins: Create three Phab Projects for Machine Learning: Lift Wing, Pilot Flag, Test Grounds - https://phabricator.wikimedia.org/T264774 (calbon) Open→Declined
[16:06:09] Machine-Learning-Team, Patch-For-Review: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing - https://phabricator.wikimedia.org/T353622 (elukey) Thanks to T353705 we have now the correct ipv6 ranges in puppet, and I was able to enable dual stack in stag...
[16:13:32] Machine-Learning-Team: Investigate how to improve model card integration with existing user flows - https://phabricator.wikimedia.org/T353025 (Isaac) Also FYI this article on English Wikipedia where we might want to make some suggestions on the Talk Page about other references that editors might incorporate:...
[16:30:05] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Enforce json payload in existing kserve model servers - https://phabricator.wikimedia.org/T352834 (achou) The changes have been deployed to outlink-topic-model server. :)
[16:31:27] https://github.com/istio/istio/issues/46625
[16:31:32] * elukey cries in a corner
[16:33:30] at least it is solved, right?
[16:33:41] yeah in 1.18, we have 1.15 :D
[16:33:50] I have already a workaround, but..
[16:34:08] Machine-Learning-Team: Upgrade outlink docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347549 (achou) Outlink-topic-model server on production has been upgraded to KServe 0.11.2. :)
[16:34:19] aiko: niceeee --^
[16:34:21] nice aiko!
[16:35:51] o/ :D
[16:37:53] Lift-Wing, Machine-Learning-Team: Enforce json payload in existing kserve model servers - https://phabricator.wikimedia.org/T352834 (isarantopoulos) Thanks Aiko! Resolving this ticket then!
[16:38:20] Lift-Wing, Machine-Learning-Team: Enforce json payload in existing kserve model servers - https://phabricator.wikimedia.org/T352834 (isarantopoulos) In progress→Resolved
[16:40:50] elukey: do you want/need me to review the patches you sent now?
[16:42:04] isaranto: nono tomorrow is fine!
[16:42:27] the main question mark now is how to fix the redirect issue with the new setup
[16:42:30] if it simplifies or not
[16:43:12] in theory it may not, but I have an idea to experiment tomorrow about handling via istio https calls
[16:45:05] if you have doubts reviewing those lemme know so I can explain :)
[16:45:10] Machine-Learning-Team: Upgrade outlink docker images to KServe 0.11 - https://phabricator.wikimedia.org/T347549 (achou) Open→Resolved
[16:45:11] I'll add a summary in the task
[16:45:57] I may need some guidance with the second one
[16:52:05] yes yes it is not 100% clear I know
[16:52:12] do you have any specific bit that is obscure?
[16:59:12] coredns is deployed in kube-system and it acts as DNS resolver in all k8s clusters
[16:59:20] we can tweak its settings etc..
[16:59:39] and in this case, I am explicitly telling it to avoid AAAA records (carrying ipv6 addresses)
[16:59:52] so en.wikipedia.org will always be resolved to its ipv4 address
[17:00:00] (and all the other ones)
[17:00:19] this will avoid incurring the Istio CNI bug etc..
[17:01:50] thanks for the explanation!
[17:03:24] so to sum up the patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/984250 disables AAAA records and replaces them with A records for all isvcs
[17:05:31] it is already what happens, since we mostly resolve discovery records etc.. that have only ipv4 records
[17:05:42] but we force new stuff to stay the same
[17:05:49] until we have a full ipv6 stack working :(
[17:07:18] on my end I played a bit with the redirects in mwapi. It can be done but it is a bit messy. I'll continue tomorrow morning and update the task
[17:07:48] yeah I figured :)
[17:07:49] going afk now, coming back fresh tomorrow!
[17:07:54] o/
[17:08:04] going afk as well, have a good rest of the day folks!
[17:18:07] bye luca and Ilias o/ have a nice evening!
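A rough sketch of the effect of the AAAA suppression described above, assuming a resolver that simply drops IPv6 answers so clients only ever see A records. This is not the CoreDNS configuration from the patch; the `Record` type, the `drop_aaaa` helper, and the addresses (taken from the documentation-only ranges) are hypothetical.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """A simplified DNS answer: hostname, record type, and address."""
    name: str
    rtype: str  # "A" for IPv4, "AAAA" for IPv6
    address: str


def drop_aaaa(records: List[Record]) -> List[Record]:
    """Keep only non-AAAA answers, so a dual-stack name resolves to IPv4 only."""
    return [r for r in records if r.rtype != "AAAA"]


if __name__ == "__main__":
    # Hypothetical dual-stack answer using documentation-only addresses.
    answers = [
        Record("en.wikipedia.org", "A", "192.0.2.10"),
        Record("en.wikipedia.org", "AAAA", "2001:db8::10"),
    ]
    for record in drop_aaaa(answers):
        print(record.name, record.rtype, record.address)
```

Names that already have only IPv4 records pass through unchanged, which matches the remark in the chat that this is already what happens for discovery records.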
[18:28:26] Machine-Learning-Team, Data-Engineering, Event-Platform: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399 (Ottomata)
[18:40:33] Machine-Learning-Team, Research: Allow to set Catboost's threads in readability-liftwing - https://phabricator.wikimedia.org/T353461 (Miriam) @elukey hi! Thanks for this! Just checking if you need any input from Research on this task?
[20:57:45] Machine-Learning-Team, ORES: Replace use of $wgCommandLineMode in ORES - https://phabricator.wikimedia.org/T353750 (matmarex)
[20:58:18] Machine-Learning-Team, ORES: Replace use of $wgCommandLineMode in ORES - https://phabricator.wikimedia.org/T353750 (matmarex)
[21:15:46] Machine-Learning-Team: Optimize response performance for the article-descriptions model-server - https://phabricator.wikimedia.org/T353127 (Isaac) @kevinbazira thanks for the updates and additional test points! Update re Cloud VPS API: I realized that the wmcloud API is running an older form of the library...