[04:01:52] Thanks luca for the patch!
[04:01:58] o/
[04:02:12] Continuing with the rest of the deployments
[04:55:17] Machine-Learning-Team, Patch-For-Review: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (isarantopoulos) The kserve python package has been updated in all revscoring model servers to v0.11.1
[04:55:46] All servers updated!
[04:56:18] I filed a patch to remove the httpbb tests for deprecated model servers (eswikibookswiki etc)
[04:56:19] https://gerrit.wikimedia.org/r/c/operations/puppet/+/962752
[04:56:49] and I see a 500 in the test for fawiki articlequality, but when I try it manually it succeeds. Investigating...
[05:49:30] * isaranto commuting!
[06:15:36] Machine-Learning-Team, Add-Link, Growth-Team: Create a single table with evaluation metrics from all trained add-a-link models - https://phabricator.wikimedia.org/T343374 (kevinbazira) Open→Resolved
[06:15:38] Machine-Learning-Team, Add-Link, Growth-Team: Completion report on training 18 rounds of add-a-link models - https://phabricator.wikimedia.org/T336927 (kevinbazira)
[06:16:21] Machine-Learning-Team, Add-Link, Growth-Team: Completion report on training 18 rounds of add-a-link models - https://phabricator.wikimedia.org/T336927 (kevinbazira) Open→Resolved
[07:51:34] hello folks!
[07:51:37] nice work isaranto <3
[07:51:52] do you want me to wait for the puppet patch, to test fawiki?
[07:54:12] I am going to shut down ml-staging2001 to allow dcops to check its internals, so we'll know if the GPU MI50 fits
[07:54:24] ack
[07:55:10] o/ we can merge the httpbb patch (I don't have +2 on that repo). Regarding fawiki it seems ok now. don't know what the issue was
[07:56:13] merged!
[07:56:20] it seems that it was an issue with mwapi at the moment
[08:03:39] Machine-Learning-Team: Create external endpoint for recommendation-api-ng hosted on LiftWing - https://phabricator.wikimedia.org/T347263 (kevinbazira) Thank you for testing the internal endpoint @Isaac. We are investigating the cause of this issue in T347475 and a possible solution for it.
[08:11:22] \o morning
[08:12:16] Machine-Learning-Team, Patch-For-Review: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (kevinbazira) ## Investigation Report Without being able to fully test the recommendation-api on wmflabs and LiftWing, I ran a couple of experiments to investig...
[08:14:34] isaranto, elukey: o/ I shared findings from an investigation into the performance disparity between the endpoint of the recommendation-api hosted on wmflabs and LiftWing. Here's the report https://phabricator.wikimedia.org/T347475#9218749, in case you want to chime in.
[08:23:57] thanks kevin, looking at it now!
[08:33:12] kevinbazira: nice work! I'll take a look as well
[08:36:21] ok ml-staging2001 is down
[08:37:49] elukey: for the rec-api-ng, I presume we make all-new DNS names etc, not overwriting the old ones?
[08:38:14] klausman: exactly yes, the current VIP has its own traffic etc..
[08:43:29] (CR) Elukey: [C: +1] ores-legacy: return 400 on callback requests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/961980 (https://phabricator.wikimedia.org/T347663) (owner: Ilias Sarantopoulos)
[08:44:44] (CR) Ilias Sarantopoulos: [C: +2] ores-legacy: return 400 on callback requests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/961980 (https://phabricator.wikimedia.org/T347663) (owner: Ilias Sarantopoulos)
[08:47:48] (Merged) jenkins-bot: ores-legacy: return 400 on callback requests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/961980 (https://phabricator.wikimedia.org/T347663) (owner: Ilias Sarantopoulos)
[08:48:50] Machine-Learning-Team, Observability-Alerting, Patch-For-Review: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (elukey) >>! In T346151#9211161, @isarantopoulos wrote: > I started by adding an alert for the following query which I borrowed from the [[ https://grafana.wikimedia.org/d/L...
[08:51:50] I was reading https://www.lamini.ai/blog/lamini-amd-paving-the-road-to-gpu-rich-enterprise-llms and didn't find much about k8s or how they share GPUs
[08:54:55] I wanted to contact them but in the blogpost they try to sell their infra, so I suppose Lamini has some specialized layer that takes care of batching etc..
[08:55:06] they also have 128G of ram GPUs :D
[08:57:32] scary :)
[09:09:34] just deployed ores-legacy (disabled callback params)
[09:16:49] klausman: one qs - should we add the discovery etcd data first (before the DNS entries)
[09:17:23] Sure
[09:18:40] Hm, actually, I have a question about that
[09:20:00] On https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service under "etcd data for backend selection", it says that for k8s, this should not be done, but I presume the next step (etcd data for DNS Discovery) is still relevant (and that's what I did in patch 963009)
[09:21:38] can you post the links to the patch please :) ?
[09:22:15] https://gerrit.wikimedia.org/r/c/operations/dns/+/963007
[09:22:24] wrong c&p
[09:22:29] https://gerrit.wikimedia.org/r/c/operations/puppet/+/963009
[09:23:28] okok makes sense
[09:24:31] Ok, merged (both on gerrit and pm)
[09:42:57] Machine-Learning-Team, Observability-Alerting, Patch-For-Review: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (isarantopoulos) Ok! I have updated the alert by adding the kafka consumer lag. I also added one for container memory using the following query to reflect 90% memory usage:...
[09:54:20] Machine-Learning-Team, Section-Level-Image-Suggestions, Patch-For-Review, Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (CodeReviewBot) mfossati opened https://gitlab.wikimedia.org/repos/structured-data/seal...
[10:10:41] * klausman lunch
[10:17:11] * elukey lunch as well
[10:42:30] * isaranto lunch!
[12:05:28] Morning!
[12:32:36] Hey!
[12:39:59] Thanks kevinbazira for that investigation
[12:44:29] ugggh I still need to remake our external team page
[12:44:37] too many thinnnngss
[12:46:29] elukey ideally we could fit two GPUs in each chassis
[12:50:49] elukey: I just realized my "merged both" may have sounded like I merged https://gerrit.wikimedia.org/r/c/operations/dns/+/963007 as well - can I have your +1 on that?
[12:51:13] * chrisalbon coffee
[13:11:15] chrisalbon: yeah hopefully Papaul will tell us that two MI50 can fit, fingers crossed
[13:11:27] yes
[13:13:41] klausman: +1ed!
[13:13:52] I saw that the dns cookbook ran etc..
[13:13:55] so we are good
[13:19:23] klausman: one thing that I realized is that in Q2 (this quarter) we are going to get the new Lift Wing nodes
[13:19:27] Machine-Learning-Team, Add-Link, Chinese-Sites, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 (Trizek-WMF)
[13:19:28] yes, I totally missed the bit where I was supposed to run the cookbook, but Valentin corrected it :)
[13:19:45] elukey: add'l nodes? Remind me how many again?
[13:20:18] Machine-Learning-Team, Add-Link, Chinese-Sites, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 (Trizek-WMF) Let's deploy on next Wednesday (11th).
[13:20:19] +8 on each cluster, +2 in staging
[13:20:36] Machine-Learning-Team, Add-Link, Chinese-Sites, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 (Trizek-WMF)
[13:20:42] so we'll have some work to do, in theory not an incredible amount but let's add a goal for this
[13:21:28] ack.
[13:21:41] do add the note about partman and the kubelet partition to it
[13:22:36] Otherwise at least *I* will forget :)
[13:30:11] sry for the many notifications on the alerts patch - I added the second alert and CI is failing so i put it back to WIP until I fix it
[13:31:10] isaranto: 5 euros for each -1
[13:31:31] [popcorn emoji]
[13:31:37] I declare bankruptcy!
[13:32:02] klausman: oooof I just realized that we don't need all the IPs allocation etc.. for the new VIP
[13:32:08] totally forgot
[13:32:12] we are using istio ingress
[13:32:29] for ores-legacy we did
[13:32:30] ores-legacy 300 IN CNAME k8s-ingress-ml-serve.discovery.wmnet.
[13:33:05] so we don't even need conftool data :(
[13:33:15] it is just a CNAME
[13:33:23] (that is awesome but I totally forgot)
[13:34:53] so we need to revert the last two patches that you merged, plus abandon the new VIP
[13:36:16] :D
[13:36:20] Alright, I'll see to it
[13:36:27] let's remember next time
[13:36:40] when we use the istio ingress it is all super easy
[13:37:34] So revert 963009 on puppet and 963007 on dns?
[13:37:40] https://gerrit.wikimedia.org/r/c/operations/dns/+/963007 https://gerrit.wikimedia.org/r/c/operations/puppet/+/963009
[13:38:10] (and delete the netbox entry, and run the authdns script and netbox cookbook)
[13:38:15] yes
[13:38:28] we just need the CNAME
[13:39:17] Ok, revert patches are https://gerrit.wikimedia.org/r/c/operations/puppet/+/963033 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/963033
[13:39:23] er https://gerrit.wikimedia.org/r/c/operations/dns/+/963034
[13:44:36] ok, puppet and dns done, now doing the netbox bits
[13:51:22] And CNAME patch sent for review
[13:51:59] klausman: have you tried to hit the endpoint to see if it works?
[13:52:09] via curl I mean, from say stat1004
[13:53:47] https://phabricator.wikimedia.org/T347263#9200077 indicates there might be problems
[13:54:26] It's the performance/quota issue Kevin mentioned on Slack
[13:54:34] (or here? don't remember)
[13:54:46] yeah it is what kevin is working on, but we can test /api/spec (that is fast in theory)
[13:55:37] fast yes, but also "no healthy upstream"
[13:56:28] Though I dunno if me using a full query made the whole thing enter failed state
[13:57:18] Machine-Learning-Team, Patch-For-Review: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (elukey) @kevinbazira I really like the profiling that you did, but I think that the conclusion may not be what we are looking for. You are comparing a result o...
[13:59:54] (PS1) Ilias Sarantopoulos: revscoring: fix mp model servers [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/963064
[14:35:16] Machine-Learning-Team: Create external endpoint for recommendation-api-ng hosted on LiftWing - https://phabricator.wikimedia.org/T347263 (Isaac) Thanks for the update @kevinbazira ! Don't hesitate to ask if I can help brainstorm what might be going on etc. if it turns out to be more than just a resources pro...
[14:37:23] Machine-Learning-Team, Section-Level-Image-Suggestions, Patch-For-Review, Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (mfossati)
[14:39:32] Machine-Learning-Team, Section-Level-Image-Suggestions, Patch-For-Review, Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (mfossati) First complete version & DAG ready for code review.
[15:05:55] elukey: I presume https://gerrit.wikimedia.org/r/c/operations/puppet/+/963013 is obsolete as well (re: CNAMEs)
[15:06:40] yep exactly!
[15:06:49] I just tested the endpoint, added some info to the dns change
[15:06:52] we should be ready to go
[15:07:26] it is soooo much better this setup
[15:07:33] no more pybal restarts etc..
[15:07:42] ayup!
[15:08:04] need to run an errand, back in ~1h
[15:09:22] I'll be out by then, seeya tomorrow
[15:27:48] I started updating the multi-processing code that we have for revscoring. Going afk now, will continue tomorrow o/
[15:27:57] \o
[15:28:12] night!
[15:52:43] Machine-Learning-Team: Create external endpoint for recommendation-api-ng hosted on LiftWing - https://phabricator.wikimedia.org/T347263 (elukey) Internal endpoint available @kevinbazira: ` elukey@stat1004:~$ curl "https://recommendation-api-ng.discovery.wmnet:31443/api/spec" -i --http1.1 HTTP/1.1 200 OK co...
[15:58:28] Machine-Learning-Team, Patch-For-Review: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (elukey) @Isaac looping you in since you may be already aware of some bottlenecks :) Don't feel that you need to code or anything, if you could give us hints as...
[15:59:17] Machine-Learning-Team, Goal: Order 1 GPU for Lift Wing - https://phabricator.wikimedia.org/T341699 (Papaul) Hello All, I took a look at the AMD GPU and it used the 2x8 pin for power according to the specs of the ml-staging servers we have in codfw, it comes with the GPU ready configuration cable install...
[16:01:06] Machine-Learning-Team, Goal: Order 1 GPU for Lift Wing - https://phabricator.wikimedia.org/T341699 (elukey) @Papaul thanks a lot! Can you also confirm that there is enough space to host one or two GPUs in the chassis?
[16:19:14] Machine-Learning-Team, Data-Engineering, Wikimedia Enterprise, Data Engineering and Event Platform Team (Sprint 3), and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (Ottomata)
[16:19:16] Machine-Learning-Team, Data Engineering and Event Platform Team, Data-Engineering, Research, Event-Platform: Proposal: Create a stream end point for Revision Risk Model - https://phabricator.wikimedia.org/T326179 (Ottomata)
[16:21:12] Machine-Learning-Team, Research: Review Revert Risk reports from WME - https://phabricator.wikimedia.org/T347136 (prabhat) @achou Thanks for the heads up. Will verify on our end.
[16:34:39] * elukey afk!
[18:49:34] Lift-Wing, Machine-Learning-Team, artificial-intelligence: Create a tutorial for deploying a model on toolforge - https://phabricator.wikimedia.org/T281317 (TBurmeister)