[06:05:42] (03CR) 10Ilias Sarantopoulos: "This is marked as WIP until I fix it as MP model servers are still failing." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/963064 (owner: 10Ilias Sarantopoulos)
[08:43:10] folks if the price allows it we should get https://www.amd.com/en/products/server-accelerators/instinct-mi100
[08:43:15] it looks really nice
[08:43:28] and it doubles the processing units from the MI50
[08:43:39] (32GB of VRAM, etc..)
[08:43:53] * elukey errand, bbiab
[08:46:36] looks much better!
[09:33:15] noice
[09:34:30] pricy, but not eye-wateringly so. In CH, you can get one for around 2300 CHF/2500 USD
[09:45:00] 2500 USD is a relatively cheap price for a GPU in that range
[09:45:16] I have seen different prices from various vendors, we'll see what we can get
[09:45:52] the other one that we could think about for special use cases is https://www.amd.com/en/products/server-accelerators/amd-instinct-mi210
[09:46:02] 64GB of VRAM
[09:46:25] anyway, I think that the budget that we added for this fiscal year will allow us to get a few GPUs
[09:46:35] unless we get good prices
[09:47:05] from a chat with Chris it seems that other players are using the model 1 GPU -> one POD/application, without sharing etc..
[09:47:18] so we have to bet on batching to be effective :)
[09:47:31] models like RR etc.. could also run on 16GB GPUs, which are way cheaper
[09:57:24] 10Machine-Learning-Team, 10Goal: Order 1 GPU for Lift Wing - https://phabricator.wikimedia.org/T341699 (10elukey) Had a chat with Papaul on IRC: * We have space for two GPUs of the MI50/100 size in our chassis (modulo last minute surprises when we mount them on the nodes). * If we want to have two GPUs on the...
[09:57:42] ack. I think unless you want absolutely bananas performance, the VRAM size is a big factor in GPU/MPU pricing
[09:57:54] brb, need to reboot (yay kernel and glibc updates!)
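The "1 GPU -> one POD/application, without sharing" model mentioned above means batching is the main lever for keeping a dedicated GPU busy. A minimal sketch of server-side micro-batching, assuming a hypothetical `predict_batch` model call; the names, batch size, and wait time are illustrative, not Lift Wing code:

```python
import asyncio

# Illustrative micro-batching sketch (not Lift Wing code): concurrent
# requests are queued and served by one batched model call, so a GPU
# dedicated to a single pod stays busy instead of handling requests
# one at a time. `predict_batch` is a stand-in for the real model.

MAX_BATCH = 8      # largest batch handed to the model at once
MAX_WAIT_S = 0.01  # how long to wait for more requests to fill a batch


def predict_batch(inputs):
    """Stand-in for a real batched model invocation."""
    return [f"prediction-for-{x}" for x in inputs]


class Batcher:
    def __init__(self):
        self.queue = asyncio.Queue()
        self.worker = None

    async def predict(self, item):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        if self.worker is None:
            self.worker = asyncio.create_task(self._run())
        return await fut

    async def _run(self):
        while True:
            batch = [await self.queue.get()]
            # Collect more requests until the batch is full or the wait expires.
            try:
                while len(batch) < MAX_BATCH:
                    batch.append(await asyncio.wait_for(self.queue.get(), MAX_WAIT_S))
            except asyncio.TimeoutError:
                pass
            outputs = predict_batch([item for item, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


async def main():
    batcher = Batcher()
    # Four concurrent requests end up in a single batched model call.
    return await asyncio.gather(*(batcher.predict(i) for i in range(4)))


results = asyncio.run(main())
print(results)
```

The trade-off is the usual one: a longer `MAX_WAIT_S` improves GPU utilization at the cost of per-request latency.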
[10:03:02] 10Machine-Learning-Team, 10Patch-For-Review: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (10kevinbazira) Thanks @elukey. Currently, [[ https://github.com/wikimedia/operations-deployment-charts/blob/76f278de539d16a6704ed82c99b6d4d973d2ded0/helmfile.d/m...
[10:13:09] 10Machine-Learning-Team, 10Patch-For-Review: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (10elukey) >>! In T347475#9224098, @kevinbazira wrote: > Thanks @elukey. Currently, [[ https://github.com/wikimedia/operations-deployment-charts/blob/76f278de539d...
[10:16:52] 10Machine-Learning-Team, 10Patch-For-Review: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (10elukey) For example, how much time does it take to complete a request on stat1008? From the profiling IIUC it seems that we are around several seconds, you cou...
[10:20:59] 10Machine-Learning-Team, 10DC-Ops, 10ops-codfw: GPU purchase for ml-staging in codfw - https://phabricator.wikimedia.org/T348118 (10elukey)
[10:21:53] 10Machine-Learning-Team, 10Goal: Order 1 GPU for Lift Wing - https://phabricator.wikimedia.org/T341699 (10elukey) Opened T348118 to follow up with DCops :)
[10:25:40] ok, opened all the tasks for the GPU
[10:25:45] we'll see how it goes
[10:49:28] kevinbazira: o/
[10:49:34] let's talk about the cpu requirements
[10:49:45] I feel that we are not on the same page
[10:50:32] we can add a huge number, and it would probably alleviate things, but reserving a ton of cpu resources for a single pod is something that we should do only if strictly needed
[10:50:33] elukey: o/ sure. the plan is to test until we get to a suitable configuration.
[10:52:08] kevinbazira: as I keep saying, we should focus on finding the bottlenecks first.
Even if we find a relatively workable setup, the api will (probably) require a ton of resources, and when Content-Translation/Android/etc.. migrate to it, we'll have to add a lot of pods that will eat even more resources
[10:52:29] we shouldn't spend a ton of work improving the code, but there is something in it that is clearly slow and inefficient
[10:55:28] I agree with you, improving code that was written by the original rec-api developers in 2016 is desirable. ATM, the priority is to migrate the rec-api to LW; then we can make improvements to it afterwards.
[10:55:34] 10Machine-Learning-Team, 10Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (10elukey) @MGerlach we are done! Let us know if we are good or if anything is missing :)
[10:56:52] kevinbazira: I don't agree, we should not simply port the app with clear bottlenecks to kubernetes, using a ton of resources for no reason. Deep improvements shouldn't be made, but finding the bottlenecks in the code will clearly tell us what to fix (even in the uwsgi settings).
[10:57:27] the API is basically ported, but it is not in a state that we can expose, in my opinion
[10:57:52] but, getting back to the original point, we can start to increase workers
[10:58:20] 8 is not a good starting number, it is very high and there is no clear justification for it
[10:58:55] in k8s we should keep a pod in a state that consumes an adequate amount of resources, and scale up only if needed
[10:59:26] having pods with 8 cpus assigned is not very flexible; having pods consuming 2 cpus that scale up when needed is better in my opinion
[10:59:29] does it make sense
[10:59:29] ?
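On the "find the bottlenecks first" point, per-request profiling makes the argument concrete: a cProfile run over a single request shows immediately whether time goes to the app code, an upstream call, or serialization. A self-contained sketch with stand-in functions; the handler and the slow external call below are hypothetical, not the rec-api code:

```python
import cProfile
import io
import pstats
import time

# Hypothetical stand-ins: a handler whose latency is dominated by one
# slow upstream call, mimicking the shape of the rec-api situation.

def fetch_external_data():
    time.sleep(0.05)  # stands in for a slow external API call
    return {"items": list(range(10))}

def handle_request():
    data = fetch_external_data()
    return sorted(data["items"], reverse=True)

# Profile one request end to end.
profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

# Rank by cumulative time: the slow dependency floats to the top,
# telling us where a fix (code, caching, or uwsgi/worker settings)
# would actually pay off.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats()
report = out.getvalue()
print(report)
```

If the report is dominated by an external call rather than CPU work, adding workers or CPUs mostly multiplies idle waiting, which is the core of the argument against starting at 8 workers.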
[11:01:49] sure, I have pushed a commit to test the 2 cpus
[11:02:51] ok, we can proceed in staging
[11:03:57] Good morning all
[11:04:09] isaranto: o/ I updated all the api portal pages (hopefully) with the content-type header :)
[11:04:14] chrisalbon: wow, early morning :)
[11:04:19] Nice
[11:04:37] At the moment I'm stuck with some laptop updates
[11:04:54] I've been postponing them for a while
[11:06:49] chrisalbon: morning! Although I guess it is still night
[11:11:03] * elukey lunch!
[12:12:19] * isaranto late lunch
[12:47:03] 10Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=79579710-b671-4886-a85b-afefbf9b3afb) set by klausman@cumin1001 for 90 days, 0:00:00 on 22 host(s) and their services with reason:...
[12:57:32] Ok, all ORES machines except ores1001, ores2001 and orespoolcounter1003 have been shut down
[12:57:41] How many servers is that?
[12:57:58] 19 total, if I got my math right.
[12:58:12] that was my math too, just checking
[12:58:19] 8 each in eqiad/codfw for the base machines and another three poolcounters
[12:58:30] So eventually it will be 21 total
[12:59:00] yep.
[12:59:07] No, 22
[12:59:14] 1001, 2001 and a pool counter
[12:59:16] 18+3?
[12:59:55] we already have 16+3 down (2x8 main + 3 pc), and another 3 are still up
[13:00:56] There are nine base machines per DC, eight of which I just shut down
[13:01:59] oh, nvm, the poolcounters are VMs :D
[13:02:13] So 16 real machines down, with two to go
[13:04:04] 10Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10klausman) The machines `ores[2001-2009].codfw.wmnet,ores[1001-1009].eqiad.wmnet` have been shut down (1001 and 2001 are still running in case we need files from them).
[13:08:41] \o/
[13:12:11] That's probably a kilowatt or three of power saved
[13:21:07] klausman: let's shut down the others too
[13:21:17] we don't really need them :)
[13:22:07] alrighty
[13:22:59] and done
[13:23:18] 10Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10klausman) After discussion on IRC, I have also shut down 1001 and 2001.
[13:23:54] I think that's all the physical machines. The poolcounters are Ganeti VMs, and I am currently rummaging through Horizon for potential leftovers
[13:24:17] did you check netstat on the poolcounters to make sure that nothing was using them?
[13:24:27] I am pretty sure ores is the only one configured to do so
[13:24:30] but better be safe
[13:24:33] well, they're still up, just downtimed
[13:24:52] They are VMs, so when I shut them down, they booted again :)
[13:26:43] klausman: there is also https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Steps_for_ANY_Opsen
[13:26:49] not sure why it is not linked in the doc
[13:27:16] the sre.hosts.decommission cookbook also makes the hosts unbootable etc..
[13:27:18] I wanted to wait with the decom ticket for a few days after the shutdown, just in case
[13:28:20] klausman: yeah but we cannot make them unbootable if they are down :)
[13:28:55] Good point
[13:33:00] 10Machine-Learning-Team, 10Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123 (10Seddon)
[13:33:12] 10Machine-Learning-Team, 10Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing.
- https://phabricator.wikimedia.org/T343123 (10calbon) a:03calbon
[13:41:01] 10Machine-Learning-Team, 10decommission-hardware: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10klausman)
[13:41:46] 10Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10klausman) Filed T348144 for decomming.
[13:43:59] Ok, all of them powered back on, proceeding with the decom script
[13:49:03] Dare I ask why they are back on again?
[13:49:28] a proper decom needs them booted so they can be made unbootable before decom
[13:49:36] ah
[13:49:36] It's a procedure step I forgot about.
[13:49:47] ZOMBIE ORES IS BACK
[13:51:28] Lemme get my SPAS-12
[13:58:58] elukey: *sigh* of course four of the machines don't boot properly (not ssh'able after boot)
[14:09:04] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores2008.codfw.wmnet` - ores2008.co...
[14:16:38] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores2007.codfw.wmnet` - ores2007.co...
[14:17:21] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores2005.codfw.wmnet` - ores2005.co...
[14:18:04] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores2006.codfw.wmnet` - ores2006.co...
[14:21:27] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores2009.codfw.wmnet` - ores2009.co...
[14:38:56] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores[1002-1009].eqiad.wmnet` - ores...
[14:39:15] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores[2001-2004].codfw.wmnet` - ores...
[14:40:04] all machines except 1001 ran through decom.
[14:46:50] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10klausman)
[14:59:40] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw, 10procurement: GPU purchase for ml-staging in codfw - https://phabricator.wikimedia.org/T348118 (10wiki_willy) a:03RobH Adding the procurement project tag. @RobH - can you move this to the S4 space as well?
Thanks, Willy
[15:02:23] 10Machine-Learning-Team, 10Goal: Lift Wing announced as MVP to the public - https://phabricator.wikimedia.org/T341703 (10calbon) 05Open→03Declined
[15:03:27] 10Machine-Learning-Team, 10Goal: Stretch: Swagger UI implemented for every production inference service - https://phabricator.wikimedia.org/T341701 (10calbon) 05Open→03Declined
[15:03:36] 10Machine-Learning-Team, 10Goal: Stretch: Inference batching is tested to our satisfaction - https://phabricator.wikimedia.org/T341702 (10calbon) 05Open→03Declined
[15:03:43] 10Machine-Learning-Team, 10Goal: Stretch: Hosting a production ready version of an LLM - https://phabricator.wikimedia.org/T341695 (10calbon) 05Open→03Declined
[15:10:48] 10Machine-Learning-Team: Goal: Lift Wing users can request multiple predictions using a single request. - https://phabricator.wikimedia.org/T348153 (10calbon)
[15:11:04] 10Machine-Learning-Team: Goal: Lift Wing users can request multiple predictions using a single request. - https://phabricator.wikimedia.org/T348153 (10calbon) a:05AikoChou→03achou
[15:15:01] 10Machine-Learning-Team: Users can query a large language model using the API Gateway and receive a response in a reasonable amount of time. - https://phabricator.wikimedia.org/T348154 (10calbon)
[15:15:22] 10Machine-Learning-Team: Goal: Users can query a large language model using the API Gateway and receive a response in a reasonable amount of time.
- https://phabricator.wikimedia.org/T348154 (10calbon) a:03isarantopoulos
[15:18:36] 10Machine-Learning-Team: Decide on an optional Lift Wing caching strategy for model servers - https://phabricator.wikimedia.org/T348155 (10calbon)
[15:19:42] 10Machine-Learning-Team: Increase the number of models hosted on Lift Wing - https://phabricator.wikimedia.org/T348156 (10calbon)
[15:19:54] 10Machine-Learning-Team: Goal: Decide on an optional Lift Wing caching strategy for model servers - https://phabricator.wikimedia.org/T348155 (10calbon)
[15:20:03] 10Machine-Learning-Team: Goal: Increase the number of models hosted on Lift Wing - https://phabricator.wikimedia.org/T348156 (10calbon)
[15:50:27] (03PS1) 10Ilias Sarantopoulos: revscoring: allow local runs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/963367 (https://phabricator.wikimedia.org/T347404)
[16:09:18] going afk folks!
[16:09:20] have a nice one
[16:12:10] 10Machine-Learning-Team, 10Patch-For-Review: Refactor inference services repo to allow local runs - https://phabricator.wikimedia.org/T347404 (10isarantopoulos) I have refactored revscoring model servers so that we can run them locally. I have only done this for revscoring so that we keep changes as minimal as...
[16:13:15] bye Luca!
[16:13:21] I refactored the revscoring services so that one can just run them locally without docker. Lemme know what u all think
[16:13:32] going afk as well. ciao folks!
[17:11:24] 10Machine-Learning-Team, 10Research: Review Revert Risk reports from WME - https://phabricator.wikimedia.org/T347136 (10prabhat) In the last 50 hours, we haven't seen any "Unsupported lang" issue. Thanks for fixing this.
[22:19:43] 10Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (10Isaac) Leaving some thoughts: as Kevin's profiling shows, the majority of the work is going on under the umbrella of `/home/recommendation-api/recommendation/api/external_data/wikid...