[06:05:42] (03CR) 10Ilias Sarantopoulos: "This is marked as WIP until I fix it as MP model servers are still failing." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/963064 (owner: 10Ilias Sarantopoulos)
[08:43:10] folks if the price allows it we should get https://www.amd.com/en/products/server-accelerators/instinct-mi100
[08:43:15] it looks really nice
[08:43:28] and it doubles the processing units from the MI50
[08:43:39] (32GB of VRAM, etc..)
[08:43:53] * elukey errand, bbiab
[08:46:36] looks much better!
[09:33:15] noice
[09:34:30] pricy, but not eye-wateringly so. In CH, you can get one for around 2300 CHF/2500 USD
[09:45:00] 2500 USD is a relatively cheap price for a GPU in that range
[09:45:16] I have seen different prices from various vendors, we'll see what we can get
[09:45:52] the other one that we could think about for special use cases is https://www.amd.com/en/products/server-accelerators/amd-instinct-mi210
[09:46:02] 64GB of VRAM
[09:46:25] anyway, I think that the budget that we added for this fiscal year will allow us to get a few GPUs
[09:46:35] unless we get good prices
[09:47:05] from a chat with Chris it seems that other players are using the model 1 GPU -> one POD/application, without sharing etc..
[09:47:18] so we have to bet on batching to be effective :)
[09:47:31] models like RR etc.. could also run on 16GB GPUs, which are way cheaper
[09:57:24] 10Machine-Learning-Team, 10Goal: Order 1 GPU for Lift Wing - https://phabricator.wikimedia.org/T341699 (10elukey) Had a chat with Papaul on IRC: * We have space for two GPUs of the MI50/100 size in our chassis (modulo last minute surprises when we mount them on the nodes). * If we want to have two GPUs on the...
[09:57:42] ack. I think unless you want absolutely bananas performance, the VRAM size is a big factor in GPU/MPU pricing
[09:57:54] brb, need to reboot (yay kernel and glibc updates!)
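The "1 GPU -> one POD/application, without sharing" model mentioned above means batching is the main lever for keeping a dedicated GPU busy. A minimal sketch of server-side micro-batching, assuming a hypothetical `predict_batch` model call; the names, batch size, and wait time are illustrative, not Lift Wing code:

```python
import asyncio

# Illustrative micro-batching sketch (not Lift Wing code): concurrent
# requests are queued and served by one batched model call, so a GPU
# dedicated to a single pod stays busy instead of handling requests
# one at a time. `predict_batch` is a stand-in for the real model.

MAX_BATCH = 8      # largest batch handed to the model at once
MAX_WAIT_S = 0.01  # how long to wait for more requests to fill a batch


def predict_batch(inputs):
    """Stand-in for a real batched model invocation."""
    return [f"prediction-for-{x}" for x in inputs]


class Batcher:
    def __init__(self):
        self.queue = asyncio.Queue()
        self.worker = None

    async def predict(self, item):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        if self.worker is None:
            self.worker = asyncio.create_task(self._run())
        return await fut

    async def _run(self):
        while True:
            batch = [await self.queue.get()]
            # Collect more requests until the batch is full or the wait expires.
            try:
                while len(batch) < MAX_BATCH:
                    batch.append(await asyncio.wait_for(self.queue.get(), MAX_WAIT_S))
            except asyncio.TimeoutError:
                pass
            outputs = predict_batch([item for item, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


async def main():
    batcher = Batcher()
    # Four concurrent requests end up in a single batched model call.
    return await asyncio.gather(*(batcher.predict(i) for i in range(4)))


results = asyncio.run(main())
print(results)
```

The trade-off is the usual one: a longer `MAX_WAIT_S` improves GPU utilization at the cost of per-request latency.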
[10:03:02] 10Machine-Learning-Team, 10Patch-For-Review: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (10kevinbazira) Thanks @elukey. Currently, [[ https://github.com/wikimedia/operations-deployment-charts/blob/76f278de539d16a6704ed82c99b6d4d973d2ded0/helmfile.d/m...
[10:13:09] 10Machine-Learning-Team, 10Patch-For-Review: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (10elukey) >>! In T347475#9224098, @kevinbazira wrote: > Thanks @elukey. Currently, [[ https://github.com/wikimedia/operations-deployment-charts/blob/76f278de539d...
[10:16:52] 10Machine-Learning-Team, 10Patch-For-Review: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (10elukey) For example, how much time does it take to complete a request on stat1008? From the profiling IIUC it seems that we are around several seconds, you cou...
[10:20:59] 10Machine-Learning-Team, 10DC-Ops, 10ops-codfw: GPU purchase for ml-staging in codfw - https://phabricator.wikimedia.org/T348118 (10elukey)
[10:21:53] 10Machine-Learning-Team, 10Goal: Order 1 GPU for Lift Wing - https://phabricator.wikimedia.org/T341699 (10elukey) Opened T348118 to follow up with DCops :)
[10:25:40] ok, opened all the tasks for the GPU
[10:25:45] we'll see how it goes
[10:49:28] kevinbazira: o/
[10:49:34] let's talk about the cpu requirements
[10:49:45] I feel that we are not on the same page
[10:50:32] we can add a huge number, and it would probably alleviate things, but reserving a ton of cpu resources for a single pod is something that we should do only if strictly needed
[10:50:33] elukey: o/ sure. the plan is to test until we get to a suitable configuration.
[10:52:08] kevinbazira: as I keep saying, we should focus on finding the bottlenecks first.
Even if we find a relatively workable setup, the api will (probably) require a ton of resources, and when Content-Translation/Android/etc.. migrate to it, we'll have to add a lot of pods that will eat even more resources
[10:52:29] we shouldn't spend a ton of work improving the code, but there is something in it that is clearly slow and inefficient
[10:55:28] I agree with you, improving code that was written by the original rec-api developers in 2016 is desirable. ATM, the priority is to migrate the rec-api to LW; then we can make improvements to it afterwards.
[10:55:34] 10Machine-Learning-Team, 10Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (10elukey) @MGerlach we are done! Let us know if we are good or if anything is missing :)
[10:56:52] kevinbazira: I don't agree, we should not simply port the app with clear bottlenecks to kubernetes, using a ton of resources for no reason. Deep improvements shouldn't be made, but finding the bottlenecks in the code will clearly tell us what to fix (even in the uwsgi settings).
[10:57:27] the API is basically ported, but it is not in a state that we can expose, in my opinion
[10:57:52] but, getting back to the original point, we can start to increase workers
[10:58:20] 8 is not a good starting number, it is very high and there is no clear justification for it
[10:58:55] in k8s we should keep a pod in a state that consumes an adequate amount of resources, and scale up only if needed
[10:59:26] having pods with 8 cpus assigned is not very flexible; having pods consuming 2 cpus that scale up when needed is better in my opinion
[10:59:29] does it make sense
[10:59:29] ?
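On the "find the bottlenecks first" point, per-request profiling makes the argument concrete: a cProfile run over a single request shows immediately whether time goes to the app code, an upstream call, or serialization. A self-contained sketch with stand-in functions; the handler and the slow external call below are hypothetical, not the rec-api code:

```python
import cProfile
import io
import pstats
import time

# Hypothetical stand-ins: a handler whose latency is dominated by one
# slow upstream call, mimicking the shape of the rec-api situation.

def fetch_external_data():
    time.sleep(0.05)  # stands in for a slow external API call
    return {"items": list(range(10))}

def handle_request():
    data = fetch_external_data()
    return sorted(data["items"], reverse=True)

# Profile one request end to end.
profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

# Rank by cumulative time: the slow dependency floats to the top,
# telling us where a fix (code, caching, or uwsgi/worker settings)
# would actually pay off.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats()
report = out.getvalue()
print(report)
```

If the report is dominated by an external call rather than CPU work, adding workers or CPUs mostly multiplies idle waiting, which is the core of the argument against starting at 8 workers.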
[11:01:49] sure, I have pushed a commit to test the 2 cpus
[11:02:51] ok, we can proceed in staging
[11:03:57] Good morning all
[11:04:09] isaranto: o/ I updated all the api portal pages (hopefully) with the content-type header :)
[11:04:14] chrisalbon: wow, early morning :)
[11:04:19] Nice
[11:04:37] At the moment I'm stuck with some laptop updates
[11:04:54] I've been postponing them for a while
[11:06:49] chrisalbon: morning! Although I guess it is still night
[11:11:03] * elukey lunch!
[12:12:19] * isaranto late lunch
[12:47:03] 10Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=79579710-b671-4886-a85b-afefbf9b3afb) set by klausman@cumin1001 for 90 days, 0:00:00 on 22 host(s) and their services with reason:...
[12:57:32] Ok, all ORES machines except ores1001, ores2001 and orespoolcounter1003 have been shut down
[12:57:41] How many servers is that?
[12:57:58] 19 total, if I got my math right.
[12:58:12] that was my math too, just checking
[12:58:19] 8 each in eqiad/codfw for the base machines and another three poolcounters
[12:58:30] So eventually it will be 21 total
[12:59:00] yep.
[12:59:07] No, 22
[12:59:14] 1001, 2001 and a pool counter
[12:59:16] 18+3?
[12:59:55] we already have 16+3 down (2x8 main + 3 pc), and another 3 are still up
[13:00:56] There are nine base machines per DC, eight of which I just shut down
[13:01:59] oh, nvm, the poolcounters are VMs :D
[13:02:13] So 16 real machines down, with two to go
[13:04:04] 10Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10klausman) The machines `ores[2001-2009].codfw.wmnet,ores[1001-1009].eqiad.wmnet` have been shut down (1001 and 2001 are still running in case we need files from them).
[13:08:41] \o/
[13:12:11] That's probably a kilowatt or three of power saved
[13:21:07] klausman: let's shut down the others too
[13:21:17] we don't really need them :)
[13:22:07] alrighty
[13:22:59] and done
[13:23:18] 10Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10klausman) After discussion on IRC, I have also shut down 1001 and 2001.
[13:23:54] I think that's all the physical machines. The poolcounters are Ganeti VMs, and I am currently rummaging through Horizon for potential leftovers
[13:24:17] did you check netstat on the poolcounters to make sure that nothing was using them?
[13:24:27] I am pretty sure ores is the only one configured to do so
[13:24:30] but better be safe
[13:24:33] well, they're still up, just downtimed
[13:24:52] They are VMs, so when I shut them down, they booted again :)
[13:26:43] klausman: there is also https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Steps_for_ANY_Opsen
[13:26:49] not sure why it is not linked in the doc
[13:27:16] the sre.hosts.decommission cookbook also makes the hosts unbootable etc..
[13:27:18] I wanted to wait with the decom ticket for a few days after the shutdown, just in case
[13:28:20] klausman: yeah but we cannot make them unbootable if they are down :)
[13:28:55] Good point
[13:33:00] 10Machine-Learning-Team, 10Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing. - https://phabricator.wikimedia.org/T343123 (10Seddon)
[13:33:12] 10Machine-Learning-Team, 10Wikipedia-Android-App-Backlog (Android Release - FY2023-24): Migrate Machine-generated Article Descriptions from toolforge to liftwing.
- https://phabricator.wikimedia.org/T343123 (10calbon) a:03calbon
[13:41:01] 10Machine-Learning-Team, 10decommission-hardware: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10klausman)
[13:41:46] 10Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10klausman) Filed T348144 for decomming.
[13:43:59] Ok, all of them powered back on, proceeding with the decom script
[13:49:03] Dare I ask why they are back on again?
[13:49:28] a proper decom needs them booted so they can be made unbootable before decom
[13:49:36] ah
[13:49:36] It's a procedure step I forgot about.
[13:49:47] ZOMBIE ORES IS BACK
[13:51:28] Lemme get my SPAS-12
[13:58:58] elukey: *sigh* of course four of the machines don't boot properly (not ssh'able after boot)
[14:09:04] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores2008.codfw.wmnet` - ores2008.co...
[14:16:38] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores2007.codfw.wmnet` - ores2007.co...
[14:17:21] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores2005.codfw.wmnet` - ores2005.co...
[14:18:04] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores2006.codfw.wmnet` - ores2006.co...
[14:21:27] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores2009.codfw.wmnet` - ores2009.co...
[14:38:56] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores[1002-1009].eqiad.wmnet` - ores...
[14:39:15] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by klausman@cumin1001 for hosts: `ores[2001-2004].codfw.wmnet` - ores...
[14:40:04] all machines except 1001 ran through decom.
[14:46:50] 10Machine-Learning-Team, 10decommission-hardware, 10Patch-For-Review: decommission ores{1001..1009,2001..2009}.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T348144 (10klausman)
[14:59:40] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw, 10procurement: GPU purchase for ml-staging in codfw - https://phabricator.wikimedia.org/T348118 (10wiki_willy) a:03RobH Adding the procurement project tag. @RobH - can you move this to the S4 space as well?
Thanks, Willy
[15:02:23] 10Machine-Learning-Team, 10Goal: Lift Wing announced as MVP to the public - https://phabricator.wikimedia.org/T341703 (10calbon) 05Open→03Declined
[15:03:27] 10Machine-Learning-Team, 10Goal: Stretch: Swagger UI implemented for every production inference service - https://phabricator.wikimedia.org/T341701 (10calbon) 05Open→03Declined
[15:03:36] 10Machine-Learning-Team, 10Goal: Stretch: Inference batching is tested to our satisfaction - https://phabricator.wikimedia.org/T341702 (10calbon) 05Open→03Declined
[15:03:43] 10Machine-Learning-Team, 10Goal: Stretch: Hosting a production ready version of an LLM - https://phabricator.wikimedia.org/T341695 (10calbon) 05Open→03Declined
[15:10:48] 10Machine-Learning-Team: Goal: Lift Wing users can request multiple predictions using a single request. - https://phabricator.wikimedia.org/T348153 (10calbon)
[15:11:04] 10Machine-Learning-Team: Goal: Lift Wing users can request multiple predictions using a single request. - https://phabricator.wikimedia.org/T348153 (10calbon) a:05AikoChou→03achou
[15:15:01] 10Machine-Learning-Team: Users can query a large language model using the API Gateway and receive a response in a reasonable amount of time. - https://phabricator.wikimedia.org/T348154 (10calbon)
[15:15:22] 10Machine-Learning-Team: Goal: Users can query a large language model using the API Gateway and receive a response in a reasonable amount of time.
- https://phabricator.wikimedia.org/T348154 (10calbon) a:03isarantopoulos
[15:18:36] 10Machine-Learning-Team: Decide on an optional Lift Wing caching strategy for model servers - https://phabricator.wikimedia.org/T348155 (10calbon)
[15:19:42] 10Machine-Learning-Team: Increase the number of models hosted on Lift Wing - https://phabricator.wikimedia.org/T348156 (10calbon)
[15:19:54] 10Machine-Learning-Team: Goal: Decide on an optional Lift Wing caching strategy for model servers - https://phabricator.wikimedia.org/T348155 (10calbon)
[15:20:03] 10Machine-Learning-Team: Goal: Increase the number of models hosted on Lift Wing - https://phabricator.wikimedia.org/T348156 (10calbon)
[15:50:27] (03PS1) 10Ilias Sarantopoulos: revscoring: allow local runs [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/963367 (https://phabricator.wikimedia.org/T347404)
[16:09:18] going afk folks!
[16:09:20] have a nice one
[16:12:10] 10Machine-Learning-Team, 10Patch-For-Review: Refactor inference services repo to allow local runs - https://phabricator.wikimedia.org/T347404 (10isarantopoulos) I have refactored revscoring model servers so that we can run them locally. I have only done this for revscoring so that we keep changes as minimal as...
[16:13:15] bye Luca!
[16:13:21] I refactored the revscoring services so that one can just run them locally without docker. Lemme know what u all think
[16:13:32] going afk as well. ciao folks!
[17:11:24] 10Machine-Learning-Team, 10Research: Review Revert Risk reports from WME - https://phabricator.wikimedia.org/T347136 (10prabhat) In the last 50 hours, we haven't seen any "Unsupported lang" issue. Thanks for fixing this.
[22:19:43] 10Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (10Isaac) Leaving some thoughts: as Kevin's profiling shows, the majority of the work is going on under the umbrella of `/home/recommendation-api/recommendation/api/external_data/wikid...