[06:39:01] Good morning!!
[06:49:08] morning!
[06:52:20] Machine-Learning-Team: Investigate procuring and installing GPUs on Lift Wing - https://phabricator.wikimedia.org/T327923 (elukey) Found some interesting links about the work that Intel and AMD are doing to support concurrent access to the GPUs: * https://patchwork.kernel.org/project/linux-mm/cover/20190501...
[07:20:29] Machine-Learning-Team: Investigate procuring and installing GPUs on Lift Wing - https://phabricator.wikimedia.org/T327923 (elukey) >>! In T327923#8796607, @elukey wrote: > I found some cards to evaluate, a good compromise (in my opinion, please chime in if you have more options!) between price and performanc...
[07:28:41] https://www.amd.com/en/graphics/instinct-server-accelerators
[07:28:43] wow
[07:29:01] the MI200 serie GPUs are really brutal
[07:29:15] but they cost a fortune so out of our league :)
[07:34:21] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Jelto)
[07:45:19] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[07:52:49] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (fgiunchedi)
[08:22:22] isaranto: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/914262 should unblock us in theory
[08:22:34] let's see what diff comes out
[08:26:12] Lift-Wing, Machine-Learning-Team, Documentation: Improve Lift Wing documentation - https://phabricator.wikimedia.org/T316098 (isarantopoulos) Added some notes on the differences between when using Lift Wing instead of ORES: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage#Differenc...
[08:26:34] nice! 🤞
[08:27:18] elukey: I added the documentation regarding lift wing vs ORES usage we had discussed.
[08:27:18] https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage#Differences_using_Lift_Wing_instead_of_ORES
[08:27:18] lemme know if u think this is not the right place for it
[08:28:48] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Gehel)
[08:29:04] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Gehel)
[08:32:52] isaranto: I think it is great! Only one nit in the last paragraph, I don't get the " which adds would" bi
[08:32:55] *bit
[08:32:58] but the rest looks very good thanks!
[08:33:46] yeah me neither 😛
[08:33:49] I think it should be s/resolve/result/
[08:34:02] last minute edit , fixing...
[08:34:04] Also, morning :)
[08:34:16] Morning Tobias!
[08:38:31] done! rephrase to `which could lead to increased complexity when implementing a caching mechanism`
[08:39:41] LGTM
[08:41:45] +1
[08:41:59] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi)
[08:42:54] elukey: I'll take care of the de/pooling for codfw today
[08:43:41] (I presumed that was the agreement already, but want to make sure :))
[08:52:12] klausman: sure thanks!
[09:08:47] Machine-Learning-Team, SRE, serviceops, Language-Team (Language-2023-April-June), and 2 others: New Service Deployment Request: NNLB-200 for machine translation - https://phabricator.wikimedia.org/T329971 (Pginer-WMF)
[09:12:16] isaranto: you were right about tls certs, I hoped that they were done via cfssl but they still need cergen
[09:12:19] sigh
[09:12:29] we have two options for the ml-staging cluster
[09:12:44] 1) we create one ad-hoc endpoint for ores
[09:12:57] 2) we create one generic ml-staging.codfw.wmnet endpoint/VIP
[09:13:16] I'd be in favor of 2) since it is close to what serviceops does
[09:24:14] I'd go with 2 since we most likely will add more services in the future
[09:24:43] can I help in any way?
[09:27:46] all SRE things but I'll loop you in for code reviews if you have time
[09:30:56] ack, thanks
[09:31:35] yeah loop me in, probably i won't have anything valuable to add but it helps me understand how things work
[09:33:58] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (BTullis)
[09:55:13] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (BTullis)
[09:59:43] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row C s...
[10:15:54] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (Michaelcochez) We do our development on github. Does it make more sense to restart with a new repository on gitlab to mirror that,...
[10:24:07] (PS17) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[10:29:02] (PS18) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[10:45:07] Machine-Learning-Team: Create a staging ingress configuration for ml-staging-codfw - https://phabricator.wikimedia.org/T335756 (elukey)
[10:49:48] (PS19) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[10:50:55] I opened up the ores extension patch for reviews. One thing missing is that I am using an external endpoint at the moment. I need to figure out how to do the mapping for the model hostnames so we can use the internal one
[10:51:10] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (hashar) >>! In T332953#8819607, @Michaelcochez wrote: > We do our development on github. Does it make more sense to restart with a...
[10:52:56] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row C s...
[11:06:24] <- lunch
[11:06:59] elukey: as for the endpoint: I agree 2) is more preferable
[11:08:33] (CR) Kevin Bazira: [C: +1] "Thank you for working in this, Ilias!" [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953) (owner: Ilias Sarantopoulos)
[11:10:50] klausman: ack thanks, going to send some code reviews around
[11:10:59] (PS20) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[11:11:43] I changed the external endpoint to an internal one but don't know of a way to test this
[11:25:28] (CR) Ladsgroup: "Generally looks good, It's making me so happy that the change is quite small in comparison to the previous one. Two notes:" [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953) (owner: Ilias Sarantopoulos)
[12:11:15] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[12:18:46] isaranto: yeah it may be difficult, not sure how we can do it..
[12:19:35] elukey: do u mean that it doesn't have access internally?
[12:19:49] Or difficult to test it?
[12:19:58] isaranto: nono it needs to access the internal endpoint, I meant that I am not sure what to suggest for testing
[12:20:06] Ack
[12:20:33] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[12:20:39] One way I can try is running mediawiki + ores extension on sandbox the way I do locally
[12:20:51] I'll try that and see
[12:20:54] ack!
[12:25:53] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ssingh)
[12:26:59] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[12:34:58] isaranto: (if you have time) - we'd need to instruct ores-legacy to use localhost:6031 to contact Lift Wing, is it possible?
[12:35:10] I mean if there is already a flag
[12:35:15] otherwise I'll check the code
[12:36:01] Nono I'll do it
[12:36:14] U mean on sandbox or in production?
[12:36:33] production, since we'll need to use the tls-proxy sidecar
[12:36:51] if you have to code it don't worry, I can check later, not urgent
[12:40:12] It is just configuration. I'll change it after I test on sandbox
[12:40:24] ah I see, LIFTWING_URL
[12:40:52] not sure if you can test it in the sandboxy, in there we don't have the sidecar yet IIRC
[12:41:05] I'll test it in staging don't worry
[12:42:03] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[12:43:14] the main issue is that I am not sure if the scaffold stuff is able to add ENV variables
[12:43:19] we may have to add the functionality
[12:45:54] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[13:03:23] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=21224f03-d3c2-4431-accb-64fcadd01a0f) set by ayounsi@cumin1001 for 2:00:00 on...
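(Editor's note on the LIFTWING_URL exchange above: a minimal sketch of how a service could pick up its Lift Wing base URL from the environment, so that production can point it at the tls-proxy sidecar on localhost:6031. Only the variable name LIFTWING_URL and the localhost:6031 address come from the discussion. The function name, the plain-HTTP scheme, and the choice of fallback are assumptions for illustration, not the actual ores-legacy code.)

```python
import os

# Assumption: the sidecar is reached over plain HTTP on localhost; the
# discussion only mentions "localhost:6031", not the scheme.
HYPOTHETICAL_FALLBACK_URL = "http://localhost:6031"


def liftwing_base_url(env=None):
    """Return the Lift Wing base URL, preferring the LIFTWING_URL variable.

    `env` defaults to os.environ; passing a plain dict makes this testable.
    A trailing slash is stripped so paths can be appended uniformly.
    """
    if env is None:
        env = os.environ
    return env.get("LIFTWING_URL", HYPOTHETICAL_FALLBACK_URL).rstrip("/")
```

(With this shape, pointing the service at a different endpoint is a deployment-level change to one environment variable, which is why the open question in the log is whether the chart scaffold can inject ENV variables at all.)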
[13:24:32] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (klausman)
[13:25:22] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[13:34:31] sry I thought u were talking about mediawiki extension not ores legacy , my bad
[13:35:25] ahhhh sorry!
[13:35:54] no I AM sorry 😆
[13:36:03] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 17th round of wikis - https://phabricator.wikimedia.org/T308143 (kevinbazira)
[13:36:51] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 17th round of wikis - https://phabricator.wikimedia.org/T308143 (kevinbazira) 21/21 models were trained successfully in the 17th round of wikis.
[13:39:32] elukey: switch maintenance all done and no trouble from our POV
[13:39:56] super
[13:47:49] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Andrew)
[13:48:43] taking a quick break before the meeting
[14:02:29] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi)
[14:02:45] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi)
[14:35:38] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (Trizek-WMF) Just to be sure: has anything of what you discussed impacted the deployment? :)
[14:57:06] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Jelto)
[15:00:00] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row C switches...
[15:06:58] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi) Open→Resolved a:ayounsi Upgrade went fine! Thanks everybody.
[15:17:02] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row C switches...
[15:36:44] looks really interesting - https://huggingface.co/docs/accelerate/index
[15:36:54] going to log off in a bit folks, talk with you tomorrow!
[20:19:03] Machine-Learning-Team, DBA, Data-Engineering, Data-Platform-SRE, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (colewhite)