[05:09:39] Machine-Learning-Team: Upgrade the link recommendation algorithm from Spark 2 to Spark 3. - https://phabricator.wikimedia.org/T323493 (kevinbazira) The training pipeline succeeded after fixing code blocks ([[ https://github.com/wikimedia/research-mwaddlink/blob/main/run-pipeline.sh#L24-L29 | #1 ]] and [[ htt...
[05:43:14] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (ayounsi)
[07:09:05] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (MoritzMuehlenhoff)
[07:24:17] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Ladsgroup) MW section masters: - db1100: s5 - db1131: s6 - db1181: s7 Need to downtime the whole sections for these. I'll do it a b...
[07:41:19] hi folks! I am going to a doctor appt, will be back online in an hour or so
[07:54:00] o/
[09:02:30] Machine-Learning-Team, Gerrit, serviceops-radar, Language-Team (Language-2023-January-March): Create Gerrit repository for /services/machinetranslation and migrate code from Gitlab - https://phabricator.wikimedia.org/T331256 (hashar) The CI Pipeline got added by 9468bc90c866bb7d2bd21cf48065ae64e7...
[09:15:31] so I am thinking about how we'll deploy ores-legacy on our clusters
[09:16:22] I don't think that we should re-use inference.discovery.wmnet for it (the Load Balancer IP I mean)
[09:16:35] but probably something completely different, that we can control
[09:16:42] like ores-legacy.discovery.wmnet
[09:17:12] and on the istio front, we'll have another set of cfgs that should not impact our current settings
[09:17:34] for staging, maybe we should have ores-legacy-staging.svc.codfw.wmnet
[09:36:53] fine by me. I agree that it shouldn't be the same as liftwing. other than that I'll need some guidance as I'm not that familiar
[09:37:07] with deploying on our clusters
[09:37:56] oh yes yes I'll take care of that part
[09:38:03] I mean the LB IPs config etc..
[09:38:20] and for the istio configs
[09:38:40] but they are autogenerated, probably we may need to tweak them a bit when we discover bug
[09:38:43] *bugs
[09:47:15] ack
[10:18:16] kevin: aiko: I added you as reviewers so that you are aware. if you want we can go through the patch together
[10:41:54] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (elukey)
[11:11:01] * isaranto lunch
[12:09:55] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (ssingh)
[12:10:19] hey folks I am helping DE to prep for the upcoming network maintenance
[12:10:25] ping me if you need anything
[12:13:14] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (aborrero)
[12:30:42] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Stevemunene)
[12:33:07] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (aborrero)
[12:34:56] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (aborrero)
[12:36:54] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (Trizek-WMF) Open→Resolved I checked, and they all work, including `it.wp`.
[12:37:39] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (aborrero)
[12:39:06] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (MatthewVernon)
[12:39:48] Machine-Learning-Team, Data-Engineering, Research, Event-Platform Value Stream (Sprint 11): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (achou) > We could def put them in the same event stream, as long as they share the same...
[12:44:12] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (MatthewVernon)
[12:45:11] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad...
[12:45:23] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (aborrero)
[12:47:01] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (aborrero)
[12:51:16] Lift-Wing, Machine-Learning-Team, Documentation: Improve Lift Wing documentation - https://phabricator.wikimedia.org/T316098 (elukey) >>! In T316098#8335170, @Miriam wrote: > HI @AikoChou this is wonderful wonderful, thank you so much! > > Most of my feedback was included already in Isaac's comments...
[12:52:27] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (aborrero)
[12:58:05] isaranto: o/ the chart looks good, I added Janis and Alex as reviewers, so they'll be able to comment as well. Once we have their approval we are ready to go
[12:59:32] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Stevemunene)
[13:03:11] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=80a32cef-9700-4047-8185-415ffca1aaa2) set by ayounsi@cumin1001 for 2:0...
[13:05:58] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad...
[13:15:31] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (hnowlan)
[13:28:27] network maintenance completed, just repooled the two ores nodes
[13:36:36] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (ayounsi) Open→Resolved a:ayounsi Closing the task as the upgrade is done. It went extremely smoothly, thank you everybody!...
[13:48:58] Machine-Learning-Team, Data-Engineering, Research, Event-Platform Value Stream (Sprint 11): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (Ottomata) Okay, so it sounds like we are back to our preferred choice: one prediction pe...
[14:21:43] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Stevemunene)
[14:22:15] Machine-Learning-Team, Foundational Technology Requests: Content Translation Recommendations API - https://phabricator.wikimedia.org/T293648 (calbon)
[14:43:38] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row C...
[15:00:21] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row C...
[15:08:53] TIL https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-eks-kubeflow-tutorials-inference.html
[15:10:41] which part is the TIL?
[15:11:04] sry for asking 😸
[15:12:00] that amazon had Kserve ready to use
[15:12:08] I thought it was via manual install
[15:12:42] I'd be curious to see how they maintain it in AWS
[15:13:54] aa
[15:14:49] I think it comes bundled with the kubeflow installation (or used to)
[15:16:09] one of the issues with kubeflow on eks is that it follows its own release cycle, so new versions ship at different times
[15:17:13] oh but I see this has changed now... in the past KF on aws was on KF 1.4 when KF was already on 1.6. Now I see there is 1.6.1 available. if one's infra is on aws it makes a lot of sense to have a managed KF installation
[15:17:35] and not have to maintain the beast 😓
[15:25:00] I fear that the worst part of it will be when we deploy Kubeflow on DSE
[15:26:04] sure...
[15:26:20] maybe not deploying.. debugging + maintaining
[15:28:57] I fear the initial setup more, it seems like a really big one
[15:29:16] we have more experience with kserve, hopefully we'll be able to re-use all the istio+knative-serving configs
[15:34:08] I have some experience with the initial setup. It is a bit painful but I think we can tackle it easily (especially now that u have the istio experience)
[15:34:12] but... FLW
[15:34:18] famous last words
[16:42:40] * elukey afk!
[17:53:24] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Jelto)
[19:50:34] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (colewhite)