[00:23:22] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (tstarling)
[00:26:08] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (tstarling) Will the GitHub mirrors be switched over to replicate from GitLab? This is necessary for libraries like Shellbox that us...
[06:29:18] Morning!
[06:37:32] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Marostegui) @Ladsgroup During this operation, replication codfw -> eqiad is still active, so as there are codfw masters involved (even...
[06:50:48] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 17th round of wikis - https://phabricator.wikimedia.org/T308143 (kevinbazira)
[07:35:48] (PS9) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[07:37:07] (CR) jenkins-bot: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953) (owner: Ilias Sarantopoulos)
[08:04:07] o/ morning :)
[08:15:37] hello :)
[08:32:00] (PS10) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[08:33:26] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953) (owner: Ilias Sarantopoulos)
[08:37:06] Machine-Learning-Team, Analytics-Radar, Data-Engineering-Icebox, Patch-For-Review: Upgrade ROCm to 5.4 - https://phabricator.wikimedia.org/T295661 (elukey)
[08:38:56] Machine-Learning-Team, Epic: Add GPUs to the Machine Learning infrastructure - https://phabricator.wikimedia.org/T333462 (elukey)
[08:39:02] Machine-Learning-Team, Analytics-Radar, Data-Engineering-Icebox: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (elukey) Open→Declined New procedure found and documented in T295661
[08:49:02] Machine-Learning-Team: Investigate procuring and installing GPUs on Lift Wing - https://phabricator.wikimedia.org/T327923 (elukey) Another good article to keep as reference for Nvidia: https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/
[08:54:32] (PS11) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[08:56:12] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953) (owner: Ilias Sarantopoulos)
[08:58:43] Machine-Learning-Team, Epic: Test KServe inference batching - https://phabricator.wikimedia.org/T335480 (elukey)
[09:03:41] Machine-Learning-Team: Investigate procuring and installing GPUs on Lift Wing - https://phabricator.wikimedia.org/T327923 (elukey) Some news on the AMD front - we successfully tested GPUs on K8s in T333009 (DSE cluster), and the KServe upstream folks suggested to use inference batching to improve the through...
[09:04:01] \o Morning. fighting with isp issues atm
[09:06:17] morning! ack
[09:12:26] isp always wins :)
[09:13:47] Well, they're trying to convince me to switch to 10g, so currently I have negotiation power :)
[09:31:11] klausman: I am going to reimage ml-cache1001 to bullseye, to test if it works or not
[09:31:55] isaranto: I have some ideas for ores-legacy, but I'd need to experiment a little with Istio. One thing that we could do is to have a separate set of Istio gateways, separated from the inference ones
[09:32:11] so they'd listen to a different port, and we'd have a separate endpoint for ores legacy
[09:32:23] ack
[09:33:09] one thing I need to do is add https in the application. I'll need to include configuration for the certificates directory
[09:34:30] Machine-Learning-Team, SRE: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache1001.eqiad.wmnet with OS bullseye
[09:35:12] isaranto: in theory it is handled by istio, not sure if you need to do anything besides config
[09:35:51] hmm I'll have to consult with fastapi. as far as I remember I need to set a path
[09:36:07] will do it afterwards then to be sure
[09:37:03] (CR) AikoChou: "Thanks for the review! Please let me know if I misunderstood anything." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/907923 (https://phabricator.wikimedia.org/T331401) (owner: AikoChou)
[09:38:28] isaranto: so istio gateways will terminate TLS, and contact FastAPI via http, so the only unencrypted traffic will be between k8s nodes in the same cluster
[09:38:41] not sure what service ops does, for an initial use case it should be fine
[09:39:03] aa ok then
[09:39:04] elukey: ack, re: ml-cache1001
[09:48:44] klausman: I think that for ores-legacy (and possibly any other future services) we'd need to have a separate set of istio gateway pods, listening on another port (not 30443)
[09:49:03] so we can keep the inference endpoint separated
[09:50:45] the main issue right now is that ores-legacy is deployed on the same istio gateway pods that inference uses, so we have to use the same vip as well
[09:50:56] that is very confusing, I think we should keep things separated
[09:54:37] Machine-Learning-Team, SRE, Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache1001.eqiad.wmnet with OS bullseye executed with errors: - ml-cache1001 (**FAIL*...
[09:55:17] Machine-Learning-Team, SRE, Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache1001.eqiad.wmnet with OS bullseye
[09:59:24] elukey: agreed. I just hope doing it separately isn't a ton of work
[10:00:11] a little bit, but it shouldn't be that hard
[10:09:36] Machine-Learning-Team, SRE, Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache1001.eqiad.wmnet with OS bullseye
[10:12:16] (PS12) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[10:14:14] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953) (owner: Ilias Sarantopoulos)
[10:18:59] * elukey lunch!
[10:36:58] * isaranto lunch as well
[12:27:21] Machine-Learning-Team, SRE, Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache1001.eqiad.wmnet with OS bullseye completed: - ml-cache1001 (**PASS**) - Remo...
[12:27:52] ok ml-cache1001 is on bullseye
[12:28:36] moritzm: o/ so cqlsh works fine, the only nit is that cassandra doesn't use a fixed uid/gid so when starting the instances after the reimage a chown -R cassandra:cassandra is needed
[12:29:22] Machine-Learning-Team, SRE, Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye
[12:39:58] elukey: nice, I guess the chown is fine, given that it's really just a one-time reimage step
[12:45:34] moritzm: yeah also we need to explicitly allow the service unit to work, so not a big deal, but a fixed uid/gid would be nicer :)
[12:57:27] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (fgiunchedi)
[12:58:51] I mean we can do it like for the Hadoop installs
[12:59:18] assign a fixed one in data.yaml and then apply it for cassandra installations starting with the bullseye reimages
[12:59:30] could be an option yes
[12:59:38] along with a one time sync to the fixed value
[12:59:58] we can ask Eric what the preferred way is, not a big trouble
[13:04:58] Machine-Learning-Team, SRE, Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye completed: - ml-cache1002 (**WARN**) - Down...
[13:32:43] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (Jdforrester-WMF) >>! In T332953#8809416, @tstarling wrote: > Will the GitHub mirrors be switched over to replicate from GitLab? Thi...
[13:33:23] Machine-Learning-Team, SRE, Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache1003.eqiad.wmnet with OS bullseye
[13:39:08] ml-cache1002 reimaged, doing 1003 as well
[13:39:52] klausman: do you want to do codfw during the next days?
[13:48:28] (PS13) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[13:49:54] (CR) CI reject: [V: -1] feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953) (owner: Ilias Sarantopoulos)
[13:50:54] I'm getting used to being rejected --^
[13:50:56] :)
[13:51:41] isaranto: everything is docker based, maybe there is a way to run it locally
[13:51:55] (not that I mind seeing -1s, it is more for your dev cycles)
[13:52:00] (PS14) Ilias Sarantopoulos: feat: use Lift Wing instead of ORES [extensions/ORES] - https://gerrit.wikimedia.org/r/910439 (https://phabricator.wikimedia.org/T332953)
[13:53:51] yes you're right. I set up codesniffer and formatter but didn't do it for the tests
[13:55:30] finally green light :)
[13:55:42] \o/
[13:55:58] I am working on the ores-legacy istio setup, hope to have something ready during the next days
[13:56:09] cool
[13:58:27] I have work to do anyway with the thresholds thingy. I'll probably add the same functionality to the ores-legacy endpoint after the extension
[13:59:17] yep yep, with the lvs endpoint it will likely be fixed during next week, but I'd like to have it ready asap so you can deploy/test/fix/etc.. without interruptions
[13:59:37] I won't join the meeting as I have to go in approx 15-20'
[14:05:55] elukey: can do. I think my ISP troubles are mostly over. Turns out, broken DHCPv6 can really ruin your day.
[14:10:01] Machine-Learning-Team, SRE, Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache1003.eqiad.wmnet with OS bullseye completed: - ml-cache1003 (**WARN**) - Down...
[15:03:45] Machine-Learning-Team, SRE, Patch-For-Review: Migrate ml-cache to Bullseye - https://phabricator.wikimedia.org/T331712 (elukey) a:klausman The eqiad cluster is on bullseye, these are the steps needed after a reimage to make a node work again: ` elukey@ml-cache1003:~$ sudo chown -R cassandra:cass...
[15:51:57] * elukey afk!
[15:52:01] have a nice rest of the day folks!
[20:53:21] (PS1) Umherirrender: Replace deprecated Hooks::run [extensions/ORES] - https://gerrit.wikimedia.org/r/912983 (https://phabricator.wikimedia.org/T335536)
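
Editor's note on the post-reimage chown discussed at 12:28 and 15:03: because the cassandra uid/gid is not fixed, files left on disk can end up owned by a stale numeric id after a reimage. The sketch below is only an illustration of how one might list such entries before running the chown; it is not part of the procedure in T331712. The function names `iter_tree` and `find_wrong_owner` are hypothetical, and the data directory to scan is not stated in the log, so it is left as a parameter.

```python
import os
from typing import Iterator, List, Tuple


def iter_tree(root: str) -> Iterator[str]:
    """Yield root itself and every directory and file below it."""
    yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def find_wrong_owner(root: str, uid: int, gid: int) -> List[Tuple[str, int, int]]:
    """Return (path, uid, gid) for entries not owned by the expected uid/gid."""
    mismatches = []
    for path in iter_tree(root):
        # lstat reports on a symlink itself rather than on its target.
        st = os.lstat(path)
        if st.st_uid != uid or st.st_gid != gid:
            mismatches.append((path, st.st_uid, st.st_gid))
    return mismatches
```

On a host, the expected ids could be looked up with `pwd.getpwnam("cassandra").pw_uid` and `grp.getgrnam("cassandra").gr_gid` from the standard library and passed in together with whatever data directory the instance uses. An empty result would mean the tree is already consistent; a fixed uid/gid, as suggested at 12:59, would make the check unnecessary.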