[07:03:03] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui)
[07:10:25] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui)
[07:45:52] Good morning team!
[07:58:48] o/
[08:04:10] so the gpu-tester docker image, containing only tensorflow and ROCm stuff, is around 8GB in size
[08:04:14] /o\
[08:04:22] still not working, missing bits and pieces
[08:05:02] so we'll need to be clever for other docker images, to share as many layers as possible
[08:22:34] Morning
[08:23:16] elukey: \o 8G? even between TF and ROCm? Or is one of the two dominating?
[08:28:33] klausman: ROCm for sure, I am trying not to deploy all the things
[08:28:35] see this
[08:28:36] elukey@dse-k8s-worker1001:~$ du -hs /opt/rocm-5.4.0/
[08:28:36] 13G /opt/rocm-5.4.0/
[08:28:42] Ouch
[08:28:57] tf-rocm is around a gig or more IIRC
[08:33:10] I am pretty sure that CUDA is a similar horror :D
[08:34:34] Oh, absolutely
[08:38:03] I would also suspect that the chance of trimming rocm down is higher since it's not quite as singular-blobby
[08:42:37] FYI, I fixed orespoolcounter2004, this was caused by the ens5->ens13 interface rename we see on VMs which have been rebooted after the switch of the KVM machine type
[08:43:00] moritzm: I completely forgot to check yesterday, thanks a lot!
[08:43:11] Ah, the wonderful legacy of "consistent interface names" :D
[08:44:23] klausman: we should do a better job in patrolling alerts.wikimedia.org, yesterday Moritz alerted us as well and we both didn't do anything :)
[08:44:47] Yeah, I should make it a pinned tab in my browser
[08:46:41] no, for some reason these didn't show up in alerts.w.o (and also not icinga.w.o/alert)
[08:47:09] I have no idea why, I pinged #wikimedia-observability to have a look
[08:47:09] Well, there are a bunch of secondary alerts for ores2004
[08:47:21] orespoolcounter2004*
[08:47:33] e.g. CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
[08:47:34] yeah, there was also a secondary alert for irc2001 (which had the same issue)
[08:47:45] but we used to have explicit host down alerts in the past for sure
[08:48:02] I guess that this may be part of the sprint week
[08:48:12] Those would also be useful to avoid alert storms for the host
[08:50:13] yeah, we need a middle ground between alert storm and alert lull :-)
[09:06:14] klausman: do you have time to prepare and follow the switch maintenance in row D later on?
[09:06:29] yeah, can do
[09:06:35] thanks :)
[09:19:07] crw-rw---- 1 root 106 242, 0 Apr 18 08:43 kfd
[09:19:10] ufff
[09:20:13] so I guess that 106 is the render group
[09:22:25] of course the ROCm upstream image for tests runs everything as root
[09:22:28] to keep it simple
[09:27:32] but
[09:27:33] crw-rw---- 1 root video 226, 1 Apr 18 08:43 /dev/dri/card1
[09:32:55] is video==106?
[09:33:43] nope, on dse-k8s-worker the kfd is assigned to root:render, /dev/dri/card{0,1} to root:video
[09:33:54] Hrm.
[09:34:16] So whatever is running inside of it would need to be in both groups or run as root.
[09:36:01] I think that render is sufficient, I see that on stat100x we have users only in 'render'
[09:36:28] I think that a user goes through /dev/kfd to access the GPUs
[09:37:25] but in our bullseye base image the render group is not there
[09:38:31] I am unsure about Debian GIDs: are they dynamically allocated for system groups, or is there a set plan?
[09:38:58] (my Bookworm install here has 106 as netdev, so I suspect it's dynamic)
[09:39:27] we have fixed gids in production, but not for docker images (only fixed uids)
[09:40:43] So would we create the user with the "expected" GID and then install the rest of the image?
[09:42:44] (side note: orespoolcounter2004 now has no alerts firing in Icinga/alerts.w.o)
[09:42:48] in theory I'd say that a simple `groupadd render` + `usermod -a -G render nobody` may work, but I'd say that we should talk with service ops to know what the plans are for docker-pkg
[09:43:02] Agreed
[09:43:10] we have known_id_mappings in its python code, maybe something similar for groups would be good
[09:43:33] I'd just be wary of breaking something that assumes entirely dynamic/hardcoded GIDs
[09:52:03] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (MoritzMuehlenhoff)
[09:53:28] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (MoritzMuehlenhoff)
[09:56:20] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ArielGlenn)
[10:13:37] <- Lunch and an errand (new glasses!)
[10:17:21] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (eoghan)
[10:20:17] Machine-Learning-Team, GitLab (Project Migration): Move add-a-link to gitlab - https://phabricator.wikimedia.org/T334605 (kostajh) Sure, I'm around and happy to help fix/diagnose issues.
[10:23:47] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (MoritzMuehlenhoff)
[10:26:35] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (MoritzMuehlenhoff)
[10:31:00] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (MoritzMuehlenhoff)
[10:37:54] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (kostajh) >>! In T308133#8788082, @Tgr wrote: >>>! In T308133#8774966, @Sgs wrote: >> I'm still investigating this; the configuration i...
[10:38:04] ok the gid 106 is the one from the underlying dse-k8s-worker, didn't think about it
[10:38:17] so it gets mapped as it is
[10:38:55] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (jbond)
[10:40:38] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (Ladsgroup) We don't switchover misc databases: https://orchestrator.wikimedia.org/web/cluster/alias/m2 The replica exists but doesn't...
[10:44:02] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 (kostajh) >>! In T308133#8788818, @Ladsgroup wrote: > We don't switchover misc databases: https://orchestrator.wikimedia.org/web/cluste...
[11:02:03] * elukey lunch
[11:51:12] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row...
[11:54:24] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (BTullis)
[11:58:10] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (BTullis)
[11:58:44] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (hnowlan)
[12:21:26] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ssingh)
[12:25:35] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row...
[12:27:28] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row...
[12:27:42] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row...
[12:39:11] Machine-Learning-Team, ORES, Wikimedia Enterprise: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854 (Ottomata) > Would it be possible to provide a unified stream like revision-score that has all the relevant model scores? @prabhat, interesting. In {T331401}, we are design...
[13:11:26] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7fc7ae6f-d3b2-43ed-b030-194ed6367c80) set by cmooney@cumin1001 for 2:0...
[13:12:10] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (cmooney)
[13:17:03] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e714b564-285e-4f22-b860-267d7c23208d) set by cmooney@cumin1001 for 2:0...
[13:17:47] klausman: o/ maintenance is about to start
[13:17:58] aye, am aware
[13:17:59] I put the wrong date in the gcal, just realized it
[13:18:28] klausman: sure but the ores nodes are still pooled afaics
[13:18:30] I thought the extra 1h was for prep :)
[13:18:36] https://config-master.wikimedia.org/pybal/eqiad/ores
[13:19:27] depooled.
[13:19:36] thanks :)
[13:19:40] I had the command typed up but forgot to hit return %-)
[13:21:43] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (klausman)
[13:42:13] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (Marostegui) dbproxy[1016-1017] reloaded
[13:50:38] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (klausman)
[13:52:42] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (MoritzMuehlenhoff)
[14:01:21] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (cmooney)
[14:54:11] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ayounsi)
[15:05:19] Machine-Learning-Team, ORES, Wikimedia Enterprise: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854 (elukey) >>! In T330854#8786808, @prabhat wrote: > @elukey Thanks again. > > 1. Regarding caching, I have discussed with the team. @HShaikh will set up a meeting for us to d...
[15:08:07] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of m...
[15:11:27] Machine-Learning-Team, ORES, Wikimedia Enterprise: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854 (Ottomata) > There could be different solutions, Or, have change prop (or other processor), request multiple LiftWing model endpoints for each page change, and add each resul...
[15:12:17] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 9th round of wikis - https://phabricator.wikimedia.org/T308134 (Trizek-WMF) Any update?
[15:13:58] Machine-Learning-Team, ORES, Wikimedia Enterprise: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854 (elukey) >>! In T330854#8790077, @Ottomata wrote: >> There could be different solutions, > Or, have change prop (or other processor), request multiple LiftWing model endpoint...
[15:14:29] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (BTullis)
[15:16:58] * elukey taking a break
[15:35:34] Machine-Learning-Team, API-Portal, Platform Team Initiatives (API Gateway Roadmap): Add documentation about LiftWing to the API Portal - https://phabricator.wikimedia.org/T325759 (JArguello-WMF)
[15:37:40] Machine-Learning-Team, ORES, Wikimedia Enterprise: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854 (Ottomata) Aye makes sense. But hm, the maintenance of streams should belong to the producers (data product owners) of those streams (Event Platform is trying to make this p...
[15:38:37] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of m...
[15:39:07] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of m...
[15:54:50] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of m...
[15:55:04] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of m...
[15:58:34] Machine-Learning-Team, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (MSantos)
[16:00:17] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of m...
[16:00:31] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of m...
[16:01:18] Machine-Learning-Team: Review and test the AMD GPU kubernetes plugin - https://phabricator.wikimedia.org/T333009 (elukey) Building a test image is turning up to be a difficult job, mainly due to permissions of the devices exposed by the k8s plugin. For example, on dse-k8s-worker1001 we have the following dev...
[16:02:38] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of m...
[16:03:57] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of m...
[16:04:20] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of m...
[16:08:18] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of m...
[16:08:37] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (ops-monitoring-bot) cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: End of m...
[16:29:44] going afk folks!
[16:34:20] I opened https://github.com/RadeonOpenCompute/k8s-device-plugin/issues/39 to upstream in case they have suggestions (I have very few hopes that they'll answer but..)
[16:34:41] Machine-Learning-Team: Review and test the AMD GPU kubernetes plugin - https://phabricator.wikimedia.org/T333009 (elukey) Opened https://github.com/RadeonOpenCompute/k8s-device-plugin/issues/39 to upstream to get some feedback.
[16:39:44] Machine-Learning-Team, ORES, Wikimedia Enterprise: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854 (elukey) Definitely we would maintain the streams, but from users of a system (at least this is my personal view). For example, it would be great if the ML team wouldn't nee...
[16:42:11] Machine-Learning-Team, ORES, Wikimedia Enterprise: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854 (elukey) >>! In T330854#8790216, @Ottomata wrote: > I ask because it would help us answer the question in T331401#8690845: It would really help if a data product owner would...
[16:54:43] Machine-Learning-Team, Anti-Harassment, Cloud-Services, Content-Transform-Team, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (hnowlan)
[17:03:57] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (cmooney) Open→Resolved All works complete, no issues to report.
[17:14:49] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (VirginiaPoundstone)
[17:35:52] Machine-Learning-Team, DBA, Data-Engineering, Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (colewhite)
[17:37:41] Machine-Learning-Team, ORES, Wikimedia Enterprise: Investigate tools that use ORES - https://phabricator.wikimedia.org/T330854 (Ottomata) Ah great, yes then. EP aims to make it easy for teams to build, deploy, and maintain simple streaming enrichment jobs (like this). We are dogfooding this with pa...
[22:05:36] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (Eevans) I can be the point of contact for [[ https://gerrit.wikimedia.org/g/mediawiki/services/kask | mediawiki/services/kask ]], a...