[06:10:24] Analytics, observability: Need a list of AQS Kibana dashboards and searches - https://phabricator.wikimedia.org/T285318 (elukey) +1 :)
[06:26:50] Good morning
[07:57:21] !log execute "sudo /opt/rocm/bin/rocm-smi --gpureset -d 1" on an-worker1101 as attempt to unblock the GPU
[07:57:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:01:17] !log reboot an-worker1101 to unblock stuck GPU
[08:01:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:09:26] an-worker1101 up :)
[08:09:35] apologies if it caused alarms in alerts
[08:09:56] I saved the dmesg with all the amd driver horror in my home dir
[08:10:29] Aiko and Miriam are trying to schedule multiple jobs at the time on the GPUs via tensorflow and ROCm doesn't like it
[08:15:10] :) yeahh ^
[08:16:08] Being able to schedule multiple jobs on the resource is a known issue of using labels on yarn :(
[08:20:16] joal: but we don't know why
[08:20:33] if it is a tensorflow-only problem or a ROCm issue
[08:21:09] we had the same problem on stat100x, from the horror in the logs it looks like saturating memory on the GPU is a problem for the drivers (even radeontop hangs)
[08:21:21] but in theory it should be a little more graceful
[08:21:34] elukey: In my mind, scheduling multiple jobs on the same GPU is not a good idea - very possibly I'm wrong and systems should be able to do it, but I always thought it was a bad idea
[08:22:06] joal: could be the problem yes, I am wondering if nvidia works the same or not
[08:22:13] I have no idea
[08:22:29] gehel: Good morning
[08:22:38] gehel: would you have a minute for a java sinner?
[08:23:01] anyway, aikoChou and miriam, we can do some tests but if the issue keeps happening let's document the limitation and work one job at the time. does it make sense?
[08:23:13] (we'll also have the same issue on kubeflow)
[08:23:23] joal: for you always!
[08:23:46] joal: meet.google.com/wst-wiao-ono
[08:25:02] gehel never said something like that to me
[08:25:06] * elukey cries in a corner
[08:25:10] :D
[08:25:28] but I know that Joseph is Joseph, I cannot compete
[08:25:29] that's the cabale of the French speaking people
[08:26:13] elukey Got it, thanks!(y) (y) :]
[08:27:49] joal: yes probably a bad idea :D but we tried on the stat machines and it worked pretty well (we had to share that resource across multiple researchers) - in theory tf has functions to allow sharing GPU memory
[08:30:11] elukey: the team of GPU-breakers is growing :D
[08:31:22] elukey - one question I have is: to avoid people scheduling more than one job at the time, there should be a way for users to see that the GPUs are already in use. Is grafana the way to go? should we document to double check Grafana before running such jobs?
[08:58:19] miriam: here I am sorry
[08:58:59] miriam: so the "solution" that me and Joseph came up with a while ago was to initially allow only one job at the time in the Yarn queue that is able to use GPUs
[08:59:04] the "fifo" one
[08:59:40] it is not ideal, but unless people explicitly set a single job to run concurrent things on the same GPU, we should be covered
[08:59:51] the documentation should highlight this if possible
[09:00:09] it is not a great deal in my opinion, but it is a compromise until we know better
[09:00:16] does it make sense?
[09:00:21] Ooh yes ok, sorry I missed that part! Then if that is easy to implement at that level, no need to point people to Grafana!
[09:00:31] yes makes sense elukey :)
[09:00:54] thanks!!
[09:01:27] thank you and aikoChou for the work :)
[09:16:50] (PS1) GoranSMilovanovic: T277564 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/701494
[09:17:05] (CR) GoranSMilovanovic: [V: +2 C: +2] T277564 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/701494 (owner: GoranSMilovanovic)
[14:07:55] a-team: urgent question - how long does the monthly per-domain unique devices query take on average? like, if I were to re-count the past two months with some different dimensions, how long would I expect those jobs to take?
[14:11:56] bearloga: trying to find out
[14:12:40] milimetric: thank you for looking into it!
[14:13:16] bearloga: so you're talking about this thing, right? https://github.com/wikimedia/analytics-refinery/blob/master/oozie/unique_devices/per_domain/monthly/unique_devices_per_domain_monthly.hql
[14:13:41] milimetric: yep
[14:14:55] it's not gonna be quick, I'm looking through hue now to get you a better estimate but I'm guessing more than an hour less than a day
[14:16:35] probably helps that it's not running on top of webrequest so most of the heavy lifting is done by the pageview actor job, huh?
[14:19:24] bearloga: looks like about 5 hours: https://hue.wikimedia.org/hue/jobbrowser/#!id=0057991-210426062240701-oozie-oozi-W
[14:19:27] (per month)
[14:19:45] yes, pageview_actor helps a lot, and so does parquet and filtering by stuff like is_pageview
[14:19:58] milimetric: thank you SO MUCH! <3
[14:20:31] bearloga: np (I checked a few more months and it's the same, about 5 hours per month)
[14:21:03] now I'm not sure what the other dimensions you want to include are, but I don't think that would change the timing too much
[14:24:22] that's what I'm thinking too
[16:00:35] Analytics, Discovery-Search: Airflow dags depending on eventgate events not able to detect data availability during DC switchover - https://phabricator.wikimedia.org/T262326 (mforns) @EBernhardson Hi! I'll add this task to the Analytics Phab board as well. I believe this is something we will bump into at...
[16:12:59] Analytics, Discovery-Search: Airflow dags depending on eventgate events not able to detect data availability during DC switchover - https://phabricator.wikimedia.org/T262326 (EBernhardson) I believe this is resolved with the introduction of canary events for all datacenters to eventstream config. With ca...
[16:19:33] Analytics, Analytics-EventLogging, Performance-Team, Patch-For-Review: EventLogging background queue beforeunload event handler blocks Back-Forward cache - https://phabricator.wikimedia.org/T285220 (DLynch) Open→Resolved a:odimitrijevic→Gilles
[16:58:26] Analytics, Discovery-Search: Airflow dags depending on eventgate events not able to detect data availability during DC switchover - https://phabricator.wikimedia.org/T262326 (mforns) @EBernhardson makes sense! On my side, feel free to close the task, then! If necessary, we can think of this in the future...
[18:01:48] heya milimetric :] yesterday you said we could meet today for xcoms, now I realize today is friday, so whatever you prefer, I don't mind at all meeting today, the opposite, lmk!
[18:03:46] mforns: let's chat! Gimme a couple min
[18:04:01] cool
[18:06:12] ok, mforns omw cave
[18:37:34] Analytics, Discovery-Search (Current work): Airflow dags depending on eventgate events not able to detect data availability during DC switchover - https://phabricator.wikimedia.org/T262326 (EBernhardson)
[18:38:04] Analytics, Discovery-Search (Current work): Airflow dags depending on eventgate events not able to detect data availability during DC switchover - https://phabricator.wikimedia.org/T262326 (EBernhardson) a:EBernhardson