[06:10:24] <wikibugs>	 Analytics, observability: Need a list of AQS Kibana dashboards and searches - https://phabricator.wikimedia.org/T285318 (elukey) +1 :)
[06:26:50] <joal>	 Good morning
[07:57:21] <elukey>	 !log execute "sudo /opt/rocm/bin/rocm-smi --gpureset -d 1" on an-worker1101 as attempt to unblock the GPU
[07:57:23] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:01:17] <elukey>	 !log reboot an-worker1101 to unblock stuck GPU
[08:01:19] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:09:26] <elukey>	 an-worker1101 up :)
[08:09:35] <elukey>	 apologies if it caused alarms in alerts
[08:09:56] <elukey>	 I saved the dmesg with all the amd driver horror in my home dir
[08:10:29] <elukey>	 Aiko and Miriam are trying to schedule multiple jobs at a time on the GPUs via tensorflow and ROCm doesn't like it
[08:15:10] <miriam>	 :) yeahh ^
[08:16:08] <joal>	 Being able to schedule multiple jobs on the resource is a known issue of using labels on yarn :(
[08:20:16] <elukey>	 joal: but we don't know why
[08:20:33] <elukey>	 if it is a tensorflow-only problem or a ROCm issue
[08:21:09] <elukey>	 we had the same problem on stat100x, from the horror in the logs it looks like saturating memory on the GPU is a problem for the drivers (even radeontop hangs)
[08:21:21] <elukey>	 but in theory it should be a little more graceful
[08:21:34] <joal>	 elukey: In my mind, scheduling multiple jobs on the same GPU is not a good idea - very possibly I'm wrong and systems should be able to do it, but I always thought it was a bad idea
[08:22:06] <elukey>	 joal: could be the problem yes, I am wondering if nvidia works the same or not
[08:22:13] <joal>	 I have no idea
[08:22:29] <joal>	 gehel: Good morning
[08:22:38] <joal>	 gehel: would you have a minute for a java sinner?
[08:23:01] <elukey>	 anyway, aikoChou and miriam, we can do some tests but if the issue keeps happening let's document the limitation and work one job at a time. does it make sense?
[08:23:13] <elukey>	 (we'll also have the same issue on kubeflow)
[08:23:23] <gehel>	 joal: for you always!
[08:23:46] <gehel>	 joal: meet.google.com/wst-wiao-ono
[08:25:02] <elukey>	 gehel never said something like that to me
[08:25:06] * elukey cries in a corner
[08:25:10] <elukey>	 :D
[08:25:28] <elukey>	 but I know that Joseph is Joseph, I cannot compete
[08:25:29] <gehel>	 that's the cabal of the French speaking people
[08:26:13] <aikoChou>	 elukey Got it, thanks!(y) (y) :] 
[08:27:49] <miriam>	 joal: yes probably a bad idea :D but we tried on the stat machines and it worked pretty well (we had to share that resource across multiple researchers) - in theory tf has functions to allow sharing GPU memory
[08:30:11] <miriam>	 elukey:  the team of GPU-breakers is growing  :D 
[08:31:22] <miriam>	 elukey - one question I have is: to avoid people scheduling more than one job at a time, there should be a way for users to see that the GPUs are already in use. Is grafana the way to go? should we document to double check Grafana before running such jobs?
[08:58:19] <elukey>	 miriam: here I am, sorry
[08:58:59] <elukey>	 miriam: so the "solution" that me and Joseph came up with a while ago was to initially allow only one job at a time in the Yarn queue that is able to use GPUs
[08:59:04] <elukey>	 the "fifo" one
[08:59:40] <elukey>	 it is not ideal, but unless people explicitly set a single job to run concurrent things on the same GPU, we should be covered
[08:59:51] <elukey>	 the documentation should highlight this if possible
[09:00:09] <elukey>	 it is not a great deal in my opinion, but it is a compromise until we know better
[09:00:16] <elukey>	 does it make sense?
[09:00:21] <miriam>	 Ooh yes ok, sorry I missed that part! Then if that is easy to implement at that level, no need to point people to Grafana!
[09:00:31] <miriam>	 yes makes sense elukey :)
[09:00:54] <miriam>	 thanks!!
[09:01:27] <elukey>	 thank you and aikoChou for the work :)
[09:16:50] <wikibugs>	 (PS1) GoranSMilovanovic: T277564 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/701494
[09:17:05] <wikibugs>	 (CR) GoranSMilovanovic: [V: +2 C: +2] T277564 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/701494 (owner: GoranSMilovanovic)
[14:07:55] <bearloga>	 a-team: urgent question - how long does the monthly per-domain unique devices query take on average? like, if I were to re-count the past two months with some different dimensions, how long would I expect those jobs to take?
[14:11:56] <milimetric>	 bearloga: trying to find out
[14:12:40] <bearloga>	 milimetric: thank you for looking into it!
[14:13:16] <milimetric>	 bearloga: so you're talking about this thing, right? https://github.com/wikimedia/analytics-refinery/blob/master/oozie/unique_devices/per_domain/monthly/unique_devices_per_domain_monthly.hql
[14:13:41] <bearloga>	 milimetric: yep
[14:14:55] <milimetric>	 it's not gonna be quick, I'm looking through hue now to get you a better estimate but I'm guessing more than an hour, less than a day
[14:16:35] <bearloga>	 probably helps that it's not running on top of webrequest so most of the heavy lifting is done by the pageview actor job, huh?
[14:19:24] <milimetric>	 bearloga: looks like about 5 hours: https://hue.wikimedia.org/hue/jobbrowser/#!id=0057991-210426062240701-oozie-oozi-W
[14:19:27] <milimetric>	 (per month)
[14:19:45] <milimetric>	 yes, pageview_actor helps a lot, and so does parquet and filtering by stuff like is_pageview
[14:19:58] <bearloga>	 milimetric: thank you SO MUCH! <3
[14:20:31] <milimetric>	 bearloga: np (I checked a few more months and it's the same, about 5 hours per month)
[14:21:03] <milimetric>	 now I'm not sure what the other dimensions you want to include are, but I don't think that would change the timing too much
[14:24:22] <bearloga>	 that's what I'm thinking too
[16:00:35] <wikibugs>	 Analytics, Discovery-Search: Airflow dags depending on eventgate events not able to detect data availability during DC switchover - https://phabricator.wikimedia.org/T262326 (mforns) @EBernhardson Hi! I'll add this task to the Analytics Phab board as well. I believe this is something we will bump into at...
[16:12:59] <wikibugs>	 Analytics, Discovery-Search: Airflow dags depending on eventgate events not able to detect data availability during DC switchover - https://phabricator.wikimedia.org/T262326 (EBernhardson) I believe this is resolved with the introduction of canary events for all datacenters to eventstream config. With ca...
[16:19:33] <wikibugs>	 Analytics, Analytics-EventLogging, Performance-Team, Patch-For-Review: EventLogging background queue beforeunload event handler blocks Back-Forward cache - https://phabricator.wikimedia.org/T285220 (DLynch) Open→Resolved a:odimitrijevic→Gilles
[16:58:26] <wikibugs>	 Analytics, Discovery-Search: Airflow dags depending on eventgate events not able to detect data availability during DC switchover - https://phabricator.wikimedia.org/T262326 (mforns) @EBernhardson makes sense! On my side, feel free to close the task, then! If necessary, we can think of this in the future...
[18:01:48] <mforns>	 heya milimetric :] yesterday you said we could meet today for xcoms, now I realize today is friday, so whatever you prefer, I don't mind at all meeting today, the opposite, lmk!
[18:03:46] <milimetric>	 mforns: let's chat!  Gimme a couple min
[18:04:01] <mforns>	 cool
[18:06:12] <milimetric>	 ok, mforns omw cave
[18:37:34] <wikibugs>	 Analytics, Discovery-Search (Current work): Airflow dags depending on eventgate events not able to detect data availability during DC switchover - https://phabricator.wikimedia.org/T262326 (EBernhardson)
[18:38:04] <wikibugs>	 Analytics, Discovery-Search (Current work): Airflow dags depending on eventgate events not able to detect data availability during DC switchover - https://phabricator.wikimedia.org/T262326 (EBernhardson) a:EBernhardson