[02:04:45] [[Tech]]; AntiCompositeNumber; /* Change of license on WikiNews */ Reply; https://meta.wikimedia.org/w/index.php?diff=27005179&oldid=26996525&rcid=31600484
[07:03:24] [[Tech]]; MGA73; /* Change of license on WikiNews */ thank you; https://meta.wikimedia.org/w/index.php?diff=27005891&oldid=27005179&rcid=31601670
[16:39:09] is there any way to get a list of what processes/messages/jobs are being executed, and for how long, on the production job queue runners?
[16:39:45] things seem to be backing up on webVideoTranscodePrioritized, i wanna check if any of these are runaway processes
[16:39:58] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&viewPanel=15&from=now-15d&to=now
[16:42:20] bvibber: not an easy one, no
[16:42:26] :D
[16:43:00] We can check if there are ffmpeg processes that have been running for more than x time
[16:43:54] but matching that to an actual job :/
[16:44:18] claime: dumb question, why is it so hard?
[16:44:28] cdanis: short answer, shellbox
[16:44:57] does it not propagate x-request-id ?
[16:46:59] sorry, not shellbox, firejail for videoscalers
[16:49:24] ahh
[16:49:37] we can't dump some metadata into the chroot directory?
[16:49:41] cdanis: basically we have a log of the command that the jobrunner runs, but I'm not aware of a way to link that log to the actual process on the videoscaler
[16:49:48] right
[16:50:53] we probably can, but I don't think we want to invest time doing that when we're working on moving it to k8s
[16:51:13] unless it's a simple change, but I don't know enough about the code to say what it would take
[16:51:21] Probably it's not that much
[16:56:49] well, it'd be nice to at least see runtimes and command lines for the ffmpeg processes
[16:57:10] eg if dozens have been running for 3 days straight that'd be good to know :D
[16:57:35] also, a way to kill those processes.
[16:58:30] i suspect some of those 4K videos are either getting stuck or just taking a LONG TIME :D
[17:03:33] bvibber: so err, I can cumin a dirty ps -e -o pid,etimes,command | grep ffmpeg | awk '{if($2>259200) print $0}' to get that :p
[17:03:41] And there's only one
[17:03:43] ooh
[17:03:48] interesting
[17:04:10] anything > 86400?
[17:04:17] maybe it got past the slowest ones but it's still catching up heh
[17:06:15] there's more yeah
[17:07:03] 18, all on the same node
[17:10:30] <_joe_> claime: still linked to a parent pid?
[17:10:42] checking
[17:11:00] <_joe_> because php should 100% timeout and kill the ffmpeg
[17:12:01] ERROR: commit 02b9104: email address bvibber@wikimedia.org is not registered in your account, and you lack 'forge committer' permission.
[17:12:02] o_O
[17:12:08] wait i might've messed something up
[17:12:53] yep, ssh url is the old account lol
[17:13:14] _joe_: yeah they still have a scripts/ffmpeg-encode.sh parent
[17:13:32] <_joe_> and that has a parent that's a php-fpm process?
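
(A minimal sketch of the check being discussed, assuming GNU ps/awk on the videoscalers; only the ffmpeg match and the one-day threshold come from the conversation, the rest is illustrative:)

    #!/bin/bash
    # List ffmpeg processes running longer than THRESHOLD seconds, adding
    # the parent PID so reparented (orphaned) processes stand out.
    # Note: adding ppid shifts etimes from awk's $2 (as in the one-liner
    # above) to $3.
    THRESHOLD=86400   # one day, the cutoff asked about above
    ps -e -o pid,ppid,etimes,command \
      | grep '[f]fmpeg' \
      | awk -v t="$THRESHOLD" '$3 > t'

(The [f]fmpeg bracket trick keeps grep from matching its own command line.)
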
[17:13:49] <_joe_> ok so the bash script isn't the culprit
[17:14:35] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1049608 <- here's a quick hack to temp disable the 2160p and 1440p transcodes until i fix some perf problems and we figure out why it's not timing out the long runs :D
[17:19:43] _joe_: No, they have pid 1 as an ancestor, not php-fpm
[17:19:59] o_O
[17:20:12] php-fpm timing out doesn't kill child processes
[17:20:26] they get zombied and adopted IME
[17:20:31] *sob*
[17:20:35] well that explains that
[17:20:39] yeah
[17:20:42] (I've only been poking at shellbox, so I'm not sure it applies here, but it sure does happen there)
[17:20:45] <_joe_> kamila_: uh that's new
[17:20:48] i thought we had a specific timeout on the php end though, maybe i misremember
[17:20:51] <_joe_> and bad
[17:20:57] there is a timeout inside the bash script
[17:21:04] which actually calls /usr/bin/timeout
[17:21:07] and that one should work
[17:21:19] 921600 is the timeout
[17:21:20] <_joe_> kamila_: yeah and it seems not to
[17:21:21] so err
[17:21:22] (again assuming my staring at shellbox applies here)
[17:21:23] oh
[17:21:25] <_joe_> uh ok that's a bit high
[17:21:31] that will take a minute
[17:21:36] <_joe_> kamila_: 100% this is just executed in-place
[17:21:39] <_joe_> ok so
[17:21:52] ok, so that's shellbox specific then, nvm
[17:21:53] <_joe_> kamila_: so to understand better
[17:22:03] <_joe_> kamila_: no I was saying
[17:22:06] <_joe_> we're still using shellbox
[17:22:10] <_joe_> via php-fpm
[17:22:17] <_joe_> just within mediawiki
[17:22:25] oh, ack
[17:22:37] <_joe_> kamila_: just to clarify - you're talking about php-fpm killing its processes
[17:22:46] <_joe_> not apache cancelling the request
[17:23:13] if the request times out (whether by apache or php-fpm settings), the bash subprocess doesn't get killed
[17:24:03] yeah that's pretty apparent
[17:24:04] it gets adopted, on my machine by the php-fpm master process
[17:24:23] (on my machine == in my setup, sorry)
[17:24:53] <_joe_> kamila_: uhm ok that is a change in behaviour compared to the past
[17:25:09] that is what I observed
[17:25:17] <_joe_> maybe with 7.4?
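
(A sketch of how the orphans described above could be spotted: when the parent php-fpm worker exits without killing its children, they are reparented to PID 1, so filtering on PPID 1 plus an ffmpeg match should surface exactly those processes. Illustrative, assuming GNU ps/awk:)

    #!/bin/bash
    # Find transcode processes whose parent died without killing them:
    # after reparenting, their PPID is 1 (init/systemd).
    ps -e -o pid,ppid,etimes,command \
      | awk '$2 == 1 && /ffmpeg/'
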
anyways
[17:25:29] <_joe_> if that's the case we need to shorten the timeout in the bash script
[17:25:31] I'm not 100% sure the mw and shellbox images are the same
[17:25:33] <_joe_> or allow overriding it
[17:25:43] <_joe_> they both use 7.4
[17:26:03] IME with the shellbox image the only relevant timeout is the one in the command line wrapper
[17:26:07] here if I take one of the > 24h ffmpeg processes, the shellbox shell script's parent is pid 1, definitely
[17:26:21] so they should probably be killed, right? nothing will pick them up
[17:26:29] I believe so
[17:26:32] <_joe_> claime: yeah I assumed it would be the same event for everything
[17:26:55] <_joe_> something's not right though, we didn't observe such runaway processes before moving to shellbox's BoxedCommand
[17:27:05] <_joe_> I fear there's something broken there
[17:27:20] <_joe_> anyways, for now I think bvibber's patch + killing the stray processes should be enough
[17:27:59] *nod*
[17:27:59] <_joe_> now sorry but I'm going to touch grass, in 3 minutes it's 15 hours I'm here
[17:28:02] :D
[17:28:04] enjoy
[17:28:06] <_joe_> err 13
[17:28:09] <_joe_> but still :D
[17:29:04] stop typing and go away :D
[17:29:12] all killed
[17:29:22] \o/ thx
[17:29:47] sorry, I am a bit late to this discussion and I am not fully aware of the context
[17:29:54] but last night, I did this https://sal.toolforge.org/log/zEvZTJABhuQtenzvBhf4
[17:29:58] since it was flapping quite a bit
[17:30:11] not sure if it helps in any way but I wanted to point it out since you are talking about all this :)
[17:31:26] Now I can go ahead and run a find for orphaned tmp files
[17:32:21] sukhe: yeah likely the same problem :D
[17:32:24] thx!
[17:32:33] ok :)
[17:34:55] I've got to go, but there shouldn't be any more processes that have been running for more than 24 hours
[17:35:28] There's probably a more thorough cleanup to do: checking for orphaned shellbox processes and killing them
[17:35:36] Then clearing the orphaned tmp files
[17:35:54] I've gotta run now
[17:38:18] thanks claime!
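
(A hypothetical sketch of the cleanup pass claime describes at the end, combining the PPID-1 check with the tmp-file sweep; the one-day cutoff comes from the discussion, while the find path and name pattern are placeholders, not taken from the log:)

    #!/bin/bash
    # 1) Kill ffmpeg / ffmpeg-encode.sh processes reparented to PID 1 and
    #    older than one day (86400 s).
    ps -e -o pid=,ppid=,etimes=,comm= \
      | awk '$2 == 1 && $3 > 86400 && $4 ~ /^ffmpeg/ {print $1}' \
      | xargs -r kill
    # 2) Sweep leftover transcode tmp files; path and pattern below are
    #    placeholders only.
    find /tmp -maxdepth 1 -name 'transcode_*' -mtime +1 -delete
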