[01:28:41] _joe_: we might be in a significantly better position than we suspected based on perceived/documented operations. https://phabricator.wikimedia.org/T314240 regarding job runtime etc
[01:29:13] I'm prepared to hear I've missed something or measured incorrectly, I only did a basic check. We can perhaps check it empirically on an appserver in some way.
[06:32:17] <_joe_> Krinkle: uhmm those data seem definitely off to me
[06:32:37] <_joe_> unless they're not counting time spent by ffmpeg, I regularly see processes lasting hours
[07:26:14] _joe_: these are http latencies as seen from nodejs changeprop afaik. There's a ton of indirection both in the statsd/prom mapping and inside the code from hyperswitch/changeprop/cp-jobqueue which makes it hard to tell what is being measured and what metric name it ends up at. So.. I can't actually tell with certainty from just reading the code
[07:26:53] And it seems the mw application servers dashboard (non-RED) only has generic fpm metrics and gaps for http latencies
[07:27:15] Not sure what it would take to get that enabled and then presumably also visible on the RED dash.
[07:29:41] <_joe_> https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?orgId=1&viewPanel=55 seems to say that 0.1% of requests take more than 17 hours, I might need to check that, but we can get better numbers from the JobQueue.log file on mwlog
[07:29:51] <_joe_> also, good night!