[13:35:42] Krinkle: hey, since March 30th, the refreshLinks job on Miraheze has had awful performance
[13:35:52] and is frequently hitting memory limits
[13:35:59] causing huge backlogs
[13:38:38] RhinosF1: ok, what version of MW? I assume you suspect a software bug, why?
[13:40:37] Krinkle: 1.37.2, but it would have been .1 when it started
[13:40:51] and I haven't been able to get any hint at a cause
[13:41:14] ok. were there any deployments or config changes on the day of? are there server graphs?
[13:42:38] Krinkle: yes, various extension updates. and yes, grafana.miraheze.org under 'MediaWiki' and 'Miraheze Cluster'
[13:42:51] Fatal error: Allowed memory size of 157286400 bytes exhausted (tried to allocate 20480 bytes) in /srv/mediawiki/w/includes/libs/objectcache/MemcachedPeclBagOStuff.php on line 341
[13:42:58] is what was happening a lot
[13:43:15] so I let them run on mwtask* with more memory rather than mw*
[13:43:21] but it's far too slow
[13:43:42] like 4 jobs a minute was the rate last night
[13:50:03] RhinosF1: I'm not sure. could be anything I suppose. If it's a particular page or template causing a bug, it should eventually stop retrying the same thing and move on, right? What kind of jobrunner do you use? Does it correctly consider HTTP 500 as failure and not retry indefinitely?
[13:50:23] Do the log messages indicate it's a different job each time, or the same type + same parameters?
[13:50:40] the job description should give you details of what job exactly is failing, e.g. for which template/page.
[13:51:11] it may not be under the fatal error message in Logstash but under the JobQueue/JobExecutor/JobRunner channels etc. best to look up by the reqId of one of the memory errors and then see all messages (don't filter by channel).
[13:53:27] Krinkle: they've always been constantnoblewiki
[13:53:47] https://www.irccloud.com/pastebin/tjVO5pdO/
[13:54:13] and it's the old Redis jobrunner
[13:59:35] RhinosF1: and same template, or same pages, or same root?
[14:07:08] Krinkle: found another example, same template, but let me pull some logs
[14:12:31] https://spiritsofargonothwiki.miraheze.org/w/index.php?title=Template:Npc_infobox/drop&action=history has been thrown up too
[14:12:45] that has a lot of slow parse warnings too
[14:13:03] for various pages with its reqId, Krinkle
[14:16:44] MediaWiki\Storage\DerivedPageDataUpdater::prepareContent: using stashed edit output... shows for 2 of them too
[14:18:24] Krinkle: ^
[14:18:36] that's on a few reqIds as the first entry
[14:18:49] htmlCacheUpdate timing out reaching the URL seen twice too
[14:23:31] RhinosF1: I'm afraid I can't help you further. Note that on its own, out-of-memory errors aren't reflective of a bug per se. It's well known that we allow creation of pages that expand via templates to something larger than we support. In theory we should limit that in a predictable way and serve a user warning at save time. Instead they get caught by the memory limit in some cases. ref T254522, T75960
[14:23:33] T254522: Set appropriate wikitext limits for Parsoid to ensure it doesn't OOM - https://phabricator.wikimedia.org/T254522
[14:23:33] T75960: Limit the number of media files used on a wiki page - https://phabricator.wikimedia.org/T75960
[14:25:31] Krinkle: so what do we do about the job queue being crappy due to the backlog from it
[14:25:54] like surely it should not be stuck constantly crashing on the same few thousand jobs
[14:26:50] RhinosF1: are they? are they the same pagetitle/rootSig?
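The limits alluded to at 14:23:31 are the parser's expansion caps, which MediaWiki exposes as ordinary settings. A minimal LocalSettings.php sketch follows; the values are illustrative (roughly the stock defaults), not Miraheze's or WMF's actual configuration. Lowering them is the "predictable" failure mode: the parser stops expanding and flags the page at parse time instead of hitting the PHP memory_limit mid-job.

```php
// LocalSettings.php sketch: parser limits that bound template expansion.
// Values here are illustrative only, not a recommendation for Miraheze.
$wgMaxArticleSize   = 2048;    // max page size in KiB; also caps the post-expand include size
$wgMaxPPNodeCount   = 1000000; // preprocessor node count limit
$wgMaxPPExpandDepth = 100;     // preprocessor expansion depth
$wgMaxTemplateDepth = 100;     // maximum template nesting depth
```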
[14:27:07] A job is typically retried at most 3 times
[14:28:19] Krinkle: where would "try 3 times" be set
[14:29:19] I guess https://github.com/miraheze/puppet/blob/master/modules/mediawiki/templates/jobrunner.json.erb#L25 but https://github.com/miraheze/puppet/blob/master/modules/mediawiki/templates/jobrunner.json.erb#L27 means it waits a day between attempts
[14:29:54] no, 1 hour
[14:30:00] i can't do mental maths
[14:30:37] Job::allowRetries(), JobQueue ['maxTries'] = 3
[14:30:52] it's the default, unless the wg job queue config changes it
[14:31:39] right
[14:32:25] Krinkle: I'm pretty sure my answer ends up at it's not fit for purpose and we need to give it more resources
[14:33:17] RhinosF1: I'd recommend looking at the wiki pages that fail to parse (assuming it fails during parsing) and seeing whether they are reasonable or not.
[14:33:29] e.g. not a user sandbox with a hello-world template recursively repeated 3000 times for fun
[14:33:36] those are fine to fail
[14:33:46] also, the memory failures should themselves not be causing an issue with backlogs
[14:34:06] if the queue is getting backlogged overall, likely something else is the problem and the OOM is either a side-effect or unrelated
[14:34:15] Krinkle: I had a look at some and they didn't seem overly bad. I think that wiki is Page Forms
[14:34:21] the job queue is under-resourced
[14:34:34] it's not uncommon for it to be slow
[14:34:53] it's always been something that falls apart every 6 months
[14:36:17] ok, well, I would pinpoint why it's happening, e.g. a problem with unreasonable specific tasks getting stuck vs normal tasks causing a problem due to concurrency or something else.
[14:36:45] it could be a case of https://phabricator.wikimedia.org/T298708 Krinkle
[14:36:46] we have out-of-memory errors all the time at WMF from refreshLinks. some pages just aren't considered valid and won't render right now.
[14:42:21] yeah
[14:42:31] our servers aren't resourced for extremes Krinkle
[14:42:41] i think I need to go ask for more capacity
[14:43:26] ack, in principle they should fail quickly and move on. it wouldn't take more or less memory. The processes reserve their memory either way, so more OOMs should not even affect concurrency much. (I could be wrong here, not sure how much PHP overcommits in practice etc.)
[14:44:06] Of course, if you find that "normal" pages are no longer rendering, e.g. the same error when browsing normally in a browser, then that could be a memory problem indeed
[14:45:23] sadly, when they are at the limit they take longer to fail
[14:46:29] a thousand jobs taking 15 seconds on average, twice, is still 8.3 hours
[14:54:16] ack
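On the 14:30:37 point: the retry count lives in MediaWiki's job queue configuration rather than in jobrunner.json. JobQueue defaults maxTries to 3, a per-type entry in $wgJobTypeConf can override it, and a Job subclass can refuse retries entirely via Job::allowRetries(). A rough sketch, assuming the Redis-backed queue mentioned at 13:54:13; the host name and values are placeholders, not Miraheze's actual settings.

```php
// LocalSettings.php sketch: override the retry count for one job type.
// Placeholder values; the rest of the queue definition mirrors whatever
// the wiki already uses for its Redis-backed job queue.
$wgJobTypeConf['refreshLinks'] = [
    'class'       => 'JobQueueRedis',
    'redisServer' => 'jobqueue-redis.example.org:6379', // hypothetical host
    'redisConfig' => [],
    'daemonized'  => true, // Redis queue expects an external jobrunner service to pop jobs
    'maxTries'    => 1,    // e.g. fail fast instead of re-running jobs that OOM (default 3)
];
```

Whether lowering maxTries would actually help depends on the question raised at 14:26:50: it does if the backlog is mostly retries of the same failing jobs, but not if jobs are simply arriving faster than the runners can process them.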