[08:43:35] @paladox https://github.com/miraheze/mw-config/pull/5356
[08:50:47] @paladox also https://github.com/miraheze/mw-config/pull/5245/files
[09:19:57] Also https://github.com/miraheze/mw-config/pull/5357
[15:54:10] @paladox please add #announcements for the 3rd in case of user VE errors
[16:02:41] @paladox preWarm jobs are taking up a lot
[16:03:15] do note that job runs on every edit
[16:03:20] on every wiki with VE
[16:03:25] i know
[16:03:26] eventually every wiki
[16:03:46] @paladox we need to be able to run it on every wiki after every edit
[16:03:49] like asap
[16:07:34] @paladox how is this going to work on all wikis?
[16:07:40] by whenever 1.42 comes out
[16:08:03] given Parsoid will be a beta feature then
[16:08:20] by 1.41, we need it as a developer tool on all wikis for reads
[16:09:02] we can't even warm the cache
[16:09:32] well then we can't have it, simple as that, with warmCache
[16:09:40] i'm gonna switch warm cache off
[16:10:46] done
[16:14:14] we can't deploy 1.41 globally without warmCache enabled
[16:14:26] we can't have warmCache
[16:14:58] It isn't optional
[16:14:58] works on test131 so...
[16:15:16] There is zero chance we are going to be able to handle performance as parsoid rolls out
[16:16:06] we saw how slow things got with SCSVG on a cold cache
[16:16:23] well we either have a big jobqueue and deal with it, or we can't and can't upgrade.
[16:16:41] parsercache has its own group so its own runner
[16:17:13] we'll need more runners than 1
[16:18:21] we don't have the resources to do that. mwtask only has 4 cores and it's already quite congested. We have a lot of runners with different groups already. I can increase to 2 but we can't have many.
[16:22:28] @paladox we likely need 4 based on my estimations. We can try with 2 and see if it keeps climbing.
[16:22:50] ok
[16:22:55] but can we not increase cores/memory or add another mwtask?
[16:23:09] 4 cores to handle parsing for 7k wikis is not a lot
[16:33:03] @paladox ye this is a disaster
[16:33:28] Can we borrow some resources from Cloud12? It's only running at 13% per Grafana
[16:33:40] it needs to process about an extra 350 jobs a minute
[16:33:50] we need more capacity for intensive jobs
[16:33:56] What's going on with VE?..
[16:34:28] @pixldev we've deployed a change to how it accesses the parser
[16:34:32] has something broken?
[16:34:57] Ah, unforeseen bug?
[16:35:00] 🐛
[16:35:07] @pixldev no, it is planned
[16:35:14] Ah
[16:35:17] it is preparation for 1.41
[16:35:30] Ah
[16:35:43] VE no longer uses the REST interface because parsoid is becoming part of core
[16:35:52] I would have thought 1.41 was still a while out
[16:35:58] no, fairly soon
[16:36:08] Parsoid is a new way of parsing content
[16:36:25] With 1.41, you'll have a new tool on your wikis to see how content will be parsed in future.
[16:36:38] Interesting
[16:36:45] With 1.42, it will be a beta option and potentially on by default on some wikis
[16:36:59] and from 1.43 onwards, slowly rolling out
[16:37:17] currently we are struggling to warm the cache
[16:37:35] @paladox can we leave the jobs a bit and see how many are from reads and how many are from edits
[16:37:49] the backlog might go insane but every read and every edit is generating jobs at the moment
[16:37:54] it'll be every edit soon
[16:37:58] as the cache will be warm
[16:38:00] Ah thanks. I may not understand everything fully but appreciate it regardless
[16:38:45] @paladox I say we wait a few hours and see how bad it gets.
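A minimal sketch of the arithmetic behind "about an extra 350 jobs a minute" needing "likely 4" runners, as estimated above. The per-job parse time is an assumed, illustrative value, not a measured figure.

```python
import math

# Rough capacity estimate behind the "we likely need 4" runner figure above.
# The ~350 jobs/minute rate comes from the discussion; the average Parsoid
# parse time per job is an assumption for illustration only.
incoming_jobs_per_minute = 350
avg_seconds_per_job = 0.6                 # assumed, not measured

jobs_per_runner_per_minute = 60 / avg_seconds_per_job            # ~100 jobs/min
runners_needed = math.ceil(incoming_jobs_per_minute / jobs_per_runner_per_minute)

print(f"one runner clears ~{jobs_per_runner_per_minute:.0f} jobs/min")
print(f"runners needed to keep up: {runners_needed}")            # 4 under these assumptions
```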
[16:38:51] ok
[16:38:55] or at least leave it on until you sleep tonight
[16:39:10] but I do think we'll need more capacity eventually @paladox
[16:39:26] is there any scope for more cores / memory / another task instance
[16:40:16] We've oversubscribed the cores already. Whilst we have the memory, we don't have the disk space
[16:40:19] @paladox unless I say otherwise, please leave the job running until you sleep tonight, but please make sure the last thing you do before you don't have access for the night is disabling it.
[16:40:35] disabling?
[16:40:57] yes, do not leave preWarm on without a sysadmin with access available within 1-2 hours.
[16:41:09] until we're happy with performance
[16:41:18] I want to see if it is just from it being cold
[16:41:25] or genuinely this terrible
[16:41:29] ok
[16:41:50] @paladox can we get more disks?
[16:42:01] do we have the space to buy more? how much would they cost?
[16:42:57] i don't know right now. Priority is fixing the cloud14 disk as it's slow. @owenrb ordered the disks but i don't know what's happening with them rn.
[16:43:05] @orduin
[16:47:06] @paladox can you manually run jobs on bluepages wiki
[16:47:21] of course the wiki with 414k pages is the largest contributor to the backlog
[16:48:09] > Fatal error: Allowed memory size of 157286400 bytes exhausted (tried to allocate 20480 bytes) in /srv/mediawiki/w/vendor/wikimedia/parsoid/src/Wt2Html/Grammar.php on line 7000
[16:48:10] hmm
[16:49:24] @paladox if that's on bluepages, just run it with infinite memory
[16:51:37] _is pretty sure bluepageswiki is the issue_
[16:52:39] _is also fairly confident it won't be anywhere near as bad as it looks when the cache isn't completely cold_
[16:55:38] Disks were ordered but no plans to go down - especially now with the plans for Miraheze Limited to cease with Miraheze. Last I heard plans were to look at moving to cloud infrastructure but that was back in June
[16:58:01] I thought the outcome was that Cloud was too expensive?
[17:03:47] @paladox I am seeing it settle a bit
[17:03:56] so that makes me think it's not too bad
[17:04:07] but we may need to think about memory limits
[17:04:22] let's keep monitoring as it tackles the initial backlog
[17:09:07] Unfortunately, MediaWiki is being a bit shitty and changing parser, which means every single page on every single wiki has to be reparsed.
[17:11:02] @paladox how big is the parser cache table on db131
[17:11:16] that's not where it's being saved
[17:11:28] it's being saved i think per db?
[17:11:32] it's using db-replicated
[17:13:10] @paladox the data is still useful
[17:13:43] [1/2] > root@db131:/srv/mariadb# du -sh parsercache
[17:13:43] [2/2] > 22G parsercache
[17:13:45] it's 22g???
[17:14:05] wow
[17:14:09] that's big
[17:14:14] we've got 31G left on db131
[17:14:28] we might want to decrease retention
[17:15:14] as the parsoid parser cache will be a similar size
[17:15:50] jobs aren't uncontrollable
[17:16:23] rainworldwiki is the next possible problem @paladox
[17:21:07] @paladox i'm pretty sure this will work
[17:21:21] if we get control of the wikis that are huge
[17:22:11] i will tell you if we can leave it in place by 9pm
[17:22:50] but manual runs of the wikis on https://grafana.miraheze.org/d/GtxbP1Xnk/mediawiki?orgId=1&from=now-1h&to=now&viewPanel=59 would be good @paladox
[17:22:57] i am
[17:23:00] only on a few
[17:23:11] ok
[17:39:52] Any tips on what I should start to learn to be a MediaWiki Engineer?
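A sketch of the kind of manual run discussed above for a single huge wiki, lifting the PHP memory limit so the prewarm jobs don't die with fatals like the Grammar.php one quoted. The `--wiki` selector, the maintenance-script path and the job count are assumptions about the local setup, not a confirmed command.

```python
import subprocess

# Hypothetical wrapper for manually draining parsoidCachePrewarm jobs on one
# large wiki with the PHP memory limit lifted ("run it with infinite memory").
# The --wiki selector and paths are assumptions about this farm's layout.
def drain_prewarm(dbname: str, max_jobs: int = 5000) -> int:
    cmd = [
        "php", "-d", "memory_limit=-1",          # no PHP memory cap for huge pages
        "/srv/mediawiki/w/maintenance/runJobs.php",
        "--wiki", dbname,
        "--type", "parsoidCachePrewarm",
        "--maxjobs", str(max_jobs),
    ]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    drain_prewarm("bluepageswiki")
```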
[17:41:11] @songngu.xyz how to fight the job queue
[17:41:33] @paladox newusopediawiki is a problem
[17:42:45] @songngu.xyz the job queue is genuinely an awful pile of
[17:43:28] like for the jobrunner?
[17:44:30] yes
[17:44:38] it is horrid software
[17:44:47] that @paladox is currently fighting to control
[17:44:51] and it's semi working
[17:45:03] ok...
[17:45:52] well I know the basic things and stuff, just want to go a little bit further down the rabbit hole like that
[17:47:16] i get "Redis server error: socket error on read socket" quite frequently and i don't know how to fix it
[17:47:32] it means redis is starting to hate you
[17:47:38] and you need to run them a lot faster
[17:48:00] get the script running in a loop if you can
[17:48:06] so it doesn't get killed
[17:48:17] <:skull_c:1137720188607926322>
[17:48:38] at least I will only learn to be a MWE-
[17:49:27] @paladox pokeclickerwiki and newusopediawiki are the biggest
[17:49:32] tackle them first
[17:50:39] i don't see high memory on jobchron, which is good
[17:52:25] @paladox whatever you just did, keep doing it
[17:52:37] levels are going down
[17:52:38] https://github.com/miraheze/puppet/compare/9c8db35accd9...a146170ee0e2
[17:53:13] @paladox that may have worked
[17:53:19] i'm nipping to the shop
[17:53:23] hopefully it lasts
[17:54:03] i'm off to dinner
[17:54:38] Ok
[17:55:03] If it carries on trending down, we can leave it overnight
[17:55:24] As long as it's mostly down or stable, it's fine
[18:08:20] it is much flatter
[18:08:26] i will check again at 7
[18:13:08] @Stewards @originalauthority says avid can be deleted as they moved
[18:14:05] I would recommend it; there's no point in warming the cache for that wiki particularly since it's moved to WikiTide; there have been no recent edits.
[18:14:47] @originalauthority I mean it should have community consent
[18:15:25] We have to warm the cache for all wikis
[18:15:32] We can't close them for that sole reason
[18:18:12] Either community consent or actual dormancy enforcement
[18:34:51] We're hosting them on WikiForge, in fact.
[18:35:38] same difference
[18:38:41] Only mentioning it because they generated enough traffic to require their own personal paid server. Deletion could be a substantial saving for MH. 🙂
[18:40:39] @notaracham we shouldn't charge some wikis
[18:40:47] Certain ones have half a million pages
[18:41:03] looks like they're down anyway btw
[18:41:17] 90% automated
[18:41:43] Up on my side. 🤷
[18:42:01] https://www.avid.wiki/Special:RecentChanges works for you?
[18:42:13] Yep
[18:42:35] interesting
[18:42:48] Oh for sure, that one that uses MH as a backend for their game is a fascinating use case
[18:43:03] And for me
[18:43:55] works now
[18:44:08] and down again
[18:44:16] that's fine
[18:44:47] same here
[18:47:59] @paladox i think we should turn it off for bluepageswiki unless you can control it
[18:48:30] Pretty much happening for a lot of big wikis
[18:48:48] Don't think switching it off for bluepages will fix it
[18:48:53] the other big wikis are nowhere near as big @paladox
[18:49:13] I mean
[18:50:47] @paladox yes but in terms of pages
[18:50:55] bluepages + att are the biggest
[18:51:01] i don't think we can handle warm-up
[18:51:01] the number of pages is more important
[18:51:09] we'll have to disable
[18:51:16] @paladox We can, just not all at once
[18:51:20] it keeps getting killed for bluepages
[18:51:24] can we try turning bluepages + att
[18:51:26] off
[18:51:29] ok
[18:53:17] @paladox are you doing a patch for those wikis or me?
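The "get the script running in a loop so it doesn't get killed" advice above could look roughly like the sketch below. The command layout and `--wiki` selector are assumptions about this farm; the bound of about 20 attempts echoes the figure suggested later in the log.

```python
import subprocess
import time

# Sketch of the "run it in a loop so it doesn't get killed" advice: keep
# re-launching runJobs.php when it exits non-zero (e.g. after a
# "Redis server error: socket error on read" failure), up to a small bound.
# The command layout and --wiki selector are assumptions about this farm.
CMD = [
    "php", "/srv/mediawiki/w/maintenance/runJobs.php",
    "--wiki", "newusopediawiki",
    "--type", "parsoidCachePrewarm",
]

for attempt in range(1, 21):                 # ~20 attempts, per the later suggestion
    result = subprocess.run(CMD)
    if result.returncode == 0:
        print(f"finished cleanly on attempt {attempt}")
        break
    print(f"attempt {attempt} died (exit {result.returncode}), retrying in 5s")
    time.sleep(5)
```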
[18:53:57] me
[18:54:07] ok
[18:54:16] We're still a long way off from formally considering cloud infra. If you could get the disks down to the DC (even if you have to mail them, and we get remote hands to install them), that would be great.
[18:54:20] let me know once you're done @paladox
[18:55:32] @paladox that patch looks like a no-op, $disableWarmup isn't used
[18:57:01] done
[18:57:59] let's see then
[18:58:06] huge numbers are expected
[18:58:11] we just need them controllable
[18:58:30] @paladox can you do your best to run the existing jobs for ATT/bluepages
[18:58:38] yes
[19:03:21] it's looking fairly stable at the moment
[19:10:37] @paladox it's dropping
[19:10:46] slowly but it is
[19:11:32] I'll try and get them sent down tomorrow then
[19:11:54] 👍 Thanks!
[19:12:26] @paladox could take a day to clear this backlog
[19:12:30] it is its own queue
[19:12:42] but as long as it continues going down, we are fine to leave it
[19:12:56] I say apart from clearing bluepages + att, we leave it
[19:12:59] it should work
[19:14:16] unless @orduin has any thoughts
[19:15:00] Not really, I've only been loosely keeping an eye on this for about an hour
[19:15:27] @paladox is ATT still clearing?
[19:15:50] mwtask is slammed
[19:15:57] we've overloaded it
[19:16:12] @paladox that's fine
[19:16:15] it'll settle
[19:16:26] afaics, it is having no user-facing impact
[19:16:33] task is built to be slammed @paladox
[19:17:04] Just not going to do any renames then, just to be safe :)
[19:18:44] @orduin it should be fine
[19:18:49] the main queues look ok
[19:19:27] @paladox it's been trending down for like 15 minutes \o/
[19:19:45] I am happy, unless anything changes, to leave it overnight
[19:23:26] ok
[19:31:52] idk about you guys, but logouts seem to be constant today
[19:35:07] Have you been getting slow load times or timeouts?
[19:36:02] I had one logout today.
[19:36:20] Pages seem to be loading ok for me but recent changes is a bit slow to update.
[19:42:15] loading is fine
[19:44:12] I'm pretty sure the logouts are to do with Varnish or something else, nothing to do with today's events -- since users have been reporting them for the last week or two.
[19:46:34] it happens in periods
[19:46:45] RC being slow to show changes might be our fault
[19:47:04] 3 weeks it's alright and then it happens 3 days in a row
[19:52:58] @paladox tuscriaturaswiki might need a helping hand
[19:53:10] @jph2 which wiki is slow for RC
[19:53:13] i can't at the moment
[19:53:21] @orduin
[19:53:22] mwtask is too loaded
[19:53:42] @paladox are you still running bluepages wiki?
[19:53:48] yeh
[19:54:21] @paladox you can kill that, it's no longer huge
[19:55:00] ATT is now 5th
[19:55:07] bluepages is not in the top 10 anymore
[20:00:00] @paladox have you done that
[20:00:46] yes
[20:00:51] bluepages is empty
[20:01:23] @paladox is task still overloaded?
[20:01:49] yes, load is 8+ and the CPU is full
[20:02:20] okay
[20:02:43] @paladox that is not good
[20:03:07] jobs are stable but not dropping
[20:03:11] which is fine
[20:03:18] but load needs to go down
[20:04:35] And we're swapping
[20:05:04] @paladox can we do anything to reduce load
[20:07:16] ok this is just not sustainable
[20:07:28] we're going to have to think about just disabling.
[20:07:30] @paladox you should be fine to drop the runners to 3
[20:07:37] the rate is stable
[20:07:37] that won't fix it
[20:07:58] @paladox what is causing the issue?
[20:08:03] memory/procs?
[20:08:30] the fact there's like 20+ jobs and the fact that it just keeps building up and up.
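The call being made here (and restated just below) is that a big backlog is acceptable as long as it is stable or decreasing. A minimal sketch of that decision rule over sampled queue sizes; the sample figures are invented for illustration and would really come from the Grafana/Prometheus job-queue panel.

```python
# Sketch of the "stable or decreasing is fine" rule used to decide whether the
# prewarm backlog can be left overnight. The samples are made up; in practice
# they would come from the job-queue dashboard.
def backlog_ok(samples: list[int], tolerance: float = 0.05) -> bool:
    """True if the queue size is flat (within tolerance) or trending down."""
    if len(samples) < 2:
        return True
    growth = (samples[-1] - samples[0]) / max(samples[0], 1)
    return growth <= tolerance

recent = [62_000, 61_500, 60_800, 60_900, 59_700]   # e.g. one sample per 15 min
print("leave it overnight" if backlog_ok(recent) else "intervene / disable")
```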
[20:09:02] @paladox it is not building, it's holding stable
[20:09:20] a high number of jobs is not a risk as long as it's stable or decreasing
[20:09:41] which has been true for the last 70 minutes
[20:10:12] at this rate i'm disabling, it'll be for all wikis.
[20:10:51] @paladox there is no negative impact to users and afaics, it is not a growing backlog
[20:10:59] there is no reason to disable
[20:11:10] reparsing pages will generate a lot of jobs
[20:11:10] takes note
[20:11:11] ok
[20:11:48] @paladox if jobs are running, even if it's very slow, please leave it.
[20:11:54] ok
[20:11:57] I would rather see load drop
[20:12:22] but disabling it everywhere will only make it more urgent down the line
[20:12:33] and make it so we do have a user impact
[20:12:56] we have millions of pages, it's going to take a while to finish
[20:13:09] Is the old MW parser just not working anymore? I'm assuming there's some reason for having to reparse all pages ASAP. Or that's my understanding at least
[20:13:29] @pixldev it isn't ASAP, we have about 6 months
[20:13:35] the old parser is going eventually
[20:13:53] but I'd like it to be done sooner so we can enable the test tools with 1.41
[20:14:01] Ah. The current discussion gave me the wrong impression, my apologies
[20:14:07] rather than see lots of issues with 1.42
[20:14:36] it should improve performance for visual editor / discussion tools / flow though
[20:14:50] because they use parsoid, so its cache being warm is good
[20:16:02] from what i see the issues rn are from the server load of all the jobs reparsing, if it's not urgent couldn't they simply be spaced out more over time?
[20:16:28] May I inquire as to what the cache being warm refers to?
[20:17:35] mwtask having high load isn't a worry and it's not as simple as being done later. It is being done as pages are read
[20:17:48] cache being warm means it is being hit often
[20:17:48] oh. oh.
[20:17:56] a cold cache is one with nothing in it
[20:18:30] oh, so in a warm cache there's something actually cached
[20:18:47] basically for every page that's been read since we enabled cache warming, it has generated an entry in the cache
[20:19:01] eventually, that will only happen on edit + every 10 days
[20:20:08] warming it up reparses each page so it doesn't have to parse when it loads? Is that correct? (probably not, knowing my understanding)
[20:20:20] yes
[20:20:37] once it has been cached once, we will hold it in the cache for between 10 and 11 days
[20:21:31] Ah
[20:21:57] i learn so much lurking here
[20:23:45] that's fine
[20:23:56] i mean advising you is better than staring at graphs
[20:25:04] yea
[20:25:45] yeesh 141 gonna need a vacation after this..
[20:25:59] mwtask is built for being destroyed
[20:26:07] it doesn't impact much for users
[20:26:18] (it will make some background stuff less performant)
[20:26:29] 141? we still have 141?
[20:26:39] mwtask141
[20:26:48] 141 is a cursed number
[20:26:54] @theoneandonlylegroom mwtask141 is the active maintenance server
[20:27:08] Is it?
[20:27:51] Well, looking at the 163% CPU system load, it may very well be
[20:27:56] bad things happened in November 2022 w/ db141 ...
[20:28:16] oh?
[20:28:20] true horror stories past Halloween lol
[20:28:30] trauma
[20:28:39] it was 330% at one point today
[20:28:46] 💀
[20:28:52] it looks like someone has started doing stuff on xedwiki
[20:29:22] Huh?
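A toy model of the cache-warming behaviour explained above: a page view whose Parsoid output is not cached queues a prewarm parse, and a stored entry is then held for roughly 10 to 11 days. This is purely an illustration of the idea, not the actual MediaWiki code.

```python
import random
import time

# Toy model of the warm/cold cache behaviour: a view of an uncached page
# enqueues a prewarm job, and a stored entry is kept for ~10-11 days.
cache: dict[str, tuple[str, float]] = {}     # title -> (html, expiry timestamp)
prewarm_queue: list[str] = []

def ttl_seconds() -> float:
    return random.uniform(10, 11) * 86400    # the 10-11 day retention window

def parsoid_parse(title: str) -> str:        # stand-in for the expensive parse
    return f"<p>{title} (Parsoid)</p>"

def view_page(title: str) -> str:
    entry = cache.get(title)
    if entry and entry[1] > time.time():     # warm: cached and not expired
        return entry[0]
    prewarm_queue.append(title)              # cold: enqueue a prewarm job
    return f"<p>{title} (old parser)</p>"    # the reader still gets a page

def run_prewarm_job(title: str) -> None:
    cache[title] = (parsoid_parse(title), time.time() + ttl_seconds())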
[20:29:22] which is causing that wiki to be at the top of the list
[20:29:26] but load is going down
[20:29:40] I am watching https://grafana.miraheze.org/d/GtxbP1Xnk/mediawiki?orgId=1&from=now-1h&to=now @pixldev
[20:29:54] that shows you which wiki+job combination is the highest
[20:30:03] yeah it went from 200-something to 160
[20:30:24] ah thanks
[20:31:09] was looking at https://grafana.miraheze.org/d/W9MIkA7iz/miraheze-cluster?orgId=1&var-job=node&var-node=mwtask141.miraheze.org&var-port=9100&from=1699097065008&to=1699140265008
[20:31:35] https://grafana.miraheze.org/d/GtxbP1Xnk/mediawiki?orgId=1&from=now-1h&to=now&viewPanel=48 is all the jobs currently waiting
[20:31:39] it is fairly stable
[20:31:46] which makes me happy
[20:32:09] yay
[20:32:25] grafana is way more complex than i thought lmao
[20:32:46] seems to be a theme when hosting a site as large as Miraheze
[20:33:08] it is complex ye
[20:33:23] there's also a lot of patience involved with big jobs like this
[20:33:46] as long as there's no negative impact and it's running, it's just got to be left alone
[20:33:53] we could help it a bit along
[20:33:54] i might look into setting up grafana on a vps just to get a feel
[20:34:08] he's fine
[20:34:25] it'll stay high for a while
[20:34:38] it should really be a trend alert
[20:35:02] @paladox if there's the capacity, help xedwiki along please
[20:35:15] if not, we will wait
[20:35:39] Hopefully SRE keeps popcorn on hand
[20:35:53] i have a sweet bowl
[20:37:10] Nom nom <:nomChocoStrawberry:938647184973365318>
[20:48:03] [1/3] sorry. was afk
[20:48:04] [2/3] vanatas.miraheze.org
[20:48:04] [3/3] takes a few seconds
[20:48:55] also been meaning to ask for a while, what do the log messages in SRE (https://discord.com/channels/407504499280707585/808001911868489748/1170463831521235014) saying stuff like "deploy config true to all" actually mean? Is that from puppet?
[20:49:22] @pixldev I wrote that script!
[20:49:31] config: true means it is deploying config
[20:49:51] it's a json copy of the parameters passed to the deploy tool
[20:50:01] config is deployed by puppet automatically
[20:50:12] which is the deploys when [@] shows
[20:50:23] instead of [@]
[20:51:04] world & l10n means that the actual MediaWiki code and the localisation cache were updated
[20:51:39] although I have zero idea why --l10n was passed for that deploy by @paladox
[20:53:02] @paladox: add xedwiki to the no list or do something with it please
[20:54:14] ok
[20:58:55] ah, so when someone manually pushes a change outside of puppet
[20:59:18] i kinda get it ty 👍
[20:59:48] yes you can run it manually if you like
[21:00:33] @orduin will you be around over the evening?
[21:00:49] What's the script written in?
[21:01:15] @pixldev Python
[21:01:23] Best language
[21:01:25] Somewhat, I'm not keeping that close an eye on things right now though
[21:01:35] i def don't say that cause it's the only one i'm fluent in
[21:02:16] would it be possible, if any wiki gets over 40kB on https://grafana.miraheze.org/d/GtxbP1Xnk/mediawiki?orgId=1&from=now-1h&to=now, to manually run jobs or blacklist it from generating more
[21:02:38] or seems to be 150% higher than the one below it
[21:09:29] @paladox I am not seeing xedwiki go down, I think you'll have to manually run jobs on it
[21:09:46] i have
[21:10:02] @paladox it is not moving quickly then
[21:10:10] has the script crashed?
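The rule of thumb proposed above (flag any wiki whose prewarm backlog exceeds roughly 40 kB on the Grafana panel, or that sits roughly 150% of the next one down) could be expressed like this. The backlog figures below are invented for illustration and would really come from the dashboard's data source; reading "150% higher" as a 1.5x ratio is an interpretation.

```python
# Sketch of the proposed rule of thumb: flag a wiki for a manual run (or a
# temporary blacklist) if its prewarm backlog exceeds ~40 kB on the panel, or
# if it is ~1.5x the next-largest wiki. All figures here are illustrative.
def wikis_needing_attention(backlogs: dict[str, float],
                            absolute_kb: float = 40.0,
                            ratio: float = 1.5) -> list[str]:
    ranked = sorted(backlogs.items(), key=lambda kv: kv[1], reverse=True)
    flagged = []
    for i, (wiki, size) in enumerate(ranked):
        next_size = ranked[i + 1][1] if i + 1 < len(ranked) else 0.0
        if size > absolute_kb or (next_size and size > ratio * next_size):
            flagged.append(wiki)
    return flagged

sample = {"xedwiki": 55.0, "pokeclickerwiki": 30.0, "attwiki": 29.0}  # kB, made up
print(wikis_needing_attention(sample))   # ['xedwiki']
```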
[21:10:18] Not a whole lot i can do about that
[21:10:23] ok
[21:10:40] yeh it crashed
[21:10:42] with redis
[21:10:56] @paladox it might be worth putting it in a loop
[21:11:44] for like 20 attempts or something
[21:14:07] @paladox okay xedwiki is no longer in the top 10
[21:14:13] ok
[21:14:25] now the top 5 are all about 30kB
[21:19:03] @paladox i would suggest a single foreachwiki runJobs.php --type=parsoidCachePrewarm
[21:19:10] it might be enough
[21:19:19] redis will kill it a few times on some wikis
[21:19:31] but it should blast some of the backlogs
[21:20:01] i have runJobs running already
[21:20:10] @paladox for what wiki(s)?
[21:20:15] all
[21:20:35] @paladox how many open? just the one?
[21:20:43] 2
[21:20:52] ok
[21:21:01] let's see how it does overnight then
[21:21:11] @orduin can monitor if they crash once you sleep
[21:21:31] i'm gonna head off to watch some tv
[21:21:37] ok
[21:22:05] i'm going to get a shower and then sleep
[21:26:09] I strongly suggest when the disks arrive that @orduin makes another mwtask
[22:01:00] ManageWiki is winding me up
[22:12:00] Might have to turn off platproject2wiki, it's hanging out at around 60k even with an extra runJobs
[22:21:52] Do it then
[22:22:01] I am off to bed soon
[22:22:40] Seems to have dropped down
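The `foreachwiki runJobs.php --type=parsoidCachePrewarm` suggestion above is essentially one sweep of the job type over every wiki. A rough Python equivalent is sketched below; the wiki-list path is an assumption (the real foreachwiki wrapper reads the farm's own database list), and failures are skipped because, as noted, Redis will kill it on a few wikis.

```python
import json
import subprocess

# Rough equivalent of "foreachwiki runJobs.php --type=parsoidCachePrewarm":
# run the prewarm job type once for every wiki, in sequence, continuing past
# failures. The wiki-list path below is an assumed location, not confirmed.
WIKI_LIST = "/srv/mediawiki/cache/databases.json"   # assumed location

with open(WIKI_LIST) as f:
    wikis = list(json.load(f))                      # dbnames, however the list is keyed

for dbname in wikis:
    result = subprocess.run([
        "php", "/srv/mediawiki/w/maintenance/runJobs.php",
        "--wiki", dbname,
        "--type", "parsoidCachePrewarm",
    ])
    if result.returncode != 0:
        print(f"{dbname}: runJobs exited {result.returncode}, continuing")
```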