[06:47:21] @paladox leave it as is please. We will check again in a few hours and see if it goes down once it's finished dealing with loginwiki.
[06:47:39] I think PoolCounter is making stuff far slower than normal to process
[06:47:42] Potentially
[07:12:39] @paladox redis crashed
[07:33:15] @paladox this is a disaster
[07:49:56] @paladox the prewarm jobs are a very different pattern
[09:37:57] we could try increasing the pool for ArticleView
[09:59:15] i've just seen it do 1k jobs in 15 seconds
[09:59:20] so it's clearly capable
[09:59:26] something is stopping it
[12:24:46] @paladox do you want to go back to 2 workers at 50% of wikis?
[12:25:01] No idea how we'll ramp up to 100%
[12:25:11] Maybe 50->100 was too much at once
[12:25:31] we've tried a lot of things. It seems that because we have 6k+ wikis, we need a lot and a lot of prewarm workers, which we can't do
[12:25:35] I've asked Daniel for his thoughts in #mediawiki
[12:26:28] @paladox ye, the longer this goes on, the more I am thinking our only option is more cores for the task
[12:26:35] By a significant expansion
[12:26:42] we can't
[12:26:48] i'm not sure you are hearing me
[12:26:50] we can't
[12:27:07] @paladox we can't as Miraheze will stop working within the next couple of releases
[12:27:24] There is no MediaWiki without Parsoid in future
[12:27:25] yeh, out of my hands really. We don't have the resources.
[12:27:32] not like i can magic up resources
[12:27:38] Did @owenrb ever post the disks to the DC?
[12:27:48] i'm not sure
[12:27:56] That's why I pinged him
[12:28:02] extra disks won't help with resources.
[12:28:09] we've used up pretty much all cores
[12:28:41] we need the disks to increase DB space
[12:29:09] also need to replace the cloud14 disks as they are slow and kinda faulty
[12:29:20] Ye, we're struggling
[12:30:36] Finance hasn't been updated in months with the transfer to the US being in progress
[12:31:18] But as of then, it left us around £1.5k at the end of the calendar year
[12:40:45] @paladox why do we have 10 vCPUs but 24 cores on cloud10?
[12:41:06] because we are dismantling cloud10
[12:41:17] Ah
[12:41:23] it should have been done a year ago, but disk failures have stopped it for now
[12:41:45] need to buy HDDs now for the other servers to move Swift data off cloud10
[12:41:50] which complicates things further
[12:47:56] Paladox, what would be needed at minimum to solve the issue for the time being?
[12:48:24] @paladox
[12:49:34] Seeing that a job is created per page, i'm going to go with a lot of runners for the prewarm job (something along the lines of maybe 50+?). At the moment it seems stable at 50%. I've increased the cores to 6 and workers to 8.
[12:49:55] i'll wait either hours or a day before expanding further.
[12:50:00] Is the main issue CPU then?
[12:50:45] kinda yeh. RAM will also be an issue soon. Need to keep some free for a DB, but we at least have enough for a 10-15 GB expansion
[12:50:49] Not yet, but can get them sent out today/tomorrow if needed sooner than later
[12:51:05] @owenrb yes
[12:51:06] Yes
[12:51:08] And yes
[12:51:43] Is cloud10 fully decom'd yet? And would that help at all?
[12:52:03] No
[12:52:14] Not yet, we still have db101/112 and we're using the HDD space for Swift
[12:52:16] as we ran out
[12:54:11] Ok
[17:06:16] @paladox you reached z!
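The "pool for ArticleView" mentioned above is controlled by MediaWiki's $wgPoolCounterConf. A minimal sketch of what raising that pool could look like, assuming the stock PoolCounter extension and daemon are in use; the server address and all numbers are illustrative placeholders, not Miraheze's actual values, and the client class name can differ between MediaWiki versions:

```php
// LocalSettings.php fragment (sketch). Values below are illustrative only.
$wgPoolCountClientConf = [
	'servers' => [ '127.0.0.1:7531' ], // hypothetical poolcounterd host:port
	'timeout' => 0.5,                  // seconds to wait for the daemon itself
];

$wgPoolCounterConf['ArticleView'] = [
	'class'    => 'PoolCounter_Client',
	'timeout'  => 15,  // seconds a request may wait for a slot
	'workers'  => 4,   // concurrent parses allowed per pool key ("the pool")
	'maxqueue' => 100, // beyond this many queued requests, fail fast
];
```

Raising 'workers' lets more simultaneous parses proceed instead of queueing, at the cost of more CPU contention, which is exactly the trade-off being weighed in the conversation above.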
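On the prewarm backlog itself: since a job is created per page, one way to keep that flood from sitting on the same single-threaded Redis as everything else is to route the job type to its own queue backend via $wgJobTypeConf. A sketch under the assumptions that the job type is registered as parsoidCachePrewarm, that a Redis-backed queue (JobQueueRedis) is in use, and that a dedicated Redis instance exists at the placeholder address below; this is not necessarily how Miraheze's queues are laid out:

```php
// LocalSettings.php fragment (sketch): give the Parsoid cache prewarm jobs
// their own Redis-backed queue so their backlog can be drained or throttled
// separately from everything else.
$wgJobTypeConf['parsoidCachePrewarm'] = [
	'class'       => 'JobQueueRedis',
	'redisServer' => 'jobredis-prewarm.internal:6379', // hypothetical dedicated instance
	'redisConfig' => [ 'password' => null ],
	'claimTTL'    => 3600, // seconds before an unfinished job can be re-claimed
	'daemonized'  => true, // jobs are popped by a separate job runner service
];
```

Dedicated runner processes could then be pointed at just that job type (for example via runJobs.php --type=parsoidCachePrewarm), which is roughly the "50+ runners" idea floated above.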
[19:51:21] hmm, I'm away for one weekend, and the SSL and import backlogs have doubled in size, if not more than that
[19:59:57] I mean, we get an influx since 1 November
[20:00:02] * are getting
[20:00:33] MacFan4000: I don't think you're allowed to go away
[20:06:37] Hi all!
    This year we're planning to freeze backport and train deployments Fri, Dec 22nd to Mon, Jan 1st.
    November holidays, a Developer Experience Offsite, and the normal December freeze mean there are only 4 more deployment trains this year for MediaWiki:
    13 Nov - wmf/1.42.0-wmf.5 (this week)
    20 Nov - No Train (Thanksgiving on Thu 23 Nov)
    27 Nov - wmf/1.42.0-wmf.7
    03 Dec - No Train (Developer Experience offsite)
    11 Dec - wmf/1.42.0-wmf.9
    18 Dec - wmf/1.42.0-wmf.10
    25 Dec - No Train (Deployment Freeze)
    01 Jan - wmf/1.42.0-wmf.12
    All this information is on the yearly deployment calendar[0].
    We freeze deployments every year for the end-of-December holidays and the busy fundraising season[1][2][3][4][5].
    Thanks!
    Tyler Cipriani (he/him)
    Engineering Manager, Release Engineering
    Wikimedia Foundation
    [0]:
    [1]:
    [2]:
    [3]:
    [4]:
    [5]:
[20:06:47] Wikimedia plan for Xmas ^
[21:06:42] i seem to be experiencing a logout about every third wiki request i process. that seems to be an unusually high frequency.
[21:06:52] That is
[21:53:35] i don't want to jinx it, but the logout problem seems to be okay now 🤞
[22:21:26] I had a feeling that this was an issue with Varnish, but there don't seem to be any changes to the config that would cause this.
[22:22:54] @originalauthority if varnish is interfering with login, something has gone very very wrong
[22:23:02] And we should probably be logging you all out
[23:00:40] @rhinosf1 yeh it's the jobqueue script causing the job backup
[23:01:15] the more keys there are, the slower it gets, and the more the cron opens new copies of the script as there's no lock. But nevertheless, because Redis is single-threaded, it means blocking.
[23:01:25] I swear I hate the job queue
[23:01:33] There must be a better way to do jobs
[23:01:36] Well there is
[23:01:38] Kafka
[23:01:38] i've temporarily disabled the cron
[23:01:43] But it's pants
[23:01:49] Because resources
[23:02:36] maybe you can find out from Wikimedia the very minimum specs for one?
[23:02:53] whether we can do this in one VM.
[23:03:05] and how much disk (disk will be a problem tho)
[23:03:52] Minimum specs for Kafka are still impossible
[23:03:54] I've tried
[23:03:56] Multiple times
[23:04:02] To find a way to make it possible
[23:04:16] It is insanely resource-intensive software
[23:04:36] I've argued with John about it numerous times
[23:05:00] And said we should be doing it, but I have never found a way
[23:14:36] Multi-threaded Redis would be great
[23:25:05] was that issue today?
[23:34:26] yes. it was occurring from around 3 PM US EST until about 5 PM US EST. i typically don't mind being logged out on occasion, but it was killing me!
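On "the cron opens new copies of the script as there's no lock": the usual fix is a non-blocking exclusive lock taken at the top of the script, so an overlapping cron invocation exits immediately instead of piling onto Redis. A minimal PHP sketch, assuming the jobqueue script is PHP; the script name and lock-file path are hypothetical:

```php
<?php
// jobqueue-cron.php (sketch): exit immediately if another copy is already running.
$lockFile = fopen( '/run/lock/jobqueue-cron.lock', 'c' ); // hypothetical path
if ( $lockFile === false || !flock( $lockFile, LOCK_EX | LOCK_NB ) ) {
	// Another invocation holds the lock; bail out instead of stacking up.
	exit( 0 );
}

try {
	// ... the existing queue-inspection / job-dispatch work would go here ...
} finally {
	flock( $lockFile, LOCK_UN );
	fclose( $lockFile );
}
```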
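Related to "the more keys there are, the slower it gets": if the script enumerates queue keys with KEYS, that is a single O(N) command that blocks single-threaded Redis for its whole run, whereas iterating with SCAN keeps each round-trip small. A sketch using the phpredis extension; the host, key pattern, and per-key work are assumptions, not the actual script:

```php
<?php
// Sketch: iterate job-queue keys incrementally with SCAN instead of a blocking KEYS call.
$redis = new Redis();
$redis->connect( 'jobredis.internal', 6379 ); // hypothetical host
$redis->setOption( Redis::OPT_SCAN, Redis::SCAN_RETRY ); // retry empty iterations internally

$iterator = null; // SCAN cursor, passed by reference; null means "start"
$count = 0;
while ( ( $keys = $redis->scan( $iterator, '*:jobqueue:*', 500 ) ) !== false ) {
	foreach ( $keys as $key ) {
		// ... per-queue bookkeeping would go here ...
		$count++;
	}
}
echo "Scanned $count queue keys\n";
```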
[23:34:57] the bad part was i needed to go back and double-check the handling of wiki requests where i got logged out while trying to handle them
[23:35:31] seems to be ok now tho
[23:44:05] oh
[23:44:12] do you use "keep me logged in"?
[23:44:21] if you don't, you'll likely end up logged out
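For the "keep me logged in" point: MediaWiki's ordinary login cookie lifetime is governed by $wgCookieExpiration, and the longer lifetime used when the box is ticked by $wgExtendedLoginCookieExpiration. A sketch with illustrative values, not Miraheze's actual settings:

```php
// LocalSettings.php fragment (sketch); the numbers are illustrative only.
$wgCookieExpiration = 30 * 86400;               // ordinary login cookies: 30 days
$wgExtendedLoginCookieExpiration = 365 * 86400; // with "keep me logged in": 1 year
```

Server-side session data also lives in whatever backend $wgSessionCacheType points at, so evictions or restarts there (such as the Redis crash earlier in the day) can surface as the kind of random logouts described above.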