[00:45:34] <.generalnuisance, replying to bwm0> Thanks, sorry I missed this, filling out the form now!
[01:01:33] <.labster> Where is the section for everything all at once in the form? /s
[01:01:55] <.generalnuisance, replying to .labster> it's next to the everywhere
[01:02:47] I think that's the section me and you signed up for
[01:03:05] <.labster> Oh! I see it. It's next to the Any key.
[03:02:41] hey guys, just reading back now
[03:03:54] bingbot is hitting you guys >500k times a day, i'm not sure exactly how crawl-delay works but it sounds to me like it's not being respected at some level, or it works differently than you think
[03:06:16] <.labster> I assume that's per-TLD?
[03:07:29] I've never quite understood how robots.txt really works, it never seems to behave the way you expect
[03:07:51] <.labster> it works as well as the shitty bot on the other end works.
[03:08:08] Lol
[03:12:47] yeah this is the actual answer lmao
[03:14:47] do you have control over the robots.txt for the wiki subdomains?
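(For context: robots.txt is fetched per host, so each wiki subdomain effectively serves its own copy. A minimal sketch of asking bingbot to slow down — the 10-second value is an arbitrary example, and many crawlers simply ignore Crawl-delay:)

```
# hypothetical per-subdomain robots.txt
User-agent: bingbot
Crawl-delay: 10
Disallow:

# everything else keeps whatever default rules are in place
User-agent: *
Disallow:
```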
[03:23:45] hey @paladox is there a reason mw134 only shows 2.6 cores CPU?
[03:23:55] on https://grafana.miraheze.org/d/W9MIkA7iz/miraheze-cluster?orgId=1&var-job=node&var-node=mw134.miraheze.org&var-port=9100&from=now-24h&to=now-1m
[03:28:56] so did the mw rotation help?
[03:30:02] <.labster> I haven't been browsing heavily but it feels like it helped. Should look at the numbers to see if it's not placebo.
[03:30:16] it has significantly helped yeah
[03:30:29] you no longer have a 33% chance of having a request take 10-20x longer
[03:34:05] it hasn't really helped the swiftobject stores on cloud12 though, which is a bit surprising to me
[03:34:18] probably worth doing a deeper comparison of swift11 vs swift12
[03:34:29] since 11 is ok
[03:56:58] actually, who could talk the most about how the swiftobject servers are load balanced, etc? can images end up cached by cp or are those requests always gonna hit the swift servers?
[04:25:34] my understanding is CosmicAlpha knew the most regarding swift, but quit due to stress and drama etc. with how things played out, unsure if they would be welcomed back...
[04:26:02] i just want to know how it's load balanced lol, all that other stuff is not relevant right now
[04:30:55] ig ping void...
[04:31:10] No idea, swift came after my time in SRE
[04:33:38] based on my experiments it seems like the answer is yes, it ends up in cp33
[04:33:59] and then gets load balanced into mw*
[04:34:12] and then gets sharded into swiftobject*
[04:35:33] That sounds reasonable enough, so i wouldn't be surprised
[04:37:21] but then that begs the question of how all your swift object stores are so overwhelmed
[04:37:59] maybe they're actually not, and this level of load is fine
[04:38:28] I think swift is just that heavily used in general, but i have no idea
[04:38:39] it doesn't seem like it should be
[04:39:04] I've asked CA whether they'd be amenable to doing a KT regarding the current setup, since I don't believe a graceful handoff was able to happen
[04:39:32] it seems to me like you guys should be averaging like ~20 qps hitting all of the swift servers combined
[04:39:46] balances out to like 4 disk reads per second, which is nothing
[04:40:03] Yeah i have no idea
[04:40:18] er sorry, i'd adjust that estimate up to 30 qps
[04:40:31] but really, same idea
[04:43:51] total disk reads seem about 10x higher than they have any reason to be
[04:44:11] unless there's additional stuff hitting swiftobject* besides the mw nodes
[04:56:12] maybe the image dumps? not sure if those are directly interacting with swift or still going through mw
[04:58:34] something is just not adding up though. on a machine like that you are averaging like 700 kb/s outward network transfer, but you're reading 7000 kb/s, doing 300 disk ops per second, and fully maxing out the disk utilization
[05:00:05] you have some sort of problem, because swift shouldn't be working that hard
[05:00:17] maybe a @paladox question
[05:42:04] <.labster> Image dumps are still manual, so if no one is answering you it seems kind of unlikely
[05:44:00] <.labster> /me is starting to wonder if Southparkfan started running SETI@Home one day for fun and we all just kept copying the config
[05:47:30] Wonder if it might have anything to do with the fact that the container server is missing info for a large percentage of files
[05:48:59] [1/2] Miraheze shouted into the starry void, and the void answered back.
[05:48:59] [2/2] In 2024, an intrepid group of internet pioneers reveals its answer to the world...
[05:49:14] Could also be a replication problem caused by one server being completely full
[05:50:06] But paladox and UO were the two who put in the most work on swift, so I'm not 100% certain I can diagnose the issue.
[05:50:52] yeah this seems plausible. i don't really know how swift does backoffs though
[05:51:10] really i don't know much about swift at all. what is the purpose of the swift account server? why is it working so hard (especially every 30 minutes)
[06:14:41] every 30 minutes might coincide with a puppet run, but that's a bit large of a spike for that
[06:15:32] the account server is for authentication and container listing (as well as listing the objects inside the container)
[06:47:16] yeah i guess i'm just confused why it's that resource intensive
[06:47:28] i'm guessing there is probably something fairly simple that is misconfigured, across the board, for the swift stuff
[06:48:01] because none of it is passing the smell test
[06:48:51] i wonder if it's doing something crazy like getting a full list of the files on each server for each request
[07:02:31] btw sorry @paladox, I retract this question, i didn't realize the "CPU cores" number was also an average, rather than a current point in time
[07:31:41] fellas, several users in #general report language inconsistencies w/ the interface
[07:42:38] <.labster> Do we know which wikis or have a phabricator task?
[07:46:50] [1/2] @solidblock @wehentochter @Herr Person
[07:46:50] [2/2] please name your wikis (re language bug)
[07:48:24] I guess one of you can create a phab task
[07:48:51] sipnayan.miraheze.org
[07:58:12] [1/2] ok, my wiki is affected too
[07:58:12] [2/2] softcell.miraheze.org
[07:58:49] after refresh it's all Russian again
[07:59:14] and another refresh gets me back to English
[08:00:14] [1/2] Maniac Xena @ #general
[08:00:14] [2/2] > OK, I think the mediawiki system randomly fails to load the locale setting
[08:03:17] rs.miraheze.org
[08:05:03] [1/2] phab task was created by someone else @.labster
[08:05:04] [2/2] https://phabricator.miraheze.org/T10997
[08:06:46] <.labster> If it only happens on some refreshes, perhaps the new mw instances aren't set up completely.
[08:07:24] cc Void and Paladox?
[08:10:08] <.labster> Yeah, it changes on refresh
[08:12:00] can you look at the X-Served-By header to see which mw machine it's serving from
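(A quick way to check this, assuming curl is available; the URL is just one of the wikis reported above, and whether X-Served-By is actually emitted depends on how the cache/app layers are configured:)

```
curl -sI 'https://sipnayan.miraheze.org/wiki/Main_Page' \
  | grep -iE 'x-served-by|x-cache|server'
```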
[08:12:15] <.labster> ok
[08:12:27] <.labster> now that I'm on a laptop
[08:15:40] <.labster> not seeing that header
[08:20:00] <.labster> they deleted "Main Page" and it's just not present in 404s
[08:20:13] <.labster> which I think is probably a different bug
[08:22:40] "they"? if that helps, I've moved my main page from the main namespace to the project one (+ displaytitle)
[08:23:16] ofc set in the MediaWiki page too
[08:23:39] <.labster> I don't think that matters much
[08:24:15] <.labster> I've seen it broken on mw143 and mw134 so far
[08:25:36] well, those are the new ones
[08:25:53] so that's probably what's going on. i don't know enough about your setup to give any other useful advice tho
[08:28:59] <.labster> haha, I clicked Special:Random and got a 404 back because of the namespace translation
[08:30:28] <.labster> It would be nice to get X-Served-By in the 404s though.
[08:34:53] <.labster> Updated the ticket, now we wait for people with more knowledge
[08:53:25] This is the header response I receive from a successful request of NavigationBar.js
[08:54:26] and this is the failed version
[08:55:40] I think there is no problem with the protocol, and something goes wrong inside the server
[08:57:42] what is it that makes it a "failed" version btw?
[08:57:54] like what is the difference in output, besides the headers
[08:58:42] Nothing. I just refresh the page and it turns into a failure.
[08:59:41] I wonder when they pooled the new mw* if they made sure to rebuildLC, and stuff like that
[08:59:52] Anyway it's 4am, i'm going to bed
[08:59:54] nah i mean, what does "fail" mean
[09:00:05] the language cache issues?
[09:06:31] [1/2] I think so.
[09:06:31] [2/2] BTW now I cannot reproduce the error anymore (receiving response 200 or 304), but the Purge button disappeared. Has a maintenance process started?
[09:07:17] Oh nevermind. Receiving 403 starts again
[09:13:02] [1/2] Oh I think I have found a suspicious error
[09:13:02] [2/2] The HTTP request results I show above come from a javascript file of the wiki that hides the site notice
[09:14:06] Whenever the request fails, templates inside the site notice get broken
[09:14:19] and it shows 'Expression error: Unrecognized word "jun".' above the site notice
[09:14:53] BTW the requested javascript file of my wiki doesn't have the word 'jun'...
[09:17:06] ...Oh this IS the problem. The locale breaks down whenever the error notice shows
[09:18:27] site notice from Meta?
[09:20:24] No, the site notice inside my wiki
[09:22:15] This is it
[09:22:20] [1/2] my wiki doesn't have an active site notice tho <:ThinkerMH:912930078646730792>
[09:22:21] [2/2] although, Russian is set only in my user preferences, the entire wiki itself is English
[10:20:26] Yes. That'll fix itself eventually. I installed mw134 before the DNS got updated, so our Prometheus wasn't actually reaching mw134 but I think our cps, because we just wildcard everything to that if it's not defined in our dns file.
[10:23:49] Well, the ones on swift12 have more storage and also share disks, compared to the ones on swift11 which have an hdd per vm
[10:28:14] I guess I shall just change the weight on our Swift object servers to 100. We have some at 4000 and I think one at 5000/6000 because Swift wasn't really load balancing well as one server ran full; nevertheless that's happening now, but just with the one with the biggest storage drive
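(For reference, Swift ring weights are changed on the ring builder files and the rebalanced ring is then distributed to the nodes; roughly something like the sketch below, where the builder path and device search value are placeholders for the real ring layout:)

```
# hypothetical example; the device search value depends on the actual ring
swift-ring-builder object.builder set_weight <ip>:6000/sdb1 100
swift-ring-builder object.builder rebalance
# then push the updated object.ring.gz out to the proxy and object nodes
```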
[10:49:01] i think if i were you i would try to figure out why the disk is so busy all the time
[10:49:18] it doesn't really make sense to me. are you able to look at swift's logs?
[10:49:43] it seems like each of those machines should only be serving like ~5 images a second, but they're cranking SUPER busy all the time
[11:37:05] [1/9] okay so my remaining important tech suggestions, just from poking around for a few hours:
[11:37:06] [2/9] - try and figure out why the swift object servers are so maxed out. they should be reading about 10% as much disk as they actually are
[11:37:06] [3/9] - take a look at the nginx logs again (probably the mw* ones will suffice, don't need cp) and see if the crawlers are being respectful of the robots.txt changes. there are probably some more you could consider blocking, like
[11:37:06] [4/9] ```netEstate NE Crawler
[11:37:07] [5/9] babbar
[11:37:07] [6/9] SentiBot
[11:37:07] [7/9] DataForSeoBot
[11:37:08] [8/9] naver.me
[11:37:08] [9/9] TurnItIn```
[11:45:25] [1/2] - there's also a few questionable traffic patterns that are crawling you guys in weird ways that you might want to ratelimit, but because those are legit UAs just being spoofed by bad people, they're probably better blocked at the IP level (assuming that's a thing you can do - i'm just guessing that they're all on the same IP, but they might not be)
[11:45:26] [2/2] - set $wgMFStripResponsiveImages to false, it fragments the parser cache
[11:45:58] the ac server is account/container. It runs the swift account/container processes. It is the most intensive, hence why they recommend using ssds for it.
[11:46:30] https://docs.openstack.org/swift/pike/admin/objectstorage-components.html#accounts-and-containers
[11:49:20] do you know what the schema is for the account server's sqlite tables?
[11:49:43] i'm just very confused how 30 queries per second to sqlite is causing so many issues
[11:50:00] Nope
[11:50:13] and i guess, also what is causing it to max out every 30 minutes
[11:50:14] it's not like one database. It's like a ton.
[11:50:24] maybe it's some backup process?
[11:50:34] we don't backup every 30 mins
[11:50:55] well it's definitely doing something every 30 minutes that causes it to run out of resources
[11:53:32] looks like wgMFStripResponsiveImages doesn't exist anymore.
[11:54:04] it looks like it's puppet
[11:55:53] hmm nvm
[11:56:05] > 22,52     root /usr/local/sbin/puppet-run > /dev/null 2>&1
[12:07:41] could be container-auditor.
[12:07:55] i'm seeing high read and io >50%
[12:08:36] yeh looks like it's that
[12:13:30] fyi i think i fixed the lang issue with MW
[12:32:17] oh hell yeah
[12:32:45] https://phabricator.wikimedia.org/T326147
[12:34:38] oh, it was removed for 1.40
[12:35:32] So i should just set it to false @cookmeplox ?
[12:46:10] yeah
[12:46:23] basically what it will do is significantly increase your parser cache hit rates
[12:46:47] in exchange for getting rid of a bandwidth optimization that most people hated to begin with, and that became completely unnecessary like 5 years ago
[12:48:36] done
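(The parser-cache change agreed on above amounts to a single setting in LocalSettings.php, or the farm's per-wiki equivalent; it only has an effect while the installed MobileFrontend version still supports it, since it was removed upstream around MediaWiki 1.40 per T326147:)

```
// Stop MobileFrontend from stripping srcset attributes for mobile views;
// keeping mobile and desktop HTML identical avoids fragmenting the parser cache.
$wgMFStripResponsiveImages = false;
```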
[14:29:04] could the new mw's be the reason behind slow cache updates?
[14:29:26] slow cache updates?
[14:30:35] [1/3] a single such report so far
[14:30:35] [2/3] https://discord.com/channels/407504499280707585/407537962553966603/1120705060947624017
[14:30:36] [3/3] might be something different ofc
[14:30:36] the firewall is open to the new mw*
[14:57:17] Oh, that would be parser cache
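(On the earlier question of whether the newly pooled mw* hosts had "rebuildLC" run: that refers to MediaWiki's localisation cache rebuild script. A hypothetical invocation — the --wiki value is a placeholder, and whatever farm-specific wrapper is normally used isn't shown in this log:)

```
# run on each newly pooled mw* host
php maintenance/rebuildLocalisationCache.php --wiki=<somewiki> --force
```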