[00:34:22] Hi
[00:38:07] <.labster#0000> Hi
[00:45:05] <.labster#0000> I'm starting to wonder if Phabricator has good tasks for people not on the SRE team to be ready to take on. Fairly early on, config change PRs were common enough that this was a common path in.
[00:48:46] <.labster#0000> There are people who want to help, and it would help if there were shovel-ready tasks that don't require NDA.
[00:58:39] I could take a look through and see what can be done without server access
[01:01:32] Is a simple support matter; they may need to configure a Wordmark depending on what skin they are using.
[01:03:37] - extension install. Doesn't technically need to have a security review (per use by WMF). Someone could do the early work in [[Tech:Adding a new extension]]
[01:03:51]
[01:05:08] - Config request, check with currently assigned volunteer before working
[01:06:22] If someone is familiar with OpenStack Swift (or has the ability to do some local testing for me) please get in touch.
[01:07:05] - This is a full extension security review, if someone wants to take a look at that.
[01:08:44] - Extra extension reviews
[01:09:12] extension reviews are normally for security engineers though?
[01:09:13] https://meta.miraheze.org/wiki/Miraheze_Volunteering_Opportunities#Security_Engineers_(Infrastructure/MediaWiki)
[01:09:44] or else appointed security reviewers
[01:10:55] It shouldn't require NDA access to do the review, but whether or not we would accept the results of the review depends on how well it is done, as well as an additional review by an appointed reviewer (which we don't have yet).
[01:12:41] - Extension bug that needs further investigation
[01:13:02] In Wikimedia Phabricator, they have a tag "good first task" for those things. Maybe that's how it should work here as well.
[01:14:21] - Extension(?) bug. Needs follow-up and further investigation.
[01:14:52] - Extension bug that needs further investigation
[01:16:45] - Extension bug with stack trace, can ask on task for further/current logs.
[01:19:53] There are also a lot of open low-priority tasks involving some sort of development or other that could get picked up.
[01:41:35] I'm not super familiar with Grafana, is there a way to see a fleet overview?
[01:42:01] Like I'd love if I could see the CPU utilization for all the servers, without having to click through to each of the 50 servers
[02:22:36] you can view different dashboards here, i saw some stuff that looked like cpu for all app servers, not sure if there's a dashboard for 'all servers' https://grafana.miraheze.org/dashboards
[04:15:06] Void or someone else with access can probably throw one together. Prometheus monitors all of the servers individually, but there is more than likely some `total(count….` Proxmox query for this.
[04:18:55] really i want to see each of them broken out individually if possible
[04:19:11] but without having to hit the dropdown 50 times
[04:21:44] Yeah, it’s definitely possible. You can either have a separate widget (all on the same page) for each server, or you can have all of them on the same graph with each one toggleable on/off.
[04:25:08] [1/2] Something like
[04:25:09] [2/2] `sum by (instance) (irate(node_cpu_seconds_total{mode!="idle"}[5m]))`
[04:44:12] <.labster#0000> https://news.ycombinator.com/item?id=36386112 Front page again today
[05:01:28] When is my interview with new york times 🤣
[05:01:41] poggers. also 6.2 mil page views is probably 9+ million requests just for uncached content apparently? but most of that is probably bots so lol, no clue what # of pages are actually being viewed by people
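[Editor's note: a minimal sketch of the kind of fleet-overview panel query being discussed, building on the expression quoted at 04:25. It assumes the standard node_exporter metric and label names (`instance`, `mode`); the actual label set on this Prometheus may differ.]

```promql
# per-server CPU utilisation as a percentage of that server's core count:
# busy core-seconds per second, divided by the number of CPU series (one per core)
100 * sum by (instance) (irate(node_cpu_seconds_total{mode!="idle"}[5m]))
  / count by (instance) (node_cpu_seconds_total{mode="idle"})
```

Using `by (instance)` keeps every server as its own series on one graph, toggleable in the Grafana legend, rather than requiring a dropdown per host.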
[05:02:05] Actually we did the numbers, nginx gets approx 9 million requests a day
[05:02:35] That's just on the mw servers
[05:09:17] <.labster#0000, replying to zppix#0000> Honestly what happened lately is more appropriate to the Daily News.
[05:09:50] Ew no, I'd rather yahoo news at that point
[05:10:20] <.labster#0000> Anyone have contacts at Gawker?
[05:10:54] I've never even heard of it lol
[05:32:19] @zppix @paladox do you guys know how you load balance among the varnish servers? i noticed one of them is located in the US, but it seems to get pretty much the exact same amount of traffic as the others
[05:32:32] is there a geographical aspect to it?
[05:33:31] I haven't been SRE in about 2 years, I'm not sure anymore
[05:34:04] I have a sneaking suspicion that there is no geographical aspect to the load balancing, which makes the US varnish server kinda actively harmful
[05:36:37] We use gdnsd, config should be at
[05:47:34] [1/7] you probably want to block (via either robots.txt or nginx) the following UAs asap:
[05:47:34] [2/7] ```
[05:47:35] [3/7] Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
[05:47:35] [4/7] Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
[05:47:35] [5/7] Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
[05:47:36] [6/7] Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
[05:47:36] [7/7] Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)```
[05:50:44] [1/4] Consider blocking:
[05:50:45] [2/4] ```Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)```
[05:50:45] [3/4] Consider ratelimiting:
[05:50:45] [4/4] ```Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36```
[05:51:55] [1/2] Consider implementing nearly all of the robots.txt rules from , since nearly half of the traffic from even legit scrapers is garbage like
[05:51:55] [2/2] `https://polcompball.miraheze.org/w/index.php?title=Special:CreateAccount&returnto=File:Techcratsoc.png`
[05:53:16] yandex is trash so don't feel bad about it lol
[06:01:49] who was saying matomo said 200k or something pageviews a day?
[06:02:20] i don't have access to that dashboard but the nginx logs suggest it's more like 2 million a day, unless i'm badly misinterpreting
[06:55:01] Might it be unique pageviews or human page views
[06:55:13] Matomo needs an ldap account
[06:55:38] i'm filtering it down more or less to humans, i think
[06:56:07] Not sure. They were my only guesses.
[06:56:29] I don't have access to Matomo either.
[07:37:37] That was <@529521187492069396>
[07:40:23] [1/2] But you might also be correct, Cook.
[07:40:24] [2/2] https://discord.com/channels/407504499280707585/407537962553966603/1004552929199521812
[07:53:21] am I right in thinking that you guys use `thumb_handler.php` for private wikis, but static.miraheze.org for public images?
[08:07:09] also are cloud10, cloud11 and cloud12 basically the same specs?
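[Editor's note: a minimal sketch of what the nginx-level UA block suggested at 05:47 could look like. The `map`/`return 403` pattern, the variable name, and keying on UA substrings are assumptions for illustration, not the actual Miraheze vhost configuration.]

```nginx
# http-level map: flag requests whose User-Agent matches one of the bots listed above
map $http_user_agent $blocked_ua {
    default 0;
    "~*(Bytespider|AhrefsBot|PetalBot|DotBot)" 1;
}

server {
    # ... existing vhost directives ...
    if ($blocked_ua) {
        return 403;  # or 429 if a softer signal is preferred
    }
}
```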
[08:11:19] the distribution of resources across physical machines seems extremely suboptimal
[08:11:57] img_auth.php
[08:12:20] I believe @paladox was saying some are Dell, some are HP
[08:12:33] like, afaict the reason your site is slow is that you have a crazy percentage of your core infrastructure on cloud12
[08:13:10] and cloud12 is one of your worst machines
[08:17:38] if I were you i would immediately get rid of mw121 and mw122 and create mw133 and mw143
[08:18:25] cloud12 only has 24 cores total but it's got 1 core db (averages like 85% utilization), 2 swift object stores (95%), and 2 webservers (average like 98% lol)
[08:19:21] each of which is 4CPU
[08:19:47] so that's like...nearly half of your active CPU usage, but it's all on this one shitty server
[08:20:13] and since you can't easily move the db or the swift object stores (because they're on disk), move the mw
[08:21:09] i think proxmox supports overcommit, are you using it at all? you will almost definitely want to overcommit the cpu
[08:23:54] because right now you're just sitting with 60% of your CPU always idle pretty much
[08:24:27] but your most important stuff (mw, db, swift) is artificially constrained
[08:44:22] pop quiz, what do your 4 worst-performing virtual machines have in common
[08:52:41] @cookmeplox all cloud12
[08:53:35] the primary determinant for whether you have a bad, slow experience on miraheze is whether you get unlucky and hit a server hosted on cloud12
[08:54:01] i am not convinced this is a property of the server itself being bad, but rather the fact that it's trying to do so much at once
[09:25:37] also do you guys use memcached for parser cache + wanobjectcache?
[09:26:20] you have like 60GB of memory each on cloud13 and cloud14 that cannot possibly be getting used, but mem131/mem141 are only allocated 12GB of memory
[09:26:33] you should make those like 60GB tbh
[09:27:10] we're using about 50% memory on all but cloud12 at the moment
[09:28:05] i bet your memcached stuff is constantly getting evicted because of how little space it has access to
[09:28:26] https://grafana.miraheze.org/d/0uBBwmsMk/memcached?orgId=1&refresh=10s&var-job=memcached&var-node=mem131.miraheze.org:9150 looks to be near no usage
[09:28:39] but https://grafana.miraheze.org/d/0uBBwmsMk/memcached?orgId=1&refresh=10s&var-job=memcached&var-node=mem141.miraheze.org:9150 is very high
[09:30:21] hmm that mem131 bit doesn't make sense to me
[09:30:40] https://grafana.miraheze.org/d/W9MIkA7iz/miraheze-cluster?orgId=1&var-job=node&var-node=mem131.miraheze.org&var-port=9100&from=now-30d&to=now-1m is showing like 97% memory used
[09:30:51] @orduin how much memory is left on cloud13/14
[09:30:52] but the actual memcached process is barely using any of it?
[09:31:03] could we increase it
[09:35:28] seems that way
[09:35:39] that's worth looking into
[09:47:38] another tidbit: average request latency is 10x higher on mw12* than it is on mw13*
[09:48:38] who is still around that knows how to provision a new mw vm?
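[Editor's note: a quick way to test the eviction theory from 09:28 and the kind of change being proposed, sketched with an assumed host/port and the stock Debian config file location; the real values may be puppet-managed.]

```sh
# compare the configured cap against live usage and the eviction counter
echo stats | nc -q 1 mem131.miraheze.org 11211 | grep -E 'limit_maxbytes|bytes |evictions'

# raising the cap is the -m flag (value in MB) in /etc/memcached.conf, e.g.:
# -m 49152    # ~48 GB instead of the current ~12 GB
```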
[09:51:31] honestly i think literally just turning off mw121 and mw122, with no other changes, might be an improvement
[09:51:54] that plus the robots.txt change is probably a net win
[09:56:13] @cookmeplox provisioning mw is easy
[09:56:33] creating the VM will be the bit that I don't trust the docs on
[09:56:47] Only @orduin has the access
[10:52:36] I’m not sure if we have the disk space to start more vms
[10:53:08] https://phabricator.miraheze.org/P386
[10:53:27] sorry but can someone check https://discord.com/channels/407504499280707585/1120301128404439050 ?
[10:53:55] Oh looks like we may
[10:54:26] Could just do one on cloud13 and one on cloud14. And shut down the one on cloud12?
[10:54:59] how much disk space do you need for a mw instance?
[10:55:36] I think we have it at 25g each
[10:56:45] how is cloud13 anywhere near disk capacity?
[10:57:02] the only disk-heavy thing on it i see is db131, everything else is tiny
[10:58:32] oh wait, wow, cloud12 is even more over-provisioned than i realized
[10:58:44] because those mediawiki instances are 6 cores, not 4
[10:59:04] so you have 6+6 (mw) + 4+4 (swift) constantly busy, that's already 20 CPU out of 24 out of the way
[11:00:30] Yeh it has fewer cores than cloud13/14 as well
[11:01:43] then db averages 2...that machine is constantly throttled
[11:01:54] no wonder requests take 10 seconds, there's no free CPU
[11:02:03] Well we use thumb_handler for when thumbs don’t exist
[11:02:16] We have a 404 handler in Swift proxy. So pretty much all use it
[11:02:28] Private wikis use img_auth to display images tho
[11:02:32] what is actually hosted on cloud11 and cloud10?
[11:02:51] i see cloud11 has a lot of swift object stores
[11:03:07] Cloud11 hosts Swift. And cloud10 db101/112 (we needed to decommission cloud10 but disk issues on cloud13/14 have prevented that)
[11:03:25] oh db112 is on cloud10?
[11:03:34] Yeh we moved it from cloud11
[11:03:55] are those normal mw dbs or something else
[11:03:57] It stores misc databases. Such as icinga/matomo and I think phab
[11:04:11] Matomo is the biggest (60+g)
[11:04:26] db101 is normal mw db
[11:04:37] alright
[11:04:48] Been waiting for the extra disk space to spin up more dbs and move the dbs
[11:04:56] i think a ton of your problems will immediately disappear if you give cloud12 breathing room
[11:05:08] because that's also what is making db121 suck
[11:05:20] and making the swift stores on cloud12 suck
[11:05:29] Oh
[11:10:48] Tbh some of the Swift objects are stored on cloud12 because we have a mix of HDD SATA / SCSI on cloud11, and one of the disk types was far too slow to handle anything, so load went up to 50 or 100.
[11:13:53] makes sense
[11:17:35] tbh i'd even make a 4th mw server on cloud13
[11:17:55] you're just not using the CPU otherwise
[11:19:42] Yeh you can also do it on cloud14 as well
[11:23:11] i was thinking 1 new one on cloud14, 2 new ones on cloud13
[11:26:42] Oh
[11:32:38] also can you change the CPU allocation of the db servers without restarting them?
[11:36:52] is there anything that sets db142 apart from db131, db121, etc?
[11:37:09] it has way higher usage
[11:47:57] Nope, requires restarting the vm
[11:48:14] i'm super curious about the distinction between db142 and 131
[11:48:26] it seems like you use them both for parser caches, and they should have nearly identical specs
[11:48:36] Not really. Apart from cloud14 maybe having disk issues as I explained
[11:48:45] oh yeah, what's the deal there?
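[Editor's note: a sketch of the Proxmox side of the reallocation discussed above, using hypothetical VM IDs; as confirmed in the chat, the new core count only applies after the guest is power-cycled.]

```sh
# run on the relevant cloud1x hypervisor
qm set 121 --cores 4             # shrink an over-provisioned guest (VMID is hypothetical)
qm shutdown 121 && qm start 121  # restart so the new allocation takes effect
qm config 121 | grep cores       # confirm the change
```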
[11:48:49] I notice performance issues that I don’t really see on cloud13
[11:48:56] this seems maybe disk-related
[11:49:10] it's not BAD like the cloud12 stuff, but it does seem abnormal
[11:49:18] Yeh
[11:50:47] can you say anything more about the disk issue? like where you've noticed it?
[12:11:07] I’ve noticed it when ssh’ing
[12:11:19] Simple commands would take a while - even ssh was slow
[12:11:45] It’s ssds so the disk shouldn’t be slow…
[12:11:58] Think there could be a fault in the hardware or in the disks.
[12:17:18] [1/3] okay. i'm gonna go for a bit but i definitely recommend trying to do:
[12:17:18] [2/3] - robots.txt changes
[12:17:19] [3/3] - mw12* stuff
[12:17:36] i genuinely think most of the reputation for slowness is from the roulette wheel of people hitting cloud12
[12:21:16] We’ve always had the slowness even when we were on ramnode for all of our vms.
[12:21:41] Wonder if we should increase php children to like 30 in addition to the vm changes?
[12:25:32] i wouldn't do that at the same time
[12:26:42] Ok
[12:53:29] Not another disk failure plz
[16:23:51] I feel like I would have found something in smartctl if it was a disk problem. It's more likely something funky with the CPU or raid controller. Especially since the slowdown is not consistent; within the span of less than an hour it can go from really bad to not any worse than another mw server.
[16:29:20] Would you be able to do the mws on cloud13/14 @orduin ?
[16:30:06] I would but I don’t have the perms anymore
[16:35:25] You would like to do it yourself?
[16:38:55] Only the mw servers.
[16:39:56] I can do that, need to eat first
[16:41:40] @orduin remember puppet won't bootstrap mediawiki. You'll need to run deploy-tool with pretty much every parameter
[16:47:25] So cook said 2 mws on cloud13 and then one on cloud14
[17:00:58] (that i guess would be mw133, mw134 and mw143)
[17:02:19] Yes.
[17:02:57] of course you have to update https://github.com/miraheze/puppet/blob/master/modules/mediawiki/files/bin/deploy-mediawiki.py#L36 first before running it
[17:10:49] Are you sure?
[17:11:13] I'm pretty sure it would just refuse to prep. It should bootstrap fine.
[17:14:51] I'd say it's probably better not to, and to add it after
[17:15:17] https://github.com/miraheze/puppet/blob/master/modules/mediawiki/files/bin/deploy-mediawiki.py#L174
[17:16:04] it sets the servers variable to all the ones in that list unless you specify a list of specific servers
[17:16:22] that var is used throughout even for rsync
[17:16:42] Yes but you would be specifying a list if you are bootstrapping
[17:16:55] You don't want to bootstrap all
[17:17:05] ah, right
[17:17:35] Bootstrapping all for no reason is a bad idea. The deployment tool is not fast.
[18:36:42] https://github.com/miraheze/dns/blob/master/config
[18:36:59] Cp3 is US and cp2 is GB
[18:48:42] We have bingbot ratelimited with `Crawl-delay: 1`, could it be worth increasing that value?
[18:57:07] Could do pretty much a copy of the website he linked to
[18:57:32] They don’t really need to load the special or api.php pages i think
[18:58:24] Yeah, you can see what I'm doing here:
[18:58:38] https://en.wikipedia.org/robots.txt
[18:58:43] Just wondering if upping the rate limit for bingbot makes sense based on the data
[19:00:41] I guess just do it and see. (Would want to do this alongside the new mw server setup)
[19:00:52] 5 second delay isn’t going to hurt really.
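[Editor's note: a sketch of the robots.txt direction being discussed, loosely modelled on https://en.wikipedia.org/robots.txt as linked above. The path prefixes assume the usual /wiki/ article path and /w/ script path, and the 5-second bingbot delay is the value floated in the chat, not a deployed config.]

```
# Slow down bingbot as discussed
User-agent: bingbot
Crawl-delay: 5

# Keep crawlers off script-path and Special: URLs (CreateAccount, api.php, etc.)
User-agent: *
Disallow: /w/
Disallow: /wiki/Special:
Allow: /w/load.php?
```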
[19:01:08] robots is easy, new servers are low on my priority list
[19:02:38] enwiki's will be useful for localization
[19:05:52] See your DMs @orduin pls
[19:08:32] @orduin can I DM too?
[19:08:51] any time
[19:33:51] working on setting up mw133/34/mw143
[19:45:09] Hey paladox, want any of your sre-related role tags back?
[19:46:09] Sure i suppose.
[19:46:22] thanks
[19:47:10] No problem, feel free to ask me to shake 'em off again if the color gets annoying. 🙂
[19:47:17] 😄
[19:47:51] So the color being annoying is the only viable reason to remove them again? 🤣
[19:47:51] The color looks good on you
[19:52:43] wait, paladox is back officially?
[19:53:14] robots.txt change should be live soon, but may take a while to get through the cache
[20:11:23] something i noticed is that we have the read ahead policy set to read ahead on cloud13 but adaptive read ahead on cloud14.
[20:11:35] this is on dell (idrac)
[20:11:47] does anyone know anything about that and could that explain the slowness?
[20:18:34] new robots.txt is live, I'll monitor UAs tomorrow and see if anything needs a 403 synthed in varnish
[20:19:06] @paladox you good with deploy tool to install mediawiki?
[20:19:31] Just waiting for mw133/34 to finish
[20:19:43] but i gather it's just deploy-mediawiki with all the args
[20:19:54] @paladox pretty much ye
[20:21:43] deploy-mediawiki --config --world --landing --errorpages --l10n --extension-list --force --ignore-time --servers=mw133,mw134,mw143
[20:22:05] ^ should be it @paladox
[20:22:12] thanks!
[20:25:08] Ping me if it breaks
[20:25:26] I'll be up another hour or so. I'm 90% certain it would be my fault anyway.
[20:37:34] @paladox what did it fail with?
[20:38:36] i'm not actually sure
[20:39:45] @paladox logs?
[20:39:56] It's towards the start
[20:40:32] @paladox full shell log?
[20:40:47] [1/3] > Execute: sudo -u www-data rsync --inplace -r --delete -e "ssh -i /srv/mediawiki-staging/deploykey" /srv/mediawiki/config/ www-data@mw133.miraheze.org:/srv/mediawiki/config/
[20:40:47] [2/3] > rsync: [receiver] open "/srv/mediawiki/config/OAuth2.key" failed: Permission denied (13)
[20:40:48] [3/3] > rsync: [receiver] open "/srv/mediawiki/config/PrivateSettings.php" failed: Permission denied (13)
[20:41:30] maybe i need to add mw133/34 varnish side?
[20:41:36] No
[20:41:48] That's file permissions
[20:42:21] Check it's writable by www-data
[20:42:56] Or check it matches another mw*
[20:43:01] [1/2] > root@mwtask141:/home/paladox# ls -lah /srv/mediawiki/config/OAuth2.key
[20:43:01] [2/2] > -rwxr-xr-x 1 root root 3.2K Nov 21 2022 /srv/mediawiki/config/OAuth2.key
[20:43:14] [1/2] > root@mw133:/srv/mediawiki# ls -lah /srv/mediawiki/config/OAuth2.key
[20:43:14] [2/2] > -rwxr-xr-x 1 root root 3.2K Jun 19 20:18 /srv/mediawiki/config/OAuth2.key
[20:44:11] @paladox I think it's best to empty /srv/mediawiki/config and then do it
[20:44:23] Puppet can fix permissions after
[20:45:38] @paladox use --force too, as without it the varnish canary check will fail
[20:46:03] i know, i used it with force
[20:46:20] Not in this one
[20:46:22] but i did it this way to see what's failing. As you can see it's the json being displayed
[20:46:39] The json is only useful if the canary check dies
[20:47:06] But ye, have the config dir empty for first run @paladox
[20:47:32] ok
[20:48:43] works now
[20:49:08] 👍
[20:49:37] It's just puppet doing something weird then. We really should have puppet do a pull function but sadly deploy tool is push only
[21:07:48] mw133/34 pooled in and i've unpooled mw121/22. @cookmeplox
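[Editor's note: a consolidated sketch of the workaround that resolved the rsync permission failure above, as run from the deploy host (mwtask141 in the log). The exact sequence, the single-host example, and the puppet invocation are assumptions stitched together from the chat rather than a documented procedure.]

```sh
# clear the root-owned config on the fresh appserver so the www-data rsync can write it
ssh mw133.miraheze.org 'sudo rm -rf /srv/mediawiki/config/*'

# bootstrap only the new servers (bootstrapping all would be slow and pointless)
deploy-mediawiki --config --world --landing --errorpages --l10n --extension-list --force --ignore-time --servers=mw133,mw134,mw143

# let puppet re-assert ownership/permissions on the synced config afterwards
ssh mw133.miraheze.org 'sudo puppet agent -t'
```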
[21:09:34] seems the load jumped on mw131/32 so it's >10
[21:10:53] I just hit mw132 and that was slow
[21:11:52] Okay testing and it gets faster
[21:12:05] Might be opcache being cold
[21:16:31] The number of active requests has dropped a lot @paladox
[21:43:27] mw143 pooled now