[00:05:31] AaronSchulz: I'm looking at our prod errors ( https://phabricator.wikimedia.org/maniphest/query/1UpD5d6atJJ1/#R ) the oldest one is currently T193565 which is blocked on T226595. The patch for that is currently not passing CI, that might be a good one to brush off and land this week if possible :)
[00:05:32] T226595: Refactor LoadBalancer connection pooling to be more efficient - https://phabricator.wikimedia.org/T226595
[00:05:32] T193565: Foreign query for metawiki fails with "Table 'centralauth.page' doesn't exist" (DBConnRef mixup?) - https://phabricator.wikimedia.org/T193565
[01:15:46] a Google thing I wish the WMF had: Google has an internal site called "rules of thumb" that gives you resource cost equivalencies -- i.e., X RAM costs as much as Y disk which costs this many compute units. One of the resource types you can convert to/from is SWE hours.
[01:16:48] This really helps you think through which optimizations are worth the engineering effort and which aren't
[01:18:02] ori: the exchange rate between hardware and SRE hours is based on purchase price vs hourly pay? If so, I imagine there's another built-in ratio that tells you the estimated time for optimising something to not use X amount of RAM or disk?
[01:18:25] yes, exactly
[01:18:42] or do you put in how many hours you think it would take?
[01:18:50] seems hard to ballpark that for a general case
[01:19:08] well, you generally go for optimizations where the savings are a large multiplier of the effort
[01:19:38] right, it doesn't have to tell you how long it "typically" or "on average" takes to save space or ram, it just has to tell you how many hours it can take before it's not worth it.
[01:20:49] it's not perfect -- there are things the model doesn't capture, like peak demand, and it has nothing to say about latency, because the cost of latency is really non-uniform and hard to measure
[01:21:13] we do a small part of the general ethos here, insofar as we have WP:PERF and "space/mem cost" as a third-rank principle at https://www.mediawiki.org/wiki/ResourceLoader/Architecture#Principles
[01:23:03] yeah I know that page (it's really good)
[01:23:15] but like, one of the things I can help push through before the end of my fellowship (it's not related to abstract wikipedia but let's call it a 20% project) is expiring thumbnails from swift
[01:23:34] the question is -- is it even worth doing at this point? I really don't know
[01:24:25] thumbs are ~500 TB if I'm reading the tea leaves correctly : https://thanos.wikimedia.org/graph?g0.expr=(sum%20by%20(class)%20(swift_container_stats_bytes_total))%20%2F%201e%2B12&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[01:26:00] I don't know what portion of that we can reap because I don't think we currently log object accesses in swift, but let's guess half
[01:28:46] hm.. well, some of it might also be attack risk and pace of growth rather than purely purchase cost, and recurring costs every refresh. Never expiring any thumbs might make it difficult to estimate as time goes on how much space we need, other than looking at what we have and hoping access patterns won't radically change. When we expire things it would presumably be more predictable and correlated with current behaviour in recent history only.
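For illustration, here is a minimal back-of-the-envelope sketch (in PHP, matching the codebase under discussion) of the kind of conversion the "rules of thumb" tool is described as doing, plugged with the figures mentioned above (~500 TB of thumbs, a guess that half is reclaimable, 20 TB disks at ~300 USD). The Swift replication factor and the cost of an engineering hour are hypothetical placeholders, not numbers from the discussion.

<?php
// Back-of-the-envelope "rules of thumb" style conversion: storage saved
// versus engineering hours. Inputs are either taken from the log above or
// are labelled assumptions.
$thumbsTotalTb     = 500;  // ~500 TB of thumbs per the Thanos graph
$reclaimableShare  = 0.5;  // guess from the log: maybe half is expirable
$usdPer20TbDisk    = 300;  // "20TB disks are 300 USD"
$replicationFactor = 3;    // ASSUMPTION: Swift replica count
$usdPerSweHour     = 150;  // ASSUMPTION: fully loaded cost of one SWE hour

$reclaimableTb = $thumbsTotalTb * $reclaimableShare;
$rawTb         = $reclaimableTb * $replicationFactor;
$hardwareUsd   = ( $rawTb / 20 ) * $usdPer20TbDisk;
$breakEvenHrs  = $hardwareUsd / $usdPerSweHour;

printf(
    "Reclaimable: %d TB (%d TB raw) ~ USD %d ~ %d SWE hours break-even\n",
    $reclaimableTb, $rawTb, $hardwareUsd, $breakEvenHrs
);

At these made-up rates the reclaimable space works out to a hardware spend worth only a few dozen engineering hours, which is roughly the shape of the "is it even worth doing" question being asked above.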
[01:29:30] but once we start looking at tea leaves of whether disks and servers can be added fast enough, I've probably proved your point that if that's the bottleneck, it's not worth doing.
[01:30:20] There's also the age-old question of whether we still /need/ to cache thumbs in Swift, given two layers of HTTP caching including disk-backed.
[01:30:28] I mean storage has gotten really cheap, I had no idea, the last time I bought an actual hard drive was like a decade ago
[01:30:57] at that point a larger ATS-backend cluster dedicated to cache_upload might make more sense
[01:31:05] I don't know how well that scales
[01:31:10] like 20TB disks are 300 USD / 250 GBP
[01:31:21] yeah, maybe
[01:32:05] I recall last time we talked about this, the argument that came up was that varnish-backend is unreliable and doesn't persist across restarts despite being disk-backed.
[01:32:12] and back then we only had one central varnish-backend cluster
[01:32:27] whereas now every POP has its own frontend+backend and then direct to applayer
[01:32:58] I'm assuming ats-be does actually persist to disk in a way that survives restarts, given that was a main reason for switching I believe. SRE would have to confirm.
[01:34:17] There's also the age-old question of whether we still /need/ to cache thumbs in Swift, given two layers of HTTP caching including disk-backed.
[01:34:22] yeah that's a really good point
[01:34:27] all that to say, it's a lot harder to wipe/lose backend cache, and at least at the current scale ATS is presumably "better" at managing expiring stuff like this than Swift. But if we scale it up to 10x the size, I don't know where ATS's limits are and where/if it starts to have pains managing so much space.
[01:34:43] whereas it used to take only one issue in eqiad and we'd lose all cache_upload
[01:36:26] it might also help confidence in this direction if Thumbor was able to generate thumbs from something other than the original, which would speed up the long tail of very large originals that take a lot of time to scale down every time.
[01:37:00] godog: ^ pinging because the backlog will be interesting to you
[01:38:45] yeah, I think it might not be worth it :(
[01:42:41] on an anonymous request, MW generates a random session ID and then fetches that session from Kask
[01:43:05] no Set-Cookie header
[01:45:58] Kask is an IKEA product, I assume
[01:48:52] Kask is a cassandra client written in go as a microservice, probably best if I don't get started on that
[01:49:10] it is not today's problem
[01:51:48] ori: https://bash.toolforge.org/quip/pV9AXIIBa_6PSCT9cFfz ;P
[01:54:12] after it's done talking to kask for 4ms (in the strace log I made), it connects to nutcracker and asks for ChronologyProtector positions
[01:58:45] why are we still using Redis for ChronologyProtector? I think it's the only thing using it now
[01:58:59] TimStarling: aye, that seems like a bug. I reviewed this with Aaron a while back when we re-activated the multi-dc project in 2018. If I remember correctly, the original plan was to not trust our ability to set cookies or query params in all scenarios where we do db writes and http redirects, and so it fell back to generating the key based on current IP/UA and then also assuming we're always in a scenario where maybe we recently did a write from such a rare cross-domain scenario.
[01:59:31] What we ended up agreeing on, and I recall removing the wiring code for this in MediaWiki.php, is to no longer support writing to CP from cases where we can't do a query param or cookie.
[01:59:44] which means if it is still reading when there isn't a cookie, it's useless afaik as that can never lead to anything.
[01:59:58] s/cookie/cookie or query param/
[02:05:07] it makes three connections to mcrouter
[02:10:51] CP details at T254634
[02:10:51] T254634: Determine and implement multi-dc strategy for ChronologyProtector - https://phabricator.wikimedia.org/T254634
[02:11:23] TimStarling: as for Redis, yes, it's agreed that Redis is fine to keep using for CP; it's no longer conceptually treated as being replicated for CP, so SRE can turn that off anytime.
[02:12:03] we'll probably move it to a dedicated mcrouter at some point, or if we give it the same treatment as tokens on the assumption that small short-lived data will not be pushed out, then maybe even as-is without a dedicated memc cluster
[02:12:12] but it's not a blocker for multi-dc
[02:12:42] shaving 4ms off of every anon app server request sounds pretty nice
[02:15:03] the session store read I'm less familiar with as to why it does that. Maybe something about the current PHP logic needing to have a session id in memory and not being able to change it after the fact, so that if the session is written to it knows which session ID to use, and knows it isn't already used, so it's querying cassandra to find a `false` to determine that it isn't in use? I'm just guessing here, certainly not my code :)
[02:15:40] yeah I can look into it locally
[02:16:21] might have something to do with legacy $_SESSION compat as well.
[02:17:52] if e.g. we don't have an opportunity to set the ID after the fact. There's probably a comment from Anomie somewhere saying this is "temporary", thus invoking the spirit of https://bash.toolforge.org/quip/AU7VTzhg6snAnmqnK_pc
[02:18:34] I've parked the ChronologyProtector task to look at tomorrow and see what else we left behind that can be removed
[02:19:37] I haven't found any smoking gun in this strace output which would explain a ~20ms performance difference
[02:23:37] TimStarling: hm.. checking a few boxes you probably already checked: after warmups, single url, actually reaching the local codfw appserver, no cross-dc req directly from mw?
[02:23:56] beyond that: local envoy may be misconfigured for some seemingly-local things to be dispatched to eqiad instead of codfw
[02:24:18] maybe kask/cassandra are slower in codfw if you can isolate that.
[02:24:35] https://phabricator.wikimedia.org/P32126
[02:24:49] what about e.g. load.php with nothing, how much does that differ? that should involve no sessions or CP.
[02:25:18] the differences are comparable to the standard deviation, but I tried a few runs of 100 and eqiad was always faster
[02:26:11] also the servers are both R440 purchased in 2021
[02:27:28] No CPU scaling differences?
[02:27:31] I did a ping from the same src-dest pairs (mwmaint and local appserver), in case networking latency differs, but both get 0.1-0.3ms pings to their equivalent appserver
[02:27:33] I know that's caused issues before...
[02:27:43] CPU scaling/"performance mode"
[02:27:51] how are you running strace?
[02:27:56] I can check that
[02:27:58] aye, maybe a CLI benchmark would help rule that out
[02:28:26] ori: I attached to one worker process and ran a comparable number of requests to the number of workers
[02:28:54] sudo strace -p 48746 -ttt -T -s200 -o Foobar2.strace
[02:29:42] TimStarling: I'm a bit uneasy about removing the ancient wmfLoadInitialiseSettings func, could use a second pair of eyes in case I've missed something stupid that a basic mwdebug check wouldn't catch. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/818648 - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/818649/
[02:30:31] I'm working towards CS.php just doing $wgConf->settings = ...; but need a few more steps before I get there
[02:45:54] I'm catching up on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/579653
[02:46:53] so you have a benchmark of that change now?
[02:48:33] TimStarling: I do, https://phabricator.wikimedia.org/T169821#8118758
[02:50:26] hasn't SiteConfiguration::getAll() gone from being called rarely to being called every time?
[02:51:31] TimStarling: yes, it used to be deferred to when we call $wgConf for cross-wiki settings, which is typically in post-send I believe, for jobs and such.
[02:51:55] the main thing that made that operation slower than reading/parsing JSON was that the cache miss path also had to figure out the matching dblists
[02:52:02] which I've taken out by doing the php array index file for that
[02:52:54] there's also a few optimisations I made to that class last year, earlier on the same task
[02:54:25] https://gerrit.wikimedia.org/r/q/bug:T169821+project:mediawiki/core
[02:54:26] T169821: Try to make wmf-config/wgConf's per-wiki configuration cache redundant - https://phabricator.wikimedia.org/T169821
[02:58:19] ok
[02:58:56] I gave +1
[02:59:18] with opcache, you would expect deferred loading of InitialiseSettings.php to not help
[02:59:37] for the codfw vs eqiad test, are both machines depooled or is the eqiad one serving traffic?
[03:00:02] however, SiteConfiguration::getAll() is still slow and caching it was part of the reason for the config cache
[03:00:25] TimStarling: ack, and it was originally a serialized php file, not json. that might've made it slower when that happened.
[03:00:32] we could reintroduce something here in apcu perhaps
[03:01:10] although with how slow apcu recursively copies large structures, I'd still say it'd probably be slower as well. worth measuring.
[03:01:56] or if we're fine generating PHP files in prod at runtime (SRE willing) we could have it write a /tmp/.php executable file like l10n cache
[03:02:09] that'd definitely be faster than getAll()
[03:05:17] we're talking 1.5% of CPU according to excimer, maybe we have bigger fish to fry
[03:05:32] micro-optimisation of the relevant code may help
[03:06:13] if we can make it 2x faster then maybe a cache is not so important?
[03:07:03] ori: the eqiad one was serving traffic
[03:07:09] I'm happy to bench it more, but to me it was just a pragmatic win, getAll feels slow but from measuring it reading/parsing json was slower.
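As an aside, here is a minimal sketch of the "write an executable PHP file like the l10n cache" idea floated above, to show why it tends to beat both APCu and a cold getAll() on the hot path: the var_export()'d array lives in opcache shared memory, so warm requests pay only a cheap include instead of re-deriving settings or copying a large structure out of APCu. The function name, file path and mtime-based invalidation are illustrative assumptions, not the actual wmf-config wiring.

<?php
// Minimal sketch, assuming MediaWiki's SiteConfiguration class is loaded.
// Not the real wmf-config code; names and paths are placeholders.
function getCachedSettings( SiteConfiguration $conf, string $wiki ): array {
    $cacheFile = sys_get_temp_dir() . "/wgconf-cache-$wiki.php";
    // ASSUMPTION: InitialiseSettings.php mtime is a good-enough invalidator.
    $sourceMtime = filemtime( __DIR__ . '/InitialiseSettings.php' );

    if ( is_file( $cacheFile ) && filemtime( $cacheFile ) >= $sourceMtime ) {
        // Opcache serves this as an already-compiled array on warm requests.
        return require $cacheFile;
    }

    $settings = $conf->getAll( $wiki ); // the slow path discussed above
    $php = "<?php\nreturn " . var_export( $settings, true ) . ";\n";
    // Write atomically so a concurrent request never sees a half-written file.
    $tmp = $cacheFile . '.' . getmypid() . '.tmp';
    file_put_contents( $tmp, $php );
    rename( $tmp, $cacheFile );

    return $settings;
}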
[03:07:40] def won't mind adding cache back if I measured it wrong or if there's a faster way we can agree on
[03:08:02] add "use function array_key_exists" for a start
[03:08:06] and we're splitting up IS.php a bit so there's more mtime tracking to do
[03:09:27] which should be fine, it's just me and Amir being lazy
[03:09:38] (ref https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/799272 )
[03:39:15] the CPU frequency governor on the app servers is set to powersave, which only samples CPU load and adjusts frequency every 10ms
[03:40:36] so Reedy's theory is plausible
[03:50:00] according to https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt , ""powersave" sets the CPU statically to the lowest frequency within the borders of scaling_min_freq and scaling_max_freq"
[03:53:56] which is not true because I can see it not doing that
[03:54:43] yeah, the documentation is out of date
[04:31:34] cpufreq_powersave.c is very short and I can believe that it's not doing anything
[04:31:51] I think the new code is in sched/cpufreq_schedutil.c
[04:33:25] the governor is apparently not doing anything, but the scheduler is changing the frequency
[04:35:35] TimStarling: SiteConfiguration is not yet namespaced, so "use function" appears not to make a difference. https://gerrit.wikimedia.org/r/c/mediawiki/core/+/819225/ ran a bench on prod appserver before I realized it wasn't namespaced.
[04:35:48] I
[04:35:50] O
[04:35:59] I'm signing off soon.
[04:36:17] few days left in SF timezone
[04:43:11] o/
[04:43:24] the intel pstate driver has its own "powersave" freq governor that masks/overrides the one you're looking at
[04:43:33] Krinkle: good night!
[09:00:05] ori: thank you for the ping re: backlog, appreciate it!
[09:02:21] indeed a good point re: whether we could get away with not having thumbs in swift at all, I don't know the "working set" of thumbs, for reference swift sees ~3k rps at peak, thumbor serves 1/100th of that or about 30 rps (ballpark)
[18:51:11] godog: ack, so if we go that direction I imagine it'd be a combination of actually trying to decrease swift hits first (e.g. expand cache_upload a bit, or tune more for this purpose), which could take a lot or a little depending on where we are on the curve right now with our current TTL. I imagine the 1d TTL is probably a lot harder on cache_upload than cache_text if indeed they share the same TTL, given files change much less often and don't have a 14d hard limit from MW side for config/skin benefit.
[18:52:05] and then beyond that, assess if user latency of those misses is acceptable after tuning cache_upload, and then see if we can serve that from thumbor well enough or if it needs added capacity, possibly k8s could help with that as well to smooth out certain failure scenarios.
[19:04:41] I'm not sure the motivation or staffing is there for that
[19:06:42] godog isn't on Swift either (he's in observability now), so the interest in T211661 is a combination of nostalgia and the realization that it's very close to landing (or can be, if the rationale is there)
[19:06:42] T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661
[19:15:02] ack, there's in theory a multimedia-related team spinning up or morphed from Structured Data with unclear scope/purpose.
[19:15:44] If the current reorg drift continues, I imagine Swift would likely end up under Data Persistence if the relevant staffing was added/moved/brain-copied to that team.
[19:16:03] esp with Cassandra moving there recently.
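Going back to the scaling-governor thread earlier in the log (03:39-04:43): a quick sketch of how one could confirm from the app server itself that the intel_pstate driver, with its own built-in "powersave" governor, is what is actually in charge rather than the generic cpufreq powersave governor the kernel documentation describes. It only reads the standard cpufreq sysfs files; the paths assume a typical Linux layout and are not taken from the discussion.

<?php
// Print per-CPU scaling driver, governor and current frequency from sysfs.
// If the driver column says "intel_pstate", its built-in "powersave"
// governor (not cpufreq_powersave.c) is deciding the frequency.
foreach ( glob( '/sys/devices/system/cpu/cpu[0-9]*/cpufreq' ) as $dir ) {
    $cpu      = basename( dirname( $dir ) );
    $driver   = trim( file_get_contents( "$dir/scaling_driver" ) );
    $governor = trim( file_get_contents( "$dir/scaling_governor" ) );
    $curKhz   = (int)trim( file_get_contents( "$dir/scaling_cur_freq" ) );
    printf( "%s: driver=%s governor=%s cur=%.2f GHz\n",
        $cpu, $driver, $governor, $curKhz / 1e6 );
}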