[17:58:01] MatmaRex: Hi, IIRC from the Prague hackathon, you worked on the "images" attribute in RL. Does this patch look correct to you? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaBadges/+/1126120
[17:58:12] I did it based on https://codesearch.wmcloud.org/deployed/?q=%22images%22&files=extension.json&excludeFiles=&repos= and couldn't really test it
[17:58:45] Amir1: uhh no. i'm not sure what you're trying to do
[17:59:38] I am myself not sure either 😅
[17:59:49] basically we have these images used in css https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikimediaBadges/+/867ee90d060d2f895d2497ce3077a89a3b2a9d53/resources/skins/vector/wikimedia-badges.css
[18:00:10] "images" is only for when you're using an ImageModule, which would generate the CSS by itself
[18:00:31] and i don't think it can do list-style-image
[18:00:38] you don't need to specify that when using images from CSS
[18:00:55] ah okay, do you know if it's possible to just declare it as a file?
[18:01:10] but why?
[18:01:14] i'm reading https://phabricator.wikimedia.org/T388323 now
[18:01:43] https://phabricator.wikimedia.org/T388323#10616898 this might be useful
[18:02:52] okay. so that's about the file dependencies that RL tracks automatically. if any of these files are touched, the cache key changes (or something like that, i haven't looked at this in a while)
[18:04:07] the files don't change every second but RL bumps it 8 times every second
[18:04:27] Amir1: so, a few questions
[18:05:00] how urgent is this? someone brought up that task today in the platform team meeting, and i think Krinkle and JustHannah were going to look into it later this week
[18:05:04] which I think is coming from the fact that they are not declared explicitly, so mw thinks the dependencies have been removed and rebuilds the module
[18:05:15] it is quite urgent actually
[18:05:15] also, any chance that this is some difference between PHP 7 and 8?
[18:05:22] I don't think so
[18:05:30] this has been the case for a while now
[18:05:41] I can take a look myself
[18:05:44] also, when did it start?
[18:05:58] x2 has always been quite under pressure
[18:06:07] no one looked at why until last week
[18:06:23] i think there was supposed to have been some refactoring in related code. if the start matches some deployment timing, that would make it easier to confirm the cause
[18:10:04] I think conceptually this should have been a problem for a long time. If you don't declare the file dependency explicitly, mw has no way of knowing if it's changed without loading up the css/js files and parsing them. This is my reasoning
[18:13:23] just the CSS. and yes, that's why we record these dependencies, so that we don't have to parse every time
[18:13:34] anyway, i think i can reproduce something locally
[18:13:48] i tried the thing from https://phabricator.wikimedia.org/T388323#10616898 in shell.php
[18:14:13] and the asOf value changes every time i refresh a load.php request for the module. this is probably unexpected
[18:15:37] i'll see if i can find out why it happens
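For context, a minimal sketch of the indirect dependency tracking discussed above: while building a style module, ResourceLoader discovers files referenced from its CSS (e.g. list-style-image / background-image URLs) and records that list, together with an asOf timestamp, so that later requests can notice file changes without re-parsing the stylesheets. The discoverCssFileDependencies() helper and the CSS rule below are invented for illustration (the real logic lives in CSSMin and the Module class); this is a sketch of the idea, not the actual code.

```php
<?php
// Hypothetical sketch of indirect file-dependency discovery (not MediaWiki code).

/**
 * Extract local file paths referenced via url(...) in a CSS string.
 * The regex is only illustrative; it skips absolute and protocol-relative URLs.
 */
function discoverCssFileDependencies( string $css, string $baseDir ): array {
	$deps = [];
	if ( preg_match_all( '/url\(\s*[\'"]?([^\'")?#]+)/i', $css, $matches ) ) {
		foreach ( $matches[1] as $ref ) {
			if ( !preg_match( '%^(https?:)?//%', $ref ) && $ref[0] !== '/' ) {
				// ".." and "." segments are kept as-is here; that detail matters later.
				$deps[] = $baseDir . '/' . $ref;
			}
		}
	}
	return array_unique( $deps );
}

// A rule of the rough shape used by the badge styles (selector invented here):
$css = '.badge-example { list-style-image: url(../images/badge-silver-star.png); }';

// What would get recorded per (module, skin), roughly matching the shell.php
// output mentioned in the task comment ('paths' plus an 'asOf' timestamp):
$record = [
	'paths' => discoverCssFileDependencies( $css, 'extensions/WikimediaBadges/resources/skins' ),
	'asOf'  => time(),
];
print_r( $record );
// paths[0] => extensions/WikimediaBadges/resources/skins/../images/badge-silver-star.png
```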
[18:17:17] Amir1: the problem is not whether they're declared or not. There is explicit logic (array_diff) in RL to make sure we don't store the list if nothing has changed in what we discovered. There will always be indirect dependencies; this is inherent to both CSS (background image) and LESS (@import). The "cache hit" rate is a red herring, since it isn't a cache (you can see there are only 4K writes despite 75% cache "miss" on 400K reads).
[18:17:17] We only store it for modules that have non-empty data, so false == empty array.
[18:17:24] Amir1: can you say if the timing of the problem matches https://phabricator.wikimedia.org/T343492 ? or is that too long ago to see?
[18:20:23] MatmaRex: it has been going on for a while it seems https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1152&var-port=9104&from=now-6M&to=now&viewPanel=2
[18:23:09] To clarify: there is no concept of declaring images (and I'd rather not introduce that either. This is a bug on our end, is all).
[18:24:05] okay, i think i see the problem
[18:24:42] in Module::saveFileDependencies(), we store if there are differences between $paths and $priorPaths
[18:24:51] so i added some debug logging, and
[18:25:24] one of them has paths like this:
[18:25:27] extensions/WikimediaBadges/resources/skins/../images/badge-silver-star.png
[18:25:36] and the other:
[18:25:36] extensions/WikimediaBadges/resources/images/badge-silver-star.png
[18:25:49] lovely
[18:25:49] so we must be normalizing them somewhere in the wrong place
[18:25:55] let me fix that then
[18:26:17] if you already know where that happens, sure
[18:26:24] i haven't gotten to that point yet :)
[18:26:25] I guess the normalise step moved from before to after comparing.
[18:26:47] For now, I will just stop calling them via ".."
[18:28:22] If it's been this way for 4 months, can we keep it for a few more days and fix it without making unintuitive changes to other teams' code only to revert again? As long as we revert I don't mind, it just seems like wasted time/risk
[18:29:48] relative path declaration is counter-intuitive IMHO
[18:34:42] (e.g. risky in the sense that if the image gets moved or the file gets moved, things are gonna break, and I have caused issues exactly because of this)
[18:35:19] there are integration tests that prevent things from breaking that way
[18:35:31] phpunit will fail if you delete any of those files
[18:35:37] (thanks to this dep tracking)
[18:35:55] Amir1: Krinkle: i think i have an acceptable patch for RL, give me a minute
[18:37:53] Thanks!
[18:38:44] Amir1: Krinkle: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1126127
[18:45:06] MatmaRex: Thanks. Do you think it's okay to deploy it? It generally looks good to me, and if you have tested it, it should be fine
[18:46:34] Amir1: it works locally, but i am only vaguely familiar with this code
[18:47:13] I will make sure I monitor everything and test it in mwdebug too
[18:47:29] Amir1: it depends on the urgency, is the problem bringing down production right now? if yes, then the patch should be better
[18:47:42] it's a matter of risk
[18:47:53] this is responsible for 80% of the writes on a whole cluster
[18:48:07] if the Pope dies or something, this can very well bring down everything
[18:48:23] but we can go for a while without issues
[18:49:24] (the root cause of many outages has been one row getting locked over and over, which is the case here too)
[18:51:01] i think this is okay to deploy for today, but in the longer term, i'd like to look at this more closely
[18:51:16] because this fixes the problem, but i feel like the fix is in a weird place and done in a weird way
[18:51:30] Amir1: ^
[18:51:45] sounds good to me
[18:54:40] i also would like to find out when the bug was introduced
[19:06:49] number of tabs I just closed :D
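The mismatch found above can be illustrated with a small sketch: if the freshly discovered paths still contain ".." segments while the previously stored paths are normalized, an array_diff-style comparison always reports a difference, so the "only write when something changed" guard never suppresses the write and the row is rewritten on every request. This is not the actual Module::saveFileDependencies() code, just a hypothetical reconstruction of the effect, with a toy normalizePath() helper; normalizing before comparing (and storing) makes the no-change case a true no-op, which is the behaviour the fix aims to restore, however the real patch does it.

```php
<?php
// Hypothetical reconstruction of the effect, using the two paths from the debug log.

/** Collapse "." and ".." segments without touching the filesystem. */
function normalizePath( string $path ): string {
	$out = [];
	foreach ( explode( '/', $path ) as $seg ) {
		if ( $seg === '' || $seg === '.' ) {
			continue;
		}
		if ( $seg === '..' && $out && end( $out ) !== '..' ) {
			array_pop( $out );
		} else {
			$out[] = $seg;
		}
	}
	return implode( '/', $out );
}

$discovered = [ 'extensions/WikimediaBadges/resources/skins/../images/badge-silver-star.png' ];
$stored     = [ 'extensions/WikimediaBadges/resources/images/badge-silver-star.png' ];

// Buggy comparison: the lists never match, so a DB write happens on every request.
$changed = array_diff( $discovered, $stored ) || array_diff( $stored, $discovered );
var_dump( $changed ); // bool(true), even though nothing actually changed

// Normalizing both sides before comparing makes the no-op case actually a no-op.
$normalized = array_map( 'normalizePath', $discovered );
$changed = array_diff( $normalized, $stored ) || array_diff( $stored, $normalized );
var_dump( $changed ); // bool(false) — nothing to save
```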
[19:19:49] MatmaRex: do you like shiny graphs? https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=All&var-shard=x2&var-role=All&from=now-1h&to=now
[19:19:53] https://grafana.wikimedia.org/d/4plhqSPGk/bagostuff-stats-by-key-group?orgId=1&var-kClass=ResourceLoaderModule_dependencies&from=now-30m&to=now
[19:20:15] this also produces 1TB of binlog every month, that will be gone in 30 days too
[19:20:24] nice
[19:20:39] tell me how many dollars i saved :P
[19:20:52] MatmaRex: No bonus in your paycheck
[19:21:24] see it as a CO2 bonus, do some air travel :D
[19:23:00] we still have to keep three hosts for redundancy reasons, so we can't downscale it, but actually we were thinking of buying extra hosts since it was doing ten times more writes than a parser cache host (twice or thrice is okay, but not ten times)
[19:23:51] so you sorta saved 10K every year (two hosts would be 50K, and we have a five-year refresh cycle)
[19:25:23] heh. cool
[19:29:53] Amir1: the pope won't increase writes much. These are based on stable URLs after varnish coalesces misses. They might increase if the module changes, or e.g. a group2 deploy, or a varnish/ats eviction. Think of it more like Swift/thumbor.
[19:30:08] They're static resources.
[19:31:07] Unless the pope dying adds a new skin or language wiki. Which it might in terms of traffic, e.g. some wikis might go from zero to non-zero traffic and thus add (1) miss
[19:35:36] this RL module was being changed 8 times every second. I assume many CDN cache misses were due to logged-in users with non-default gadgets accessing wikis and thus triggering mw load.php to take care of it
[19:35:48] 8 times every second for enwiki only
[19:36:45] so if more logged-in users come online who don't use default gadgets, that can bring down production since they hit mw appservers. Am I missing something obvious?
[19:38:55] Since otherwise, there is no reason we'd get tens of x2 row changes every second for just a handful of RL modules
[20:05:28] Gadgets load in a separate request group. load.php doesn't vary by cookie or logged-in state.
[20:07:01] Amir1: (style modules * wikis * active languages/skins) ~ 30min CDN TTL
[20:08:31] Backend load.php requests correlate quite loosely to pageviews if al all
[20:08:42] if at all*
[20:16:24] Let's say 100 style modules * 200 of 900 wikis in play * 2 skins mobile/desktop * 10 languages in play = 400K per 30min, which is ~220 per second.
[20:16:42] But most should make zero writes since nothing changed
[20:16:51] A handful are affected by this bug.
[20:17:53] "In play" just means has >0 req per 30min.
[20:19:20] I hope that makes it sound at least plausible that it isn't based on volume
[20:20:23] A handful of modules, say 5, would with the above numbers average to 11/s
[20:46:01] modules=startup, which is the most referenced and has the shortest CDN TTL, does 14/s total across both DCs. https://grafana.wikimedia.org/d/000000067/resourceloader-module-builds?orgId=1
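Krinkle's back-of-envelope estimate above, written out as a sketch (all figures are the ballpark numbers from the discussion, not measurements):

```php
<?php
// Rough version of the estimate: backend load.php traffic is bounded by module
// variants times the CDN TTL, not by pageviews, and only the handful of modules
// hit by the path bug write on every one of their requests.

$styleModules = 100;      // style modules per wiki
$wikisInPlay  = 200;      // of ~900 wikis, those with >0 requests per TTL window
$skins        = 2;        // mobile/desktop
$languages    = 10;       // languages actually requested in that window
$ttlSeconds   = 30 * 60;  // ~30min CDN TTL

$backendRequests = $styleModules * $wikisInPlay * $skins * $languages; // 400,000 per TTL
printf( "~%d backend load.php hits per 30min = ~%.0f/s\n",
	$backendRequests, $backendRequests / $ttlSeconds ); // ~220/s

// Most of those make zero writes since nothing changed; a handful of buggy
// modules write on every request:
$buggyModules = 5;
$writesPerSec = ( $buggyModules * $wikisInPlay * $skins * $languages ) / $ttlSeconds;
printf( "~%.0f writes/s from ~%d affected modules\n", $writesPerSec, $buggyModules ); // ~11/s
```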