[14:29:06] What tools does the QTE team have at its disposal to monitor performance on the beta cluster?
[14:29:08] I see we have Grafana, but I am struggling to understand what it is telling me.
[14:29:11] I am wondering if it might be possible in the future to catch issues like T300075 earlier.
[14:29:12] T300075: P75 Edit response time has almost doubled since Jan 11, 2022 - https://phabricator.wikimedia.org/T300075
[14:43:05] dwalden: we do not run any edit performance tests at the moment. I did some tests a long time ago but it was a pain since users/servers got banned. I'm not sure how stable beta is to run tests, maybe it would work?
[14:46:09] I don't know how you run those tests, so I am not sure.
[14:46:20] For the most part beta is stable (there is occasional downtime)
[14:46:45] Whether doing performance testing on a VM is going to give you valid data is another question.
[14:47:20] When you say banned, do you mean blocked on the wiki or something else?
[15:29:19] Afaik latencies vary too much in beta (by design, more or less) for ad hoc testing of something like a full edit to provide an alertable time series, and there aren't enough changes for RUM data. We test it in beta for functional reasons (to confirm the instruments work before prod), which is useful, but not for the actual data itself
[15:29:35] We could have alerted on save timing in prod for group 0
[15:29:55] And probably did, but we missed it due to other ongoing alerts and incidents
[15:31:06] That'll be something to confirm for sure, or set up if not. We plot it in Grafana by group 1 and 2 already, with multiple alerts
[15:33:22] Though depending on how we define "earlier", even our existing alert for group 2 at the end would have fired on Jan 11 or 12, which is 14 days earlier already
[15:35:16] RUM == real user metrics? You mean there aren't enough users doing edits on beta to provide useful data? That makes sense.
[15:35:26] yep
[15:36:04] dwalden: when you set up a bot or other process that continually targets a prod wiki with edits, it's very likely to get blocked by community admins as an unapproved bot as per our terms and guidelines. We can ask for the user account to be in the bot user group, I'm sure that'd be no problem
[15:38:46] I wasn't thinking about writing a bot to target prod, as the Performance team obviously already have perf monitoring on prod sorted out.
[15:40:41] I could write one to target beta, but there would probably be too many false positives.
[15:41:31] Well, that's kind of you to say, but we don't have synthetic edit performance testing right now, only RUM, which isn't very detailed. E.g. RUM with a global average aggregate https://grafana.wikimedia.org/d/000000143/navigation-timing?orgId=1 vs 1 specific browser/wiki/page repeatedly with a high-detail profile https://grafana.wikimedia.org/d/000000282/webpagereplay-drilldown?orgId=1
[15:43:22] The latter helps us find smaller regressions, with more detail to investigate and fewer false positives (not influenced by external trends, see the variables at https://timotijhof.net/posts/2018/measuring-wikipedia-page-load-times/ )
[15:44:25] E.g. a holiday in a country with on-average slower connections, or a trending topic with an above-average article
[15:47:54] dwalden: we've so far tried to narrow it down by getting alerts sooner (e.g. group 1 instead of 2), and to get more stable benchmarks in CI even before beta, so basically from both sides of beta but not beta itself. Fresnel could perhaps be used to also perform edits and get some baseline results in CI once an hour or so. This would ideally be fairly stable, at the cost of not being fully production-configured like beta, but catch at
[15:47:55] least the easy issues
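To make the synthetic edit-timing idea above concrete, here is a minimal sketch of what an hourly probe could look like; it is not an existing Fresnel scenario. The beta wiki URL and sandbox page are placeholders, and a real probe would log in with a bot account (as discussed above) and report to a metrics backend rather than printing a line.

```php
<?php
// Minimal sketch of a synthetic edit-timing probe (hypothetical, not an
// existing tool). It performs one edit via the Action API and reports the
// wall time of the save request only, since the save is what regressed.

$api = 'https://en.wikipedia.beta.wmflabs.org/w/api.php'; // placeholder wiki
$cookies = tempnam( sys_get_temp_dir(), 'editprobe' );

function apiRequest( string $api, string $cookies, array $params, bool $post = false ): array {
	$params['format'] = 'json';
	$ch = curl_init( $post ? $api : $api . '?' . http_build_query( $params ) );
	curl_setopt_array( $ch, [
		CURLOPT_RETURNTRANSFER => true,
		CURLOPT_COOKIEJAR => $cookies,  // keep the session so the token stays valid
		CURLOPT_COOKIEFILE => $cookies,
	] );
	if ( $post ) {
		curl_setopt( $ch, CURLOPT_POSTFIELDS, http_build_query( $params ) );
	}
	$body = curl_exec( $ch );
	curl_close( $ch );
	return json_decode( (string)$body, true ) ?? [];
}

// Fetch a CSRF token first so token acquisition is excluded from the timing.
$tokenRes = apiRequest( $api, $cookies, [ 'action' => 'query', 'meta' => 'tokens' ] );
$token = $tokenRes['query']['tokens']['csrftoken'] ?? '+\\';

$start = microtime( true );
$editRes = apiRequest( $api, $cookies, [
	'action' => 'edit',
	'title' => 'Project:Sandbox',            // placeholder page
	'appendtext' => "\nedit probe " . date( 'c' ),
	'summary' => 'synthetic edit timing probe',
	'token' => $token,
], true );
$elapsedMs = ( microtime( true ) - $start ) * 1000;

// One data point per run; scheduled hourly this yields a rough baseline series.
printf( "edit_save_time_ms %.0f status=%s\n",
	$elapsedMs, $editRes['edit']['result'] ?? 'error' );
```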
[15:48:27] That's something we don't do yet for edits.
[15:54:32] Any reason we haven't done the performance testing of edits yet?
[15:54:48] I could offer my services, but I don't have any skills your team does not already have :)
[15:55:17] (testing of synthetic edits, I mean)
[16:29:30] Oh, on beta would it make sense to record metrics which don't depend on latency, such as writes/reads to the DB per minute? Or is the usage profile not consistent enough?
[16:30:20] (and an individual tester could potentially find this out for themselves by parsing the logs)
[16:38:39] (I understand that T300075 was caused by too many reads to the database)
[16:38:40] T300075: P75 Edit response time has almost doubled since Jan 11, 2022 - https://phabricator.wikimedia.org/T300075
[17:21:04] dwalden: yeah, that could work, providing ongoing insight into a baseline and then noting the abnormalities
[17:21:45] Although this task depended on the user in question having edits on many different wikis within the same cluster
[17:22:13] For most users it didn't do a lot, e.g. see m:Special:CentralAuth
[17:22:39] But those with accounts on virtually all 800 wikis would've triggered these extra queries on each of their edits
[21:27:35] TimStarling: did you have a patch in progress for the removal of titleformatter from linkbatch? I didn't see it, but that sounds like a good idea to me. RE: https://phabricator.wikimedia.org/T300311#7658258
[21:48:54] no patch in progress, let me have another look at it
[22:09:15] so the actual reason for the TitleFormatter is the fact that the public methods isBadLink() and getGoodLinkID() accept a string as a parameter, presumed to be the PDBK
[22:10:13] this is implemented by passing the string through to getCacheKey() with passThrough=true
[22:11:14] ah, this is in LinkCache, I forgot to follow the references there, and that's implied stable/public I guess.
[22:11:43] not sure how widely that is interacted with directly vs just priming via LinkBatch. It does support other param types already
[22:12:04] in 1.36 these methods were changed from accepting string only to string|LinkTarget|PageReference
[22:12:32] to remove TitleFormatter we would first have to hard-deprecate string parameters
[22:13:23] no migration was done in 1.36 so there are still plenty of callers that use strings
[22:14:15] I'll write this on the task also
[22:18:58] This work started with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/697803 for T278940
[22:18:58] T278940: Make LinkCache usable from PageStore - https://phabricator.wikimedia.org/T278940
[22:19:39] Interesting, so that consumes LinkCache directly, comparable to what Title does. Makes sense.
[22:19:48] That'd be the one "reasonable" consumer outside Title.
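A standalone toy (not MediaWiki's actual LinkCache, and not an existing patch) illustrating the hard-deprecation step being discussed: object parameters are keyed directly, while the legacy string (PDBK) form still works but warns, so remaining string callers can be found and migrated before the TitleFormatter dependency is dropped. The class and method names below are simplified stand-ins, and trigger_error() stands in for MediaWiki's wfDeprecatedMsg().

```php
<?php
// Toy sketch of hard-deprecating the string form discussed above.

class ToyLinkCache {
	/** @var array<string,bool> cache keys known to be "bad" (nonexistent) links */
	private array $badLinks = [];

	/**
	 * @param object|string $page A LinkTarget/PageReference-like object,
	 *  or (deprecated) a prefixed DB key string
	 */
	private function getCacheKey( $page, string $caller ): string {
		if ( is_string( $page ) ) {
			// Today this is the passThrough=true case; the string form is
			// what forces the class to keep a TitleFormatter, so that object
			// callers produce matching keys.
			trigger_error( "Passing a string to $caller is deprecated", E_USER_DEPRECATED );
			return $page;
		}
		// Once string callers are gone, a key derived from namespace + DB key
		// would be enough and no TitleFormatter would be needed.
		return $page->getNamespace() . ':' . $page->getDBkey();
	}

	public function addBadLink( $page ): void {
		$this->badLinks[$this->getCacheKey( $page, __METHOD__ )] = true;
	}

	public function isBadLink( $page ): bool {
		return $this->badLinks[$this->getCacheKey( $page, __METHOD__ )] ?? false;
	}
}
```

In core the warning would presumably come from wfDeprecatedMsg() with the target release, and development settings that escalate deprecation warnings would surface the remaining string callers in CI.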
[22:44:55] flame graphs are nice, but they show a throughput perspective on the system and so tend to encourage throughput optimisations
[22:45:43] I've been thinking -- if we set an excimer timer to go off 1 second after request start, and made a flame graph from that, we would have a view of the causes of >1s latency
[22:46:41] or choose some other threshold, or do it for multiple thresholds
[22:47:43] point is, by starting the timer at zero instead of with a random offset, you get a snapshot of user-visible latency rather than CPU cost
[23:06:18] I like the idea, yeah, we could add it as a third channel after -cpu and -wall, e.g. -wall-slow or some such.
[23:08:54] we could also change the threshold by entrypoint, maybe, depending on what we want from the "all" variant. E.g. we could do a much lower one for rest.php and load.php GET requests, closer to 100ms
[23:13:06] I wonder what that would mean for optimising startup, e.g. if most requests are within 100ms and not captured, but spend most of it in WebStart.php, and the ones going to 200ms are only showing data after 100ms, that wouldn't identify WebStart if that were a significant portion. So maybe we'd split it into time before 1s and time after 1s, rather than anything vs after 1s. Not sure if that's worthwhile or not.
[23:14:45] startup time is pretty well visible in the existing flame graphs
[23:17:38] but it's true that sampling at 1s offset from request start does introduce some biases
[23:17:40] true, but I sometimes wonder how significant it is, though I may be forgetting a correlation that we can rely on for this, e.g. if flame graphs see 10% in WebStart, that might be okay if most requests being profiled are very fast. We do have latencies plotted by buckets and quantiles, which tell us p50 and up, but I still doubt myself at times
[23:19:12] I suppose it puts a cap on its impact. If p70 is 150ms then 10% is not going to represent more than 15ms, but that's a lot of room for error, I think. Anyway, it's not that important.
[23:19:28] It has enough detail and, more importantly, it's the same for all requests in a given entrypoint, more or less
[23:19:35] we have the percentile metrics but afaik there is not a logging view of them, it's not obvious why a request is slow
[23:20:29] yeah, and doing what you propose would help us catch that, assuming a low enough threshold and the slow thing included in the "extra" time, which it should be indeed for all but startup slowness.
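One possible shape for that idea, as a minimal sketch rather than a deployed patch: arm a one-shot timer at request start so that only requests still running at the threshold record where they are. This assumes php-excimer's ExcimerTimer one-shot API; the threshold, output path, and collapsed-stack formatting are placeholders, and a real implementation would feed the existing redis/Arc Lamp pipeline as a third channel alongside -cpu and -wall.

```php
<?php
// Sketch: capture "where slow requests are at t=1s" (hypothetical -wall-slow).

$slowThreshold = 1.0; // seconds; could differ per entry point (e.g. ~0.1 for load.php)

$timer = new ExcimerTimer();
$timer->setEventType( EXCIMER_REAL );   // wall-clock time, not CPU time
$timer->setInterval( $slowThreshold );  // one-shot, relative to request start
$timer->setCallback( static function () {
	// Only requests still running at the threshold ever get here, so the
	// captured stack is a sample of what slow requests are doing at t=1s.
	$frames = [];
	foreach ( array_reverse( debug_backtrace( DEBUG_BACKTRACE_IGNORE_ARGS ) ) as $frame ) {
		$frames[] = isset( $frame['class'] )
			? $frame['class'] . '::' . $frame['function']
			: $frame['function'];
	}
	// Collapsed "a;b;c 1" format, as used for flame graph input; a real
	// implementation would send this to redis rather than a local file.
	file_put_contents(
		'/tmp/wall-slow.log',
		implode( ';', $frames ) . " 1\n",
		FILE_APPEND
	);
} );
$timer->start();

// ... normal request execution follows; if it finishes in under 1s the timer
// never fires and nothing is recorded.
```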
[23:21:01] looks like we're currently at about 60% capacity on /srv/ for webperf1002
[23:23:24] 99% of that is in /srv/xenon/logs
[23:23:34] 158GB of 159GB
[23:24:45] yep, we only need to keep a few days' worth, I think we're doing 3-5 days or so now
[23:24:54] the flame graphs we keep longer and persist to swift
[23:25:03] although our frontend isn't ready for that yet
[23:25:53] https://phabricator.wikimedia.org/T244776
[23:33:29] ok, now I am distracted by low-hanging fruit in the existing flame graphs
[23:33:39] can you grab time since request start from each thread, and record that with each sample?
[23:33:42] did you know Skin::getLanguages() is 4% of index?
[23:34:14] that way you can slice and dice after the fact
[23:38:11] hi ori, yes MW could send that information easily enough, there would just be more work to do on the slicing and dicing side
[23:38:46] currently MW is sending a single semicolon-separated list of stack frames to redis, which is not a very extensible format
[23:39:12] looks like it was 1.5% last year https://performance.wikimedia.org/arclamp/svgs/daily/2022-01-13.excimer-wall.index.svgz?s=Skin%3A%3AgetLang
[23:39:55] er, that's not last year. hm right, we don't expose those
[23:40:21] last month
[23:40:32] also you used wall time and I used CPU time
[23:41:22] I'm bisecting
[23:41:55] looks like it doubled on the 28th
[23:43:02] Yeah, many of the skin aspects are the result of DI refactoring, which has lost much of the caching that SkinTemplate used to do.
[23:43:19] For a few months it computed everything three times: in Skin, then SkinTemplate, then SkinMustache
[23:43:47] getCategoryLinks changed to become a portlet, and seems now dependent on re-building the sidebar
[23:46:46] https://grafana-rw.wikimedia.org/d/QLtC93rMz/backend-pageview-response-time?orgId=1&from=now-2y&to=now
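To make the earlier point about recording time since request start with each sample concrete, one hypothetical envelope is sketched below. The field names, example stack, and 1s cutoff are invented for illustration; today MW sends only the bare semicolon-separated stack to redis.

```php
<?php
// Hypothetical per-sample envelope allowing slicing after the fact.

// Producer side (conceptually inside MW's sample flush): wrap each sample.
$sample = [
	'stack'  => 'index.php;MediaWiki::run;Skin::getLanguages', // collapsed frames, root first
	'offset' => 1.73,        // seconds since request start when the sample was taken
	'entry'  => 'index.php', // entry point, enabling per-entry-point thresholds later
];
$line = json_encode( $sample );

// Consumer side (aggregation): slice after the fact, e.g. keep only samples
// taken at least 1s into a request and re-emit them in the existing collapsed
// "stack count" format for flame graph generation.
$decoded = json_decode( $line, true );
if ( $decoded['offset'] >= 1.0 ) {
	echo $decoded['stack'] . " 1\n";
}
```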