[19:50:07] Gemini said cscott: thoughts on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1240461? It seems like a clear improvement over the status quo with no downsides, and it fits well with the existing code. It doesn't need to be a complete solution to be a useful contribution. I'm eager to get this landed so I can send follow-up changes to add annotations to Scribunto and ParserFunctions.
[19:51:12] * ori might be huffing the Gemini glue today because of the 3.1 release
[19:53:06] it was on my list of patches to review this week, for sure.
[19:53:59] it seems like a pretty reasonable approach
[19:56:31] Krin.kle has this cool gizmo that surfaces performance data in the UI: https://meta.wikimedia.org/wiki/User:Krinkle/Scripts/Perf.js
[19:56:58] Timo has so much cool performance stuff
[19:57:25] I was thinking it could be augmented to mark low-TTL pages and the reason "as-you-browse"
[19:57:43] for interested WikiGnomes
[19:58:10] I suspect we need something more like [[Category:Low_TTL_pages_you_should_fix]] to allow wikignomes to monitor and act on this longer term, but getting into Timo's scripts isn't a bad start.
[19:59:35] my worry is that we'll fix all the worst offenders and then someone will cut-and-paste some code which lowers TTL into a commonly used template and no one will notice, so I would like to make it easy to verify that the number of low-TTL pages isn't "too bad".
[20:06:26] IIRC Timo said the set of pages affected by the Module:Date bug was 2-3% of all enwiki pages alone. That means a bug like that being introduced means we start adding a quarter million articles to [[Category:Low_TTL_pages_you_should_fix]]
[20:09:25] yeah i know, that's the hard part right now. But I think the idea is that that category should never be allowed to grow that large.
[20:09:30] anyway, gave you a review
[20:35:00] thanks for the review.
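(A monitoring check like the one discussed above could be sketched as a periodic query of the tracking category's size. The category name is hypothetical and doesn't exist yet; the `api.php` endpoint and the `prop=categoryinfo` query are real MediaWiki API features, and the threshold is an arbitrary illustration.)

```python
import json
import urllib.parse
import urllib.request

# Hypothetical tracking category from the discussion above.
CATEGORY = "Category:Low_TTL_pages_you_should_fix"
THRESHOLD = 10_000  # arbitrary: page count above which the category is "too bad"

def build_url(api="https://en.wikipedia.org/w/api.php"):
    """URL for a categoryinfo query against the tracking category."""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "categoryinfo",
        "titles": CATEGORY,
        "format": "json",
        "formatversion": "2",
    })
    return f"{api}?{params}"

def extract_size(response):
    """Pull the member-page count out of a decoded API response."""
    page = response["query"]["pages"][0]
    # An empty or nonexistent category may lack the "categoryinfo" key.
    return page.get("categoryinfo", {}).get("pages", 0)

def check_category():
    """Fetch the current size and classify it against the threshold."""
    with urllib.request.urlopen(build_url()) as resp:
        n = extract_size(json.load(resp))
    return ("ALERT" if n > THRESHOLD else "OK", n)
```

Run from cron (or a Prometheus textfile exporter), this would make the "category should never grow that large" invariant checkable by machine rather than by vigilant wikignomes alone.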
Re: "But I think the idea is that that category should never be allowed to grow that large":
[20:35:55] the issue is fundamentally that a single edit on a single module can amplify to hundreds of thousands of affected pages; no amount of vigilant monitoring of the category will keep it from periodically exploding
[20:37:16] actions with that kind of blast-radius potential deserve more direct monitoring. like maybe an icinga alert if the p50 TTL on parser cache saves drops below some threshold (say)
[20:55:45] I feel like... we should address that at the source: just not permit pages with TTL lower than X value (maybe that's 24 hours) in article namespace, user/user talk....
[20:57:47] there is enforcement of minimums, but that only saves us somewhat. We want most TTLs to be substantially beyond the minimum
[20:58:39] fair enough
[21:02:12] oh, btw, speaking of parsing. The other day Krin.kle was wondering whether Tim's change to the frequency scaling governor (essentially ensuring that the app servers run at the highest clock frequency they can sustain, power savings be damned) was ported over to k8s. (It was.) This was a really big win for user latency. The nice thing about clock speed is you can just buy it.
[21:03:07] But of course Amdahl's law, yadda yadda yadda: just because an X% increase in clock speed delivered Y latency win doesn't mean another X% increase will do the same. So I benchmarked a parse of [[Barack Obama]] and... I don't have the results on this computer, but parsing is still very substantially CPU bound
[21:03:43] it would be nice to see the numbers when you get a chance. I guess there is a phab task someplace?
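(The Amdahl's-law caveat above can be made concrete: if only a fraction p of request latency is CPU-bound, a clock-speed increase of factor s shrinks only that fraction, and repeated increases yield diminishing returns. The numbers below are made up purely for illustration.)

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the work is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Illustrative only: suppose 90% of parse latency is CPU-bound (p = 0.9)
# and clocks get 20% faster (s = 1.2).
first_bump = amdahl_speedup(p=0.9, s=1.2)   # ~1.18x overall

# Applying the same +20% clock bump again helps less each time, and the
# total achievable gain is capped at 1 / (1 - p) no matter how fast the CPU:
ceiling = amdahl_speedup(p=0.9, s=float("inf"))   # 10x cap when p = 0.9
```

This is also why measuring the CPU-bound fraction (as with the [[Barack Obama]] parse benchmark) matters before arguing for a hardware refresh: p determines how much headroom faster cores can actually buy.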
[21:04:00] I wonder if anyone's well-positioned to make the case to the board / community that $X capex could easily buy us a Y latency win
[21:04:22] if we have the data, we can probably find the right channels
[21:04:25] I can dig up the phab task about the original win, but there's nothing in there IIRC about the remaining headroom to be captured from faster CPUs
[21:04:43] that's what we'd want, yeah
[21:05:42] and since latency for logged-in users is apparently on the table for next year's work....
[21:07:20] IIRC the app servers are Xeons
[21:07:22] https://phabricator.wikimedia.org/T315398
[21:07:58] added to my perf pile
[21:08:01] https://phab.wmfusercontent.org/file/data/4knj3rq6tnbepl2xwrae/PHID-FILE-lsjv4slizloxzrlmvsee/scaling_governor_rollout.png
[21:11:50] happy to collaborate on something if you draft a task or a page on wikitech or something. If motivating a full refresh on app server HW is too tall an order (though it shouldn't be; I think in a ranking of WMF line items by dollar-to-user-benefit ratio, it might be near the top of the list), a very modest goal might be to procure *one* machine with faster cores, pool it, and measure the difference
[21:14:26] I'm not at all involved in budget these days, I guess Krink le or Due sen might have that connection though
[21:15:22] there's a call out for community feedback on the annual plan somewhere on Meta, I was considering writing something there but haven't had the time
[21:21:33] https://meta.wikimedia.org/wiki/Talk:Wikimedia_Foundation_Annual_Plan/2026-2027
[21:57:36] GitHub ignores NVD "Disputed" status for CVE-2025-45769, so now Composer's new audit feature globally blocks installation of firebase/php-jwt v6...
https://github.com/firebase/php-jwt/issues/620 via https://fosstodon.org/@shawnhooper/116095322301435230
[22:07:15] The latest mw appserver (wikikube-workerXXXX), per https://codesearch.wmcloud.org/puppet/?q=wikikube-worker2&files=&excludeFiles=&repos=, is
[22:07:18] wikikube-worker2330
[22:07:38] Which brings me to T384970
[22:07:39] T384970: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970
[22:08:34] child of (private) procurement task T382899, https://phabricator.wikimedia.org/T382899, dated Jan 2025
[22:09:36] I've subbed you both in case you don't have access
[22:12:08] Hmm, that probably didn't work for Or.i, there's a separate space restriction
[22:14:38] so still Xeon (and I have access to these regardless)
[22:16:42] Right, sre/procurement group
[22:16:53] yeah, no access
[22:23:08] re: data on CPU-boundedness, you can also discern it from the very close similarity of the CPU-time and wall-clock-time flamegraphs:
[22:23:11] CPU: https://performance.wikimedia.org/arclamp/svgs/daily/2026-02-19.excimer.index.svgz
[22:23:17] wall: https://performance.wikimedia.org/arclamp/svgs/daily/2026-02-19.excimer-wall.index.svgz
[22:36:20] Yeah. I'm kind of surprised. I was gonna say that most GET requests to page views and the API, where no parsing happens, would be dominated by I/O, but I'm not sure, because POSTs and edits are only a small portion of requests. They are of course over-represented here given sampling, but even then, how much?
[22:39:23] Heh, I concluded the same before already. T140664
[22:39:24] T140664: Achieve predictable MediaWiki routing and cacheable skin data - https://phabricator.wikimedia.org/T140664
[22:39:41] And T302623
[22:39:41] T302623: FY2022-2023: Improve Backend Pageview Timing - https://phabricator.wikimedia.org/T302623
[22:40:10] There's a lot of stuff in the frontend that computes the same over and over because we assume CDN cache.
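(The flamegraph comparison above rests on a simple signal you can also measure in-process: for CPU-bound work, CPU time tracks wall-clock time closely, so the two profiles look alike; for I/O-bound work, wall time dwarfs CPU time. A minimal, self-contained illustration of that signal:)

```python
import time

def profile(fn):
    """Return (wall_seconds, cpu_seconds) for a single call to fn."""
    w0, c0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - w0, time.process_time() - c0

def cpu_bound():
    # Pure computation: the process burns CPU the whole time.
    sum(i * i for i in range(2_000_000))

def io_bound():
    # Sleeping stands in for waiting on a database or the network.
    time.sleep(0.2)

wall, cpu = profile(cpu_bound)
cpu_ratio = cpu / wall    # near 1.0: the CPU and wall "flamegraphs" match

wall, cpu = profile(io_bound)
io_ratio = cpu / wall     # near 0.0: almost all wall time is spent waiting
```

When the excimer and excimer-wall profiles are nearly identical, as in the linked SVGs, the same conclusion follows at fleet scale: the sampled MediaWiki requests are spending their wall time on the CPU, not waiting on I/O.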
[22:40:37] We could cache more of that. Move more behind ParserCache and simplify the skin overall.
[22:41:08] Also various tasks to speed up computation, e.g. libMustache
[22:41:52] Instead of the current lightncandy, which we use in a way that counteracts its core design and so isn't the speed-up over PHP-Mustache that we selected it for.
[22:42:42] (We store the compiled PHP code in APCu with an HMAC, instead of on disk where opcache would cache it. TL;DR: given that, skipping compilation and just interpreting at runtime directly might be faster.)
[22:43:23] there's a lot that you could do if you could buy more Krinkles, but you can't
[22:43:35] but you can buy faster cores :)
[22:44:47] anyway they're not mutually exclusive
[22:44:58] optimizing code and getting faster hardware
[22:45:07] not the cheapest time to be buying new hardware
[22:45:16] Yeah, I don't think hardware gets us from 250ms to 50ms, but a combo might
[22:45:50] Reedy: don't make me pull out the budget
[22:46:19] I heard about GPUs and RAM, what about CPUs? They going nuts as well? Wouldn't surprise me with data center build ours everywhere, I guess it's all downstream effects.
[22:46:32] build outs*
[22:46:33] SSDs are
[22:46:37] mostly RAM and SSD, yeah
[22:47:25] It's suspected that CPU prices will be affected too in the (near) future
[22:47:26] unfortunately
[22:56:49] https://www.theregister.com/2026/02/04/server_cpus_memory_shortage/ fwiw
[22:57:22] tl;dr: analyst firm predicts; we don't know
[22:58:50] https://www.reuters.com/world/china/intel-amd-notify-customers-china-lengthy-waits-cpus-2026-02-06/ this has a bit more meat to the story
[23:00:50] still, "let's buy one server and do tests" seems viable
[23:02:48] +1