[04:04:58] 10serviceops, 10SRE, 10docker-pkg, 10Patch-For-Review, 10Release Pipeline (Blubber): Container image lifecycle management - https://phabricator.wikimedia.org/T287130 (10RLazarus) Status update, leaving this for the end-of-year break: [[ https://gerrit.wikimedia.org/r/748876 | 748876 ]] and [[ https://ger...
[06:51:35] 10serviceops, 10SRE, 10docker-pkg, 10Patch-For-Review, 10Release Pipeline (Blubber): Container image lifecycle management - https://phabricator.wikimedia.org/T287130 (10Joe) >>! In T287130#7585276, @RLazarus wrote: > @joe: My recollection is you were going to take care of the blubber and docker-pkg part...
[09:04:25] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10Joe) a:05Joe→03None
[12:07:36] 10serviceops, 10Security-Team, 10GitLab (CI & Job Runners), 10Patch-For-Review, and 2 others: Setup GitLab Runner in trusted environment - https://phabricator.wikimedia.org/T295481 (10Dzahn) I agree that option 1 sounds misleading and not great and option 5 sounds overly complex / brittle. Fully on the sa...
[14:59:23] 10serviceops, 10SRE, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto)
[16:40:55] https://twitter.com/ProssimoISRG/status/1473354582306742273 ... <-- legoktm
[17:02:46] <_joe_> finally a kernel without bugs
[17:16:38] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (Yak Shaving 🐃🪒), 10Upstream: Self-reported GitLab SSH host key fingerprints don't appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (10Dzahn) I made this new page that shows all fingerprints in a cen...
[17:18:34] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (Yak Shaving 🐃🪒), 10Upstream: Self-reported GitLab SSH host key fingerprints don't appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (10Dzahn) {F34892980}
[17:20:23] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (Yak Shaving 🐃🪒), 10Upstream: Self-reported GitLab SSH host key fingerprints don't appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (10Dzahn)
[17:21:58] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (Yak Shaving 🐃🪒), 10Upstream: Self-reported GitLab SSH host key fingerprints don't appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (10Dzahn) The part we haven't talked about yet is that also for the...
[17:27:13] whew, for a minute I thought you were actually not sarcastic with that comment and then I remembered who was writing it
[17:27:27] (I had also not clicked through on the tweet, don't judge too harshly)
[18:02:25] <_joe_> apergos: it's about RIIR
[18:07:07] _joe_, legoktm or whoever can answer this ... so, i am trying to make sense of something here ... so, parsoid eqiad cluster has 24 servers and the mw appserver cluster has 70 servers.
[18:07:30] <_joe_> subbu: correct
[18:07:31] even though parsoid is slower ... so what is the appserver cluster doing besides parse pages.
[18:08:02] <_joe_> subbu: respond to every non-cached-or-cacheable-at-the-edge request
[18:08:09] <_joe_> which includes all logged in users
[18:08:23] so, if we move to parsoid serving all read views, i am trying to see if we need to scale the parsoid cluster according to 70 x (parsoid expected slowdown)
[18:08:25] <_joe_> in fact, the cache hit-ratio in parsercache is pretty high
[18:08:32] 80% only. not that high. :)
[18:09:01] <_joe_> so, assuming the parsoid code does the parsing, and that would be slower and consuming more memory
[18:09:12] <_joe_> we would need to add computing power
[18:09:20] <_joe_> how much, we can calculate
[18:09:31] <_joe_> you're saying it's 80% slower in parsing?
[18:09:35] https://docs.google.com/spreadsheets/d/1LHsq1ry2GtSa62kjSb6QFNbBSn1YhE3wvBp1gnM0V1s/edit#gid=0 is where scott and i are working.
[18:09:44] <_joe_> so, a lot of the parsing happens on jobrunners btw
[18:09:52] _joe_, no no .. i said parsercache is 80% hit rate.
[18:10:26] <_joe_> oh sorry, yes
[18:10:36] <_joe_> well 80% *after* wancache though
[18:10:43] <_joe_> I think the real number is much higher
[18:11:15] ok.
[18:12:10] <_joe_> it's a two layer cache, so the total hit ratio is wancache_hr + (1-wancache_hr)*parsercache_hr
[18:13:43] ok .. so, based on legoktm's benchmarking and some analysis scott and I are doing, i think we can probably get away with 50% more servers (conservatively) and in practice, we can likely do much better with additional work.
[18:14:14] we may have to run more benchmarks in jan and feb after fresh production deploys and a wider set of pages.
[18:14:53] <_joe_> subbu: we also have to see if php 7.4 gives us more advantages
[18:15:38] true, there is that as well. so, i feel comfortable saying 50% more now and in a couple months, we'll probably have better estimates.
[18:16:14] <_joe_> 50% more might not be accurate, it might be 50% more for parsing
[18:16:20] we still have to figure out how to deal with actual raw wall clock latency being much higher (2x or more) on p95+ size pages.
[18:16:30] <_joe_> and I can evaluate the number of parses/s quite accurately
[18:17:26] appserver cluster is all parses ...
[18:17:48] <_joe_> not really :)
[18:18:06] <_joe_> a lot of requests are for load.php for instance, and many are for parsed articles
[18:18:27] <_joe_> now, how much cpu time is dedicated to parsing can be evaluated from the flamegraphs though
[18:19:06] ah, ok .. so, 50% more servers is probably even more conservative than I thought then.
[18:19:57] <_joe_> https://performance.wikimedia.org/arclamp/svgs/daily/2021-12-21.excimer.index.svgz says 17% of the cpu time is spent in Parser::parse
[18:22:26] ok .. so, ~12 of the mediawiki appcluster servers are currently handling all parses.
[18:22:27] <_joe_> on appservers
[18:23:32] <_joe_> more or less, yes
[18:23:41] <_joe_> there's also the jobrunners though
[18:23:46] <_joe_> oh, for comparison
[18:24:27] <_joe_> on parsoid, 65% of the time is spent on Parsoid::wikitext2html
[18:24:37] <_joe_> that's 65% of the cpu we actually use
[18:24:43] <_joe_> which is not that much
[18:25:11] parsoid cluster also does a lot of other things .. html2wt ... lang variant conversion, but wt2html is the majority.
[18:25:44] <_joe_> yeah what I mean is that the 24-node parsoid cluster already handles all the parsing that appservers + apis + jobrunners do
[18:25:52] but, that is all handled by the 24-node parsoid cluster. right.
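
To make the arithmetic above concrete, here is a rough Python sketch of the two calculations _joe_ describes: the combined hit ratio of the two cache layers, and how much appserver capacity the "17% of cpu time in Parser::parse" figure corresponds to. The 80%, 17% and 70-server numbers are the ones quoted in the conversation; the wancache hit rate and the 1.5x slowdown factor are illustrative assumptions only.

```python
# Back-of-envelope sketch of the numbers discussed above.
# wancache_hr and parsoid_slowdown are assumptions; the rest are figures
# quoted in the conversation.

wancache_hr = 0.50        # assumed WANObjectCache hit rate (not stated above)
parsercache_hr = 0.80     # parsercache hit rate quoted by subbu

# Two-layer cache: a request only reaches parsercache on a wancache miss,
# so the combined hit ratio is wancache_hr + (1 - wancache_hr) * parsercache_hr.
total_hr = wancache_hr + (1 - wancache_hr) * parsercache_hr
print(f"combined hit ratio: {total_hr:.0%}")

# Share of appserver capacity spent in Parser::parse (from the flamegraph).
appservers = 70
parse_cpu_share = 0.17
parse_equivalent_servers = appservers * parse_cpu_share
print(f"~{parse_equivalent_servers:.0f} appservers' worth of CPU goes to parsing")

# If Parsoid is slower per parse, only the parsing share scales by that factor.
parsoid_slowdown = 1.5    # assumed "50% more for parsing" figure
extra_capacity = parse_equivalent_servers * (parsoid_slowdown - 1)
print(f"extra capacity needed: ~{extra_capacity:.0f} servers' worth")
```

With these assumed inputs the combined hit ratio comes out around 90% and the parsing share around 12 servers' worth, which matches the "~12" figure mentioned in the conversation and shows why "50% more servers" overall is a conservative upper bound.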
[18:25:53] <_joe_> they just save it somewhere else
[18:26:49] <_joe_> actually, I think if we just subbed in the call to Parser::parse() with a remote call to the parsoid cluster, and at the same time turned off restbase caching, we would see a similar load to what we see today
[18:27:18] we think parsoid might be about 23% slower on the work load, very roughly.
[18:27:21] <_joe_> subbu: anyways, I plan to try to move a fraction of parsoid traffic to k8s as soon as we've solved the last shellbox migration
[18:27:27] anyway, it looks like we probably just upgrade to newer hardware so that raw wall clock latencies improve .. plus continue to do additional perf work in parsoid to improve latencies for p90+
[18:27:43] as a strategy for parsoid read views.
[18:27:43] <_joe_> yes
[18:27:53] is "a remote call to the parsoid cluster" a reasonable thing?
[18:27:54] <_joe_> and we go to 7.4 for parsoid sooner than later
[18:27:54] ok, good, thanks .. this is much better than what i feared.
[18:28:07] <_joe_> cscott: if it's just for uncached views, maybe?
[18:28:42] I would *like* to keep the parsing load on the parsoid cluster, but i don't want to introduce new network/bandwidth/latency nightmares -- can we reasonably redirect this load, or are we going to end up with parsoid load on the appservers as we integrate parsoid read views?
[18:28:44] <_joe_> it would allow us to unload the heavy lifting to a separate cluster that could run with better cpus / larger memory limits / reserved hardware / specialized php configuration
[18:29:07] is there a way to pin certain jobs to a certain cluster in jobrunner?
[18:29:43] we're thinking about doing a post-edit render to preload parser cache in the same way that restbase does to reduce latency, and (at least initially) that wouldn't have to be done synchronously
[18:29:55] <_joe_> sure
[18:29:59] we could spawn a jobrunner job to do it, and (maybe) pin it to the parsoid cluster?
[18:29:59] <_joe_> and that can work the same way
[18:30:12] <_joe_> sure, changeprop can do it
[18:30:26] <_joe_> we can add a rule that sends that job to a different cluster
[18:31:22] <_joe_> the huge advantage over the old parsoid-in-nodejs model is that now 1 call to parsoid is enough to get a full render of the page, without background calls to evaluate luasandbox etc
[18:31:50] <_joe_> I'll try to write something down tomorrow, now I'd go afk as it's almost 8 pm (dinner time here :))
[18:31:52] another big benefit is that there is no user-specific rendering in parsoid, that's all done as postprocessing
[18:32:13] so once we're switched over there basically shouldn't be any asymmetry in the parsing side between logged in and anon users
[18:32:31] (so the parse load on appservers due to logged-in users should basically disappear)
[18:32:48] <_joe_> well it's the same now, it only appears as long as it's a cache miss
[18:33:56] cscott is saying that even for cache misses, we don't need to suffer the full parse latencies for user-specific rendering in the majority of cases.
[18:34:29] <_joe_> AIUI it works that way now as well though
[18:34:45] <_joe_> that's why wikipedia is not unbearably slow when you're logged in
[18:34:53] <_joe_> unless the page is not in cache anymore
[18:35:08] <_joe_> which usually doesn't happen for large pages that are visited often
[18:35:32] "user" is one of the possible keys to parser cache, i guess it's not always set, we're talking about reducing the cases where it's set still further.
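
A minimal sketch of the "remote call to the parsoid cluster for uncached views" idea discussed above, assuming a hypothetical HTTP endpoint and cache interface. None of the names below are real MediaWiki or Parsoid APIs; they only illustrate the dispatch shape: serve from parser cache when possible, and push the wikitext-to-HTML work to the dedicated cluster on a miss.

```python
# Hypothetical sketch only: the cache object, endpoint URL and function names
# below are placeholders, not real MediaWiki or Parsoid interfaces.
import requests

PARSOID_ENDPOINT = "https://parsoid.svc.example.internal/transform/wikitext/to/html"  # placeholder URL

def get_rendered_html(title: str, wikitext: str, parser_cache) -> str:
    """Serve a read view: parser cache first, remote Parsoid on a miss."""
    cached = parser_cache.get(title)          # assumed cache interface
    if cached is not None:
        return cached                          # cache hit: no parse at all

    # Cache miss: hand the heavy wikitext -> HTML work to the parsoid cluster
    # instead of parsing on the appserver.
    resp = requests.post(
        PARSOID_ENDPOINT,
        json={"title": title, "wikitext": wikitext},
        timeout=30,
    )
    resp.raise_for_status()
    html = resp.text

    parser_cache.set(title, html)              # warm the cache for later readers
    return html
```

The same code path could also be triggered asynchronously from a post-edit job to pre-warm the parser cache, as discussed above, since nothing in it depends on the reader's request.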
[18:36:00] used to be you could (eg) set a custom "stub article length" which would pretty much guarantee that the parser would do a fresh parse for you every time.
[18:37:30] <_joe_> oh nice
[18:44:59] in most cases parsoid just doesn't implement little-used features like "customizable stub length", which is one way to drop that variance :)
[18:45:12] thanks _joe_ for talking this through.
[18:45:53] but in cases where per-user customization is still useful, parsoid does/will do it as a postprocessing step after the page comes out of parsercache, so basically user should (knock on wood) *never* cause a cache miss post-parsoid.
[18:46:55] <_joe_> oh so that's still controlled by parsoid, uhm
[18:51:37] 10serviceops, 10SRE, 10docker-pkg, 10Patch-For-Review, 10Release Pipeline (Blubber): Container image lifecycle management - https://phabricator.wikimedia.org/T287130 (10Dzahn) >>! In T287130#7585276, @RLazarus wrote: > - Regularly rsync the database from the active host to the passive one Hello @RLazaru...
[19:12:02] _joe_ when I say "parsoid" there I really mean something like `ParserOutput::getText() will do it when the parsercache object being fetched is parsoid output` not that it would run on the parsoid cluster or something like that.
[19:22:34] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (Yak Shaving 🐃🪒), 10Upstream: Self-reported GitLab SSH host key fingerprints don't appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (10Legoktm) f it has different sets of keys for the same hostnames...
[19:26:17] * legoktm reads up
[19:37:16] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (Yak Shaving 🐃🪒), 10Upstream: Self-reported GitLab SSH host key fingerprints don't appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (10Dzahn) Yea, well.. unless you argue "if we switch over to anothe...
[20:26:36] 10serviceops, 10Parsoid-Tests, 10SRE, 10Parsoid (Tracking), 10Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10Dzahn) So, I have been debugging this again and summary is: the chain here is (traffic layer) -> envoy (443) -> ngin...
[20:40:27] 10serviceops, 10Parsoid-Tests, 10SRE, 10Parsoid (Tracking), 10Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10Dzahn) >>! In T266509#7195286, @ssastry wrote: > looks like something is intercepting requests to parsoid-rt-tests.wiki...
[20:41:31] ^ if some random backend behind traffic layer that is NOT mediawiki wants to use "/static/" in a URL.. it gets mystery 404s
[20:41:35] due to templates/text-frontend.inc.vcl.erb:if (req.url ~ "^/static/") { set req.http.host = "<%= @vcl_config.fetch("static_host") %>";
[20:42:35] afaict.. and after debugging the entire chain we have on that testreduce1001 host, which is: (traffic sandwich) -> envoy -> nginx proxy -> nodejs ... oh man
[20:43:14] and just because they happen to use /static/ as well.
this was for parsoid-rt-tests.wikimedia.org
[20:44:07] 10serviceops, 10Parsoid-Tests, 10SRE, 10Traffic, and 2 others: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10Dzahn)
[20:46:32] mutante: adding it to https://gerrit.wikimedia.org/g/operations/puppet/+/f65d06ffe59bdc1c30a584b5a48f7237d54af7c1/hieradata/role/common/cache/text.yaml#13 should fix that I think
[20:47:26] 10serviceops, 10Parsoid-Tests, 10SRE, 10Traffic, and 2 others: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10Dzahn) Hey traffic, I added you to this ticket because I think a line in varnish config above, the one that handles URLs with "static" in...
[20:47:35] majavah: oooh, does it? very good
[20:48:20] thanks! want to mention it on the ticket or should I
[20:48:57] feel free to, I'm on my phone
[20:49:08] 'k, will do. ty!
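
For context on the /static/ discussion above, here is a simplified Python model of the routing behaviour being described. It is not the actual VCL or the real hieradata structure, and the backend names are invented; it only illustrates why a non-MediaWiki backend that serves /static/ URLs gets mystery 404s from the generic rewrite, and why listing the hostname explicitly (as suggested above for text.yaml) would avoid it.

```python
# Illustrative model only: not the real VCL logic or hiera data.
# Hostnames and backend names below are examples.

STATIC_HOST = "static-mediawiki.example.org"   # stands in for the "static_host" vcl_config value

# Per-hostname handling, standing in for the list in
# hieradata/role/common/cache/text.yaml that majavah points at above.
KNOWN_BACKENDS = {
    "parsoid-rt-tests.wikimedia.org": "testreduce-backend",  # the proposed fix: list the host here
}

def route(host: str, url: str) -> str:
    """Return which backend a text-cache request would (in this model) be sent to."""
    if host in KNOWN_BACKENDS:
        # Explicitly listed hosts keep their own backend, /static/ included.
        return KNOWN_BACKENDS[host]
    if url.startswith("/static/"):
        # The generic rule quoted above: /static/ requests get their Host
        # rewritten to the MediaWiki static host, which 404s for assets that
        # only exist on the original, non-MediaWiki backend.
        host = STATIC_HOST
    return f"mediawiki-backend-for:{host}"

print(route("parsoid-rt-tests.wikimedia.org", "/static/app.css"))
```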