[00:37:33] 10serviceops, 10Parsoid: Compare Parsoid perf on current production servers vs a newer test server - https://phabricator.wikimedia.org/T297259 (10ssastry) p:05Triage→03High
[00:43:55] 10serviceops, 10Parsoid: Compare Parsoid perf on current production servers vs a newer test server - https://phabricator.wikimedia.org/T297259 (10ssastry) In T269459#7522285, Tim has a table that compares Dodo perf against PHP DOM performance. Over there, the table indicates that with PHP DOM (which is what we...
[00:46:12] 10serviceops, 10Parsoid: Compare Parsoid perf on current production servers vs a newer test server - https://phabricator.wikimedia.org/T297259 (10ssastry)
[01:22:39] 10serviceops, 10Parsoid: Compare Parsoid perf on current production servers vs a newer test server - https://phabricator.wikimedia.org/T297259 (10Legoktm) On a temporary basis, the easiest option is to just depool one of the new appservers we just got in codfw and let you run whatever tests you'd like against...
[01:26:53] 10serviceops, 10Parsoid: Compare Parsoid perf on current production servers vs a newer test server - https://phabricator.wikimedia.org/T297259 (10Legoktm) T155645 has the current (2017) eqiad parsoid server specs, T231255 has current (2019) codfw parsoid server specs, and T271156 has the specs of the new codfw...
[03:41:10] 10serviceops, 10Parsoid: Compare Parsoid perf on current production servers vs a newer test server - https://phabricator.wikimedia.org/T297259 (10ssastry) >>! In T297259#7555306, @Legoktm wrote: > On a temporary basis, the easiest option is to just depool one of the new appservers we just got in codfw and let...
[08:08:08] 10serviceops, 10Parsoid: Compare Parsoid perf on current production servers vs a newer test server - https://phabricator.wikimedia.org/T297259 (10Joe) I would suggest that instead of trying a test server, we should focus on making parsoid tests run on kubernetes, which is where parsoid will be running soon.
[08:19:30] 10serviceops, 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema)
[08:22:26] 10serviceops, 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) 05Stalled→03Resolved
[08:22:38] 10serviceops, 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema)
[10:14:44] 10serviceops, 10Maps, 10Product-Infrastructure-Team-Backlog, 10SRE, 10Service-deployment-requests: New Service Request tegola-vector-tiles - https://phabricator.wikimedia.org/T274390 (10akosiaris) 05Open→03Resolved a:03akosiaris tegola has been deployed for some time now, so I am resolving this. Fe...
[10:18:06] 10serviceops, 10SRE, 10Toolhub, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10akosiaris) This has been deployed for some time so I moved it to the Done column, but I see 2 remaining unchecked items in the Checklist section of...
[14:57:10] _joe_, regarding T297259 ... i had a question. So, when you say parsoid will move to k8s, does that mean we don't have control over what kind of cpus are allocated?
[14:57:22] <_joe_> yes and no :)
[14:57:35] <_joe_> it means that by default we don't
[14:57:58] <_joe_> but we can mark servers as "gen 1/2/3/X" based on their cpu class
[14:58:09] <_joe_> and restrict parsoid pods for synchronous requests to run there
[14:58:59] so, as noted in that task, we are exploring the 'throw hardware at the performance problem' approach, and it is pre-annual-planning time; in product, we need to figure out if we need more hardware resources (of any kind) for 2022.
[14:59:16] <_joe_> right I was about to say
[14:59:26] <_joe_> we should start at least /testing/ parsoid on k8s asap
[15:00:02] <_joe_> I think parsoid could be migrated right now, even. legoktm: IIRC the only shellbox we're missing has nothing to do with parsing, correct?
[15:00:22] we've all been asked to fill out our reqs by dec 17 ... so, in the interest of time, should we separate parsoid-k8s from perf testing?
[15:00:30] <_joe_> oh sigh
[15:00:35] <_joe_> dec 17th is pretty near
[15:00:42] i.e. assuming raw perf will benefit parsoid, presumably that will translate over to k8s as well.
[15:00:48] <_joe_> but we can spin up a "parsoid" test cluster on k8s in a couple days
[15:01:07] <_joe_> and... we can also test php 7.4 :)
[15:01:27] <_joe_> so we get a better idea of what gains we get there too
[15:01:34] to finish my thought .. so running perf tests on raw hardware is a good enough proxy for determining how it will play out on k8s, and so maybe we can go with the depool-a-server strategy?
[15:02:00] but i defer to you .. if you think you can spin up a parsoid test k8s cluster with newer hardware, that of course works for me.
[15:02:12] and yes to php 7.4!
[15:02:34] <_joe_> subbu: no I think it's better if we start testing on k8s sooner than later
[15:02:39] by the time we get parsoid being used for read views on all the big wikis, i expect we might even be on php 8.
[15:02:51] ok, works for me.
[15:02:52] <_joe_> hopefully :D
[15:03:12] <_joe_> so my idea is - by monday give you an IP:port where you can do some perf testing
[15:03:31] <_joe_> and we can also play with pod size / cpu limits / which servers we run on
[15:04:46] sounds good.
[15:08:17] to determine if new hardware will actually help in any meaningful way, i need to compare perf on current prod hardware (eqiad) against newer hardware on k8s .. so, we may still need to depool one eqiad server for the duration of the tests ... unless scandium hardware is equivalent to the wtp10XX hosts. in that case, scandium is a good baseline.
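The "gen 1/2/3/X" idea _joe_ sketches above maps onto ordinary Kubernetes node labels plus a `nodeSelector` on the latency-sensitive deployment. A minimal hypothetical sketch — the label name/value, image, and resource numbers are illustrative, not actual WMF conventions:

```yaml
# Sketch: pin latency-sensitive parsoid pods (synchronous requests)
# to nodes labeled with a newer CPU generation. The label
# "node.example/cpu-generation" is a made-up name for illustration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: parsoid-sync
spec:
  replicas: 4
  selector:
    matchLabels:
      app: parsoid-sync
  template:
    metadata:
      labels:
        app: parsoid-sync
    spec:
      nodeSelector:
        node.example/cpu-generation: "3"   # only schedule on "gen 3" nodes
      containers:
      - name: parsoid
        image: parsoid:example             # placeholder image
        resources:
          requests:
            cpu: "2"
          limits:
            cpu: "4"                       # one knob for "pod size / cpu limits"
```

Nodes would be labeled once (e.g. `kubectl label node NODE node.example/cpu-generation=3`), after which the scheduler restricts these pods to that hardware class, matching the "and restrict parsoid pods for synchronous requests to run there" step.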
[15:09:56] <_joe_> there's a bit more complexity to consider
[15:10:47] <_joe_> but I'll write some more on the task tomorrow
[15:10:53] sounds good! thanks.
[15:11:07] <_joe_> is your goal understanding performance of the single request, or throughput?
[15:11:16] latencies .. so single request.
[15:11:21] <_joe_> ack
[15:11:34] throughput can be handled by throwing more servers / pods ...
[15:11:37] <_joe_> then for that probably physical hardware is ok as well
[15:12:53] basically, we want to get perceived wall clock parse latencies with parsoid to not be wildly off compared to the current scenario ... this matters for sync parse requests .. like breaking events, for ex.
[15:15:24] <_joe_> right
[15:15:59] in the end, this only matters for pages beyond a certain size ... so, while we can tweak parsoid in many ways, using faster hardware may be a more efficient tradeoff from a developer time and effort pov.
[15:16:05] <_joe_> ok, I think we can easily test newer hardware, but maybe
[15:16:38] <_joe_> we can even think of picking special cpus for parsoid in the future
[15:21:25] <_joe_> I would assume we need more hardware and resources if our plan is to move 100% to use parsoid for parsing
[15:22:01] <_joe_> although I could also see us making compromises about the consistency of what we show the users
[15:22:03] well, as we move more page view rendering to parsoid, won't the existing app servers also get freed up?
[15:22:24] <_joe_> well it depends on how we do the parsing
[15:22:44] <_joe_> if we do it via an api call or directly within the request (which is what I'd expect)
[15:22:55] unless we want to allocate a whole new cluster for parsoid and build it out and start retiring the old appservers as traffic to them dies down.
[15:23:21] <_joe_> the transition plan is not clear to me
[15:23:40] <_joe_> but yes, my point was more that parsoid costs more in cpu time than the raw parser
[15:23:52] <_joe_> so for the same amount of work, we need a bit more resources
[15:24:07] yes. true.
[15:24:59] right now, the parsoid<->restbase combo is pre-generation; with the parsoid<->parsercache combo, we can look at whether we want it to be on-demand generation like with the core/legacy parser OR whether we want to also rely on some pre-gen.
[15:25:22] <_joe_> I would greatly prefer relying on pre-gen
[15:25:24] pre-gen won't have the same latency demands.
[15:26:53] scott has opinions on all this too. but yes, we can explore pre-gen but need to figure out how to hook that all up in the parsoid <-> parsercache combo.
[15:27:24] but, it should definitely be possible if that is the preferred approach.
[15:28:47] <_joe_> it would save all of us a lot of headaches and our users a lot of tail-end slowness
[15:28:58] <_joe_> but yes that's something we can discuss later
[15:29:57] <_joe_> for now I think just evaluating performance on a newer cpu and maybe even php 7.2 vs 7.4 would be important
[15:30:34] <_joe_> I *think* we should be able to do that with minimal effort
[15:31:08] <_joe_> subbu: you're just interested in using the test suite you use for parsoid, right?
[15:32:15] no, actually on wikipedia pages of different sizes.
[15:32:25] <_joe_> oh ok
[15:32:38] <_joe_> for that kind of stuff SRE has a "standard" benchmark tool
[15:33:04] ok.
[15:33:10] <_joe_> which is nothing more than a horrible script I created to run perf experiments and legoktm and others turned into semi-plausible software
[15:33:32] <_joe_> so if you provide us some URLs to test, we can compare as much as possible
[15:34:11] <_joe_> please keep in mind, serviceops is understaffed right now, so we'll do our best but the deadline of Dec 17th is quite strict
[15:34:32] sure .. i can come up with a set of urls for you to run the benchmarking.
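The "give us URLs, we compare latencies" workflow above can be sketched as a tiny per-request latency probe. This is a hypothetical stand-in, not the actual SRE benchmark tool _joe_ mentions; the example endpoint in the comment is a placeholder:

```python
# Hypothetical single-request latency probe for a URL list,
# reporting median and p95 wall-clock times per URL.
import statistics
import time
import urllib.request

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def summarize(samples):
    """Return (median, p95) of one URL's wall-clock latency samples."""
    return statistics.median(samples), percentile(samples, 95)

def time_request(url, runs=10):
    """Fetch a URL `runs` times, returning per-request wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        urllib.request.urlopen(url).read()
        samples.append(time.monotonic() - start)
    return samples

def bench(urls):
    """Print median/p95 latency per URL; aim it at a depooled test host."""
    for url in urls:
        med, p95 = summarize(time_request(url))
        print(f"{url}: median={med:.3f}s p95={p95:.3f}s")

# Example (placeholder endpoint, not a real WMF URL):
# bench(["http://test-host.example/w/rest.php/v1/page/Example/html"])
```

Median and p95 (rather than the mean) match the stated goal: keeping perceived latencies and tail-end slowness on sync parse requests from being wildly off versus current production.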
[15:34:44] and understood re: dec 17 .. i will work with product on my end.
[15:34:54] <_joe_> we can probably depool one of the latest-gen appservers and use it for parsoid testing to give you some preliminary results
[15:35:17] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 (10JMeybohm) p:05High→03Medium >>! In T296303#7530670, @akosiaris wrote: > But then it hit me. Strict Affinity in IPAMConfig is...
[15:35:26] ok.
[15:37:24] alright, later today, i'll dump a set of titles on the phab task for benchmarking.
[15:37:49] thanks!
[15:37:58] <_joe_> I also suggest you reach out to my manager, while he's still technically sound and not lost in the ocean of excel macros
[15:38:01] <_joe_> :P
[15:38:14] lol.
[15:38:39] <_joe_> one day I'll have to repay akosiaris for all this banter
[15:38:43] <_joe_> but that's for later
[15:38:58] reach out to your mgr and give him a heads up that i'm asking for your involvement in this performance benchmarking?
[15:39:10] or a heads up about the phab task?
[15:39:36] did you lose some bet with akosiaris? :-)
[15:39:48] <_joe_> subbu: akosiaris is my manager now :D
[15:40:11] ah, oookaayy!
[15:40:13] <_joe_> I just realized you might not know it
[15:40:24] <_joe_> that's also (part of) why we're shorthanded
[15:40:37] right, i remember now about the departure.
[15:40:42] <_joe_> alex is dealing with management stuff and not with computers
[15:40:59] <_joe_> but yeah, let him know you need this with some urgency
[15:41:09] <_joe_> I have a clear idea of what we'll need to do now though
[15:41:14] great!
[15:41:46] <_joe_> I would've more strongly suggested testing on k8s immediately if we were interested in scalability, as I suspect having fewer workers per php-fpm instance will help with parsoid as well
[15:43:34] right .. makes sense. i think we may need both in the end. but, the k8s throughput / scalability part can be resolved separately.
[15:43:48] <_joe_> yep
[15:45:07] alrighty then! i'll go get some breakfast and pick up these threads in the afternoon.
[15:50:27] 10serviceops, 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) wow, what an epic task this was with many subtasks. quite the journey. congrats and thanks to all
[15:57:16] lol
[15:57:55] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 (10akosiaris) >>! In T296303#7556538, @JMeybohm wrote: >>>! In T296303#7530670, @akosiaris wrote: >> But then it hit me. Strict Affi...
[16:06:12] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 (10JMeybohm) I'm still lost unfortunately. My understanding was that `calicoctl ipam configure --strictaffinity=true` is basically t...
[16:14:18] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 (10akosiaris) > Are you saying that, with strictAffinity: true no borrowing is happening regardless of the Strict Affinity value of...
[16:17:21] _joe_: correct, the remaining shellbox is just on file uploads
[16:20:49] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 (10JMeybohm) >>! In T296303#7556639, @akosiaris wrote: >> Are you saying that, with strictAffinity: true no borrowing is happening r...
[16:23:59] 10serviceops, 10SRE, 10Toolhub, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10bd808) >>! In T280881#7555760, @akosiaris wrote: > @bd808, any news on those? It was not clear to me that these were tasks that the service reque...
[16:39:17] 10serviceops, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Automate issuing of TLS certificates in kubernetes clusters - https://phabricator.wikimedia.org/T294560 (10JMeybohm)
[16:43:03] 10serviceops, 10Security-Team, 10GitLab (CI & Job Runners), 10Patch-For-Review, and 2 others: Setup GitLab Runner in trusted environment - https://phabricator.wikimedia.org/T295481 (10Jelto) Trusted runners need some additional configuration parameters (like enable Prometheus metrics). I tried to move some...
[16:45:32] _joe_: so to summarize, for now we'll just depool one of the brand new appservers for Parsoid testing, but we also want to bump up the priority for parsoid-on-k8s?
[16:46:33] <_joe_> I don't think the latter is strictly needed now
[17:00:58] 10serviceops, 10MW-on-K8s, 10SRE-swift-storage, 10Shellbox: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Legoktm) I thought I had replied earlier, for now the plan is to test POSTing large files to Shellbox, identify what layers it fails at and fix those. A basic test wou...
[17:06:59] 10serviceops, 10Parsoid: Compare Parsoid perf on current production servers vs a newer test server - https://phabricator.wikimedia.org/T297259 (10Legoktm) Full logs are at https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-serviceops/20211208.txt - summary of today's IRC convo: * Goal is to understand perform...
[17:08:09] _joe_, subbu: ^ please review for accuracy
[17:20:19] legoktm, lgtm but maybe note that the dec 17 deadline is tight for serviceops and that the deadline might slip and that i'll manage that on my end.
[17:32:09] I'll leave a couple of comments on there.
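On the T296303 strict-affinity thread above: in Calico, strict block affinity is a cluster-wide IPAM setting, which is what `calicoctl ipam configure --strictaffinity=true` toggles. As a sketch, the equivalent declarative resource (assuming a Calico v3.x cluster; defaults elsewhere untouched):

```yaml
# Rough equivalent of `calicoctl ipam configure --strictaffinity=true`.
# With strictAffinity enabled, a node allocates pod IPs only from
# blocks affine to it and does not borrow addresses from blocks
# affine to other nodes.
apiVersion: projectcalico.org/v3
kind: IPAMConfiguration
metadata:
  name: default
spec:
  strictAffinity: true
  autoAllocateBlocks: true
```

This interacts with per-node block availability, which is why new nodes can end up with no IPv4 block assigned when the pool is exhausted, as the task discusses.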
[17:36:39] 10serviceops, 10Parsoid: Compare Parsoid perf on current production servers vs a newer test server - https://phabricator.wikimedia.org/T297259 (10ssastry) * We are also interested in a baseline, so numbers on an eqiad cluster server, after depooling * I believe the Dec 17 deadline is tight and so there aren't p...
[18:32:24] 10serviceops, 10Phabricator, 10Release-Engineering-Team (Next): Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (10thcipriani)
[19:01:22] 10serviceops, 10SRE, 10foundation.wikimedia.org, 10User-Urbanecm_WMF (GovWiki): Investigate and restore foundationwiki 302 httpbb test - https://phabricator.wikimedia.org/T296687 (10Urbanecm_WMF) a:05Urbanecm_WMF→03RLazarus Hello @RLazarus, I discussed this internally, and the conclusion was the hard r...
[19:06:23] 10serviceops, 10SRE, 10foundation.wikimedia.org, 10User-Urbanecm_WMF (GovWiki): Investigate foundationwiki 302 httpbb test - https://phabricator.wikimedia.org/T296687 (10RLazarus) 05Open→03Resolved Perfect! Agree there's no need for a test in that case, we can call this finished. Thanks for following up.
[19:06:37] 10serviceops, 10SRE, 10foundation.wikimedia.org, 10User-Urbanecm_WMF (GovWiki): Investigate foundationwiki 302 httpbb test - https://phabricator.wikimedia.org/T296687 (10RLazarus)
[20:11:37] 10serviceops, 10Parsoid: Compare Parsoid perf on current production servers vs a newer test server - https://phabricator.wikimedia.org/T297259 (10ssastry) @legoktm, @Joe I am starting with the url list in the description of {T280497} and adding one more to it (a page on which Arlo and Tim worked to fix Parsoid...
[21:23:19] 10serviceops, 10SRE, 10foundation.wikimedia.org, 10User-Urbanecm_WMF (GovWiki): Investigate foundationwiki 302 httpbb test - https://phabricator.wikimedia.org/T296687 (10Urbanecm_WMF)