[00:52:03] 10serviceops, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team), 10User-brennen, 10Wikimedia-production-error: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Krinkle) While the instability and latency problem nev...
[11:58:19] This might be a big enough topic to require a synchronous conversation but - do ye have any immediate thoughts on what would be a good layout for thumbor in k8s as regards ingress and pod configuration? https://phabricator.wikimedia.org/T233196
[11:58:54] The migration to buster+py3 is done, the migration to bullseye is underway, there's a basic blubber config done, and moving towards helm config etc is probably the next step
[12:00:02] second question around that: are ye okay with me doing the implementation here? I'm aware that thumbor's ownership is still a bit nebulous but lies mostly with PET at the moment, but for example this ticket is still a serviceops one etc :)
[12:04:46] 10serviceops, 10Editing-Team-Request, 10Editing-team, 10MediaWiki-extensions-Score, and 3 others: Reduce Lilypond shellouts from VisualEditor - https://phabricator.wikimedia.org/T312319 (10CDanis) @Esanders @VPuffetMichel hello from SRE, just wanted to make sure this task was on your radar for a quick patc...
[12:48:21] hnowlan: I don't have any immediate thoughts apart from the fact that we probably want thumbor behind LVS directly rather than using ingress. We still don't have experience with high-traffic/direct user path stuff and ingress
[12:49:19] but I don't have solid thumbor knowledge in general. So I bet there will be other things to consider
[12:59:45] jayme: cool, sounds reasonable.
[13:00:18] My main concern is the fact that it's currently using haproxy to balance connections to a very large number of instances, and I feel like we can't do a 1-for-1 replacement of instances in k8s given the sheer number
[13:00:22] plus I'd say we're totally happy with you implementing it :)
[13:00:55] but wasn't that like 160 instances per DC?
[13:02:38] yeah
[13:09:14] it's okay to run 160 pods per DC
[13:22:57] ah, cool :)
[13:28:17] each pod has a max concurrency of 1, right?
[13:28:38] or rather -- each thumbor process
[13:46:56] that's what I understood
[13:50:15] yeah
[14:27:45] <_joe_> hnowlan: hi
[14:28:24] <_joe_> so yeah thumbor's main issue with moving to k8s is how to transition to a setup that makes sense in pods
[14:28:45] <_joe_> I would advise against running 1 thumbor process per pod, frankly
[14:29:08] <_joe_> that would mean moving the function that haproxy serves now from haproxy to kubernetes
[14:29:39] <_joe_> which I think is the kind of thing cdanis was worried about re: shellbox
[14:30:11] yeah
[14:30:19] <_joe_> where I think the problem is smaller, here it would indeed be pathological
[14:31:28] <_joe_> so yeah we're running 1 thumbor thread per CPU right now
[14:31:39] <_joe_> which is.. suboptimal, but ok
[14:31:47] <_joe_> let's assume we stay the same
[14:31:58] <_joe_> I can imagine a thumbor pod having say 8 workers
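(Editorial aside: a rough illustration of the "many workers per pod" layout _joe_ sketches above, rather than 160 single-worker pods. This is purely hypothetical — the chart structure, image name, ports, and resource figures below are assumptions for illustration, not the actual thumbor deployment.)

```yaml
# Hypothetical sketch of the "~8 workers per pod" layout (illustrative only;
# names, ports, and sizes are invented, not the real thumbor chart).
# 160 workers per DC at 8 workers/pod means ~20 pod replicas instead of 160
# single-worker pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thumbor
spec:
  replicas: 20            # ~160 workers / 8 workers per pod
  selector:
    matchLabels:
      app: thumbor
  template:
    metadata:
      labels:
        app: thumbor
    spec:
      containers:
        # One single-threaded thumbor worker per container, each on its own
        # port; something in front of them (see the haproxy discussion that
        # follows) spreads requests across the 8 local workers.
        - name: thumbor-8800
          image: docker-registry.example/thumbor:latest   # placeholder image
          args: ["thumbor", "--port=8800"]
          resources:
            requests:
              cpu: "1"        # 1 thumbor worker ~= 1 CPU, per the discussion
              memory: 1Gi     # made-up figure; no real numbers were given
        # ... repeated for ports 8801-8807 (elided)
```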
[14:32:04] how big is a thumbor instance's memory footprint?
[14:32:15] <_joe_> cdanis: depends on input sadly
[14:32:26] <_joe_> and still having haproxy in front anyways
[14:32:41] <_joe_> but I don't have any numbers
[14:32:45] <_joe_> (re: memory)
[14:33:01] nod
[14:33:37] I've run setups like this before _joe_, it's suboptimal but tolerable when your 'RPCs' take seconds to a minute
[14:33:42] and it is quite pathological if they are fast
[14:33:50] and yeah, 8 workers is about what I was thinking
[14:34:01] you want the CPU to RAM ratio to approx match that of the overall cluster
[14:34:09] or rather, that of the median machine in the cluster
[14:34:15] otherwise you run into binpacking problems later
[14:34:55] <_joe_> cdanis: yeah but thankfully we won't be running so many thumbor pods for that to become a problem
[14:35:04] I don't have good maths on it but thumbor is relatively conservative as far as memory usage goes, a lot heavier on CPU
[14:35:21] I have half a dashboard for digging into that, will share when done
[14:35:23] <_joe_> hnowlan: thumbor is a fancy frontend for imagemagick for us :)
[14:35:35] _joe_: ok good :)
[14:35:41] _joe_: so you envision having haproxy still in place in k8s?
[14:36:04] <_joe_> hnowlan: we could use something else, but why move away from a setup that basically works for us?
[14:36:15] <_joe_> hnowlan: but if you have alternative ideas, I'm all ears :)
[14:36:26] I'm imagining each pod has about 8-16 workers with an haproxy in front and whatever sidecars are needed
[14:36:36] <_joe_> ^^
[14:36:51] <_joe_> we can find out the ideal number of workers/cpu/etc later
[14:36:58] <_joe_> but that's basically how I see it
[14:37:00] make it parameterizable in the helm chart if that is easy
[14:37:06] <_joe_> it is
[14:39:04] _joe_: nothing too creative from me, sidecar seems like a sensible approach
[14:39:31] and yeah they just seem like numbers we can tune up and down
[14:42:40] if we run multiple threads per pod, do we still need haproxy in front?
[14:43:22] or could we rely on readiness probes in that case? I'm not sure what else it is that haproxy does in that setup
[14:45:15] different load balancing than RR maybe?
[14:46:27] or is that one haproxy per pod? Load-balancing the workers in that pod?
[14:47:48] one haproxy per pod
[14:48:44] ah, okay. Makes more sense now. Thanks
[15:12:23] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Cmjohnson) @akosiaris Any update on moving forward with this decom? I could really use the rack space.
[15:35:38] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Cmjohnson)
[15:54:44] 10serviceops, 10Editing-Team-Request, 10Editing-team, 10MediaWiki-extensions-Score, and 3 others: Reduce Lilypond shellouts from VisualEditor - https://phabricator.wikimedia.org/T312319 (10Esanders) > One point of clarification: I originally thought the debounce value meant "we'll parse after the user stop...
[22:27:13] 10serviceops, 10Observability-Logging: Increase of ~50 million access logs per day from mobileapps-production-tls-proxy - https://phabricator.wikimedia.org/T313099 (10colewhite)
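(Editorial aside: pulling together the tunables discussed between 14:36 and 14:48 — workers per pod, a per-pod haproxy, and a balancing algorithm other than round-robin — this is a sketch of how they might surface in a chart's values file. All key names and figures are invented for illustration and do not reflect the actual deployment-charts layout.)

```yaml
# Hypothetical values.yaml fragment for a thumbor chart (illustrative only;
# key names are invented, not taken from the real chart).
thumbor:
  workers_per_pod: 8        # "numbers we can tune up and down"
  base_port: 8800           # workers listen on base_port .. base_port+N-1
  resources:
    cpu_per_worker: 1       # 1 thumbor worker ~= 1 CPU
    memory_per_worker: 1Gi  # placeholder; no real measurements in the log

# One haproxy per pod, load-balancing only that pod's local workers.
haproxy:
  enabled: true
  balance: leastconn        # e.g. something other than plain round-robin,
                            # useful when each worker has a concurrency of 1
  check_interval: 2s        # haproxy-level health checks on each worker,
                            # on top of the pod's own readiness probe

replicas: 20                # workers_per_pod * replicas ~= 160 workers/DC
```

One way to read the haproxy-vs-readiness-probe question above: a readiness probe only gates traffic to the pod as a whole, while a per-pod haproxy can queue and spread requests across the single-concurrency workers inside it, which is roughly the role haproxy plays on the current bare-metal hosts.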