[00:52:03] 10serviceops, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team), 10User-brennen, 10Wikimedia-production-error: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (10Krinkle) While the instability and latency problem nev...
[11:58:19] This might be a big enough topic to require a synchronous conversation but - do ye have any immediate thoughts on what would be a good layout for thumbor in k8s as regards ingress and pod configuration? https://phabricator.wikimedia.org/T233196
[11:58:54] The migration to buster+py3 is done, the migration to bullseye is underway, there's a basic blubber config done, and moving towards helm config etc is probably the next step
[12:00:02] second question around that: are ye okay with me doing the implementation here? I'm aware that thumbor's ownership is still a bit nebulous but lies mostly with PET at the moment, but for example this ticket is still a serviceops one etc :)
[12:04:46] 10serviceops, 10Editing-Team-Request, 10Editing-team, 10MediaWiki-extensions-Score, and 3 others: Reduce Lilypond shellouts from VisualEditor - https://phabricator.wikimedia.org/T312319 (10CDanis) @Esanders @VPuffetMichel hello from SRE, just wanted to make sure this task was on your radar for a quick patc...
[12:48:21] hnowlan: I don't have any immediate thoughts apart from the fact that we probably want thumbor behind LVS directly rather than using ingress. We still don't have experience with high-traffic/direct user path stuff and ingress
[12:49:19] but I don't have solid thumbor knowledge in general. So I bet there will be other things to consider
[12:59:45] jayme: cool, sounds reasonable.
[13:00:18] My main concern is the fact that it's currently using haproxy to balance connections to a very large number of instances, and I feel like we can't do a 1-for-1 replacement of instances in k8s given the sheer number
[13:00:22] plus I'd say we're totally happy with you implementing it :)
[13:00:55] but wasn't that like 160 instances per DC?
[13:02:38] yeah
[13:09:14] it's okay to run 160 pods per DC
[13:22:57] ah, cool :)
[13:28:17] each pod has a max concurrency of 1, right?
[13:28:38] or rather -- each thumbor process
[13:46:56] that's what I understood
[13:50:15] yeah
[14:27:45] <_joe_> hnowlan: hi
[14:28:24] <_joe_> so yeah thumbor's main issue with moving to k8s is how to transition to a setup that makes sense in pods
[14:28:45] <_joe_> I would advise against running 1 thumbor process per pod, frankly
[14:29:08] <_joe_> that would mean moving the function that haproxy serves now from haproxy to kubernetes
[14:29:39] <_joe_> which I think is the kind of thing cdanis was worried about re: shellbox
[14:30:11] yeah
[14:30:19] <_joe_> where I think the problem is smaller, here it would indeed be pathological
[14:31:28] <_joe_> so yeah we're running 1 thumbor thread per CPU right now
[14:31:39] <_joe_> which is.. suboptimal, but ok
[14:31:47] <_joe_> let's assume we stay the same
[14:31:58] <_joe_> I can imagine a thumbor pod having say 8 workers
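(Editorial aside: a rough illustration of the "many workers per pod" layout _joe_ sketches above, rather than 160 single-worker pods. This is purely hypothetical — the chart structure, image name, ports, and resource figures below are assumptions for illustration, not the actual thumbor deployment.)

```yaml
# Hypothetical sketch of the "~8 workers per pod" layout (illustrative only;
# names, ports, and sizes are invented, not the real thumbor chart).
# 160 workers per DC at 8 workers/pod means ~20 pod replicas instead of 160
# single-worker pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thumbor
spec:
  replicas: 20            # ~160 workers / 8 workers per pod
  selector:
    matchLabels:
      app: thumbor
  template:
    metadata:
      labels:
        app: thumbor
    spec:
      containers:
        # One single-threaded thumbor worker per container, each on its own
        # port; something in front of them (see the haproxy discussion that
        # follows) spreads requests across the 8 local workers.
        - name: thumbor-8800
          image: docker-registry.example/thumbor:latest   # placeholder image
          args: ["thumbor", "--port=8800"]
          resources:
            requests:
              cpu: "1"        # 1 thumbor worker ~= 1 CPU, per the discussion
              memory: 1Gi     # made-up figure; no real numbers were given
        # ... repeated for ports 8801-8807 (elided)
```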
[14:32:04] how big is a thumbor instance's memory footprint?
[14:32:15] <_joe_> cdanis: depends on input sadly
[14:32:26] <_joe_> and still having haproxy in front anyways
[14:32:41] <_joe_> but I don't have any numbers
[14:32:45] <_joe_> (re: memory)
[14:33:01] nod
[14:33:37] I've run setups like this before _joe_, it's suboptimal but tolerable when your 'RPCs' take seconds to a minute
[14:33:42] and it is quite pathological if they are fast
[14:33:50] and yeah, 8 workers is about what I was thinking
[14:34:01] you want the CPU to RAM ratio to approx match that of the overall cluster
[14:34:09] or rather, that of the median machine in the cluster
[14:34:15] otherwise you run into binpacking problems later
[14:34:55] <_joe_> cdanis: yeah but thankfully we won't be running so many thumbor pods for that to become a problem
[14:35:04] I don't have good maths on it but thumbor is relatively conservative as far as memory usage goes, a lot heavier on CPU
[14:35:21] I have half a dashboard for digging into that, will share when done
[14:35:23] <_joe_> hnowlan: thumbor is a fancy frontend for imagemagick for us :)
[14:35:35] _joe_: ok good :)
[14:35:41] _joe_: so you envision having haproxy still in place in k8s?
[14:36:04] <_joe_> hnowlan: we could use something else, but why move away from a setup that basically works for us?
[14:36:15] <_joe_> hnowlan: but if you have alternative ideas, I'm all ears :)
[14:36:26] I'm imagining each pod has about 8-16 workers with an haproxy in front and whatever sidecars are needed
[14:36:36] <_joe_> ^^
[14:36:51] <_joe_> we can find out the ideal number of workers/cpu/etc later
[14:36:58] <_joe_> but that's basically how I see it
[14:37:00] make it parameterizable in the helm chart if that is easy
[14:37:06] <_joe_> it is
[14:39:04] _joe_: nothing too creative from me, sidecar seems like a sensible approach
[14:39:31] and yeah they just seem like numbers we can tune up and down
[14:42:40] if we run multiple threads per pod, do we still need haproxy in front?
[14:43:22] or could we rely on readiness probes in that case? I'm not sure what else it is that haproxy does in that setup
[14:45:15] different load balancing than RR maybe?
[14:46:27] or is that one haproxy per pod? Load-balancing the workers in that pod?
[14:47:48] one haproxy per pod
[14:48:44] ah, okay. Makes more sense now. Thanks
[15:12:23] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Cmjohnson) @akosiaris Any update on moving forward with this decom? I could really use the rack space.
[15:35:38] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Cmjohnson)
[15:54:44] 10serviceops, 10Editing-Team-Request, 10Editing-team, 10MediaWiki-extensions-Score, and 3 others: Reduce Lilypond shellouts from VisualEditor - https://phabricator.wikimedia.org/T312319 (10Esanders) > One point of clarification: I originally thought the debounce value meant "we'll parse after the user stop...
[22:27:13] 10serviceops, 10Observability-Logging: Increase of ~50 million access logs per day from mobileapps-production-tls-proxy - https://phabricator.wikimedia.org/T313099 (10colewhite)
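(Editorial aside: pulling together the tunables discussed between 14:36 and 14:48 — workers per pod, a per-pod haproxy, and a balancing algorithm other than round-robin — this is a sketch of how they might surface in a chart's values file. All key names and figures are invented for illustration and do not reflect the actual deployment-charts layout.)

```yaml
# Hypothetical values.yaml fragment for a thumbor chart (illustrative only;
# key names are invented, not taken from the real chart).
thumbor:
  workers_per_pod: 8        # "numbers we can tune up and down"
  base_port: 8800           # workers listen on base_port .. base_port+N-1
  resources:
    cpu_per_worker: 1       # 1 thumbor worker ~= 1 CPU
    memory_per_worker: 1Gi  # placeholder; no real measurements in the log

# One haproxy per pod, load-balancing only that pod's local workers.
haproxy:
  enabled: true
  balance: leastconn        # e.g. something other than plain round-robin,
                            # useful when each worker has a concurrency of 1
  check_interval: 2s        # haproxy-level health checks on each worker,
                            # on top of the pod's own readiness probe

replicas: 20                # workers_per_pod * replicas ~= 160 workers/DC
```

One way to read the haproxy-vs-readiness-probe question above: a readiness probe only gates traffic to the pod as a whole, while a per-pod haproxy can queue and spread requests across the single-concurrency workers inside it, which is roughly the role haproxy plays on the current bare-metal hosts.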