[00:07:13] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye
[00:25:41] 10serviceops, 10Patch-For-Review: httpbb shouldn't alert when large pages are occasionally slow - https://phabricator.wikimedia.org/T323707 (10RLazarus) 05Open→03Resolved >>! In T323707#8419423, @Joe wrote: > Maybe if the page we're trying to fetch is that cumbersome, we should switch to a different, light...
[00:29:33] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye completed: -...
[00:33:30] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Papaul)
[00:34:31] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Papaul) 05Open→03Resolved This is done
[07:18:57] 10serviceops, 10Patch-For-Review: Put mw14[57-98] in production - https://phabricator.wikimedia.org/T313327 (10Joe) 05Open→03In progress
[07:19:03] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Joe)
[08:03:32] 10serviceops, 10Patch-For-Review: Put mw14[57-98] in production - https://phabricator.wikimedia.org/T313327 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f3f4f962-02d1-4ce5-ba6c-e8c63e34958f) set by oblivian@cumin1001 for 3:00:00 on 42 host(s) and their services with reason: Appservers `...
[08:59:35] hi -- FYI I'm planning to fail over to graphite1005 tomorrow, mw-config change is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/861361 and I'll grab a deployment window after the UTC morning backport, anything else I should know?
[09:07:04] <_joe_> godog: just to check, did you verify we're allowing connections to graphite1005 from k8s?
[09:07:16] <_joe_> we might need to add it to our default network policy I guess
[09:08:27] _joe_: I did, you were even on the review :P https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/859575
[09:08:37] <_joe_> ahah sorry
[09:09:19] <_joe_> 1 week ago, yeah my brain has already evicted that cache
[09:09:57] fair enough, anything I should do to verify the change or once deployed that's it?
[09:11:05] <_joe_> that's it
[09:11:18] <_joe_> we don't really need to verify it I would say
[09:11:43] <_joe_> we can get into a mw pod and verify we can telnet to graphite1005
[09:12:13] SGTM, happy to do that (== copy/paste commands!)
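Before the debugging walkthrough that follows, one way to double-check the network-policy side of godog's deployment-charts change. This is a hedged sketch only: the policy name is a placeholder, and the real egress rules list IPs and ports rather than the graphite hostname.

    # On the deployment server, point kubectl at the mw-debug release in eqiad
    kube_env mw-debug eqiad
    # List the NetworkPolicy objects in the namespace and inspect the egress rules,
    # looking for graphite1005's IP and the statsd/carbon ports added by the change
    kubectl get networkpolicy
    kubectl describe networkpolicy <policy-name>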
[09:12:36] in other words: I have no idea what I'm doing
[09:12:42] <_joe_> ahah don't worry, I can do that, but also I'm happy to guide you through it
[09:13:00] <_joe_> if you want to learn how to debug stuff on k8s, this is a good start :P
[09:13:15] yeah might as well try, if you have five minutes for sure
[09:13:59] if not and you'd rather just do it that's fine too
[09:14:57] <_joe_> so, the process is, in a nutshell:
[09:15:03] <_joe_> * find a pod you want to debug
[09:15:09] <_joe_> * ssh to the node where it's running
[09:15:26] <_joe_> * find the PID of the container you want to inspect on that host
[09:15:51] <_joe_> * use nsenter -t -n to execute commands in the network namespace of that container
[09:16:51] <_joe_> so step 1: ssh to the deployment server, kube_env mw-debug eqiad
[09:17:05] <_joe_> to get the config for the service mw-debug in the eqiad cluster
[09:17:17] <_joe_> kubectl get pods gives you the list of pods there
[09:17:23] thank you, yeah I was about to say, I got to https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#Managing_pods,_jobs_and_cronjobs re: kube_env
[09:18:33] <_joe_> kubectl describe pod mediawiki-pinkunicorn-758f67f5c4-bvhgf | grep ^Node: will tell you where this pod runs
[09:18:51] <_joe_> you also want to write down the docker id of a container from kubectl describe
[09:19:34] <_joe_> so now I picked 5f4842f85887c7b97b47d49464fb747b1e1dafdadb0de34ba2cb474fbd04674e, which is the httpd container
[09:20:11] *nod* makes sense
[09:20:14] <_joe_> on the node I found, kubernetes1019, I run sudo docker top to find the PID on the host
[09:20:29] <_joe_> that pid I'll use in nsenter
[09:22:03] ok I think I got it! thank you
[09:22:49] ok of course this is udp/statsd so "who knows"
[09:23:05] <_joe_> nc is on the server
[09:24:46] yeah but it isn't like I'm getting stuff back if the "connection" isn't working
[09:26:14] that's ok though I'll check graphite1005
[09:28:40] but yeah works as expected, we're good
[09:28:48] _joe_: thank you for the hand holding
[09:29:21] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 3 others: Upgrade maps servers to node >= 14 - https://phabricator.wikimedia.org/T321789 (10WMDE-Fisch)
[09:30:17] <_joe_> godog: tbh I should've worked more on https://github.com/lavagetto/k8sh which automates most of this stuff
[09:30:37] <_joe_> when I feel like it I'll spend another weekend on it
[09:30:57] 10serviceops, 10Platform Engineering, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, and 4 others: Move Kartotherian to Kubernetes - https://phabricator.wikimedia.org/T216826 (10WMDE-Fisch)
[09:31:06] heheh nice
[09:38:14] 10serviceops, 10WMDE-GeoInfo-FocusArea, 10WMDE-TechWish-Maintenance, 10Maps (Kartotherian), and 3 others: Create helm chart for kartotherian k8s deployment - https://phabricator.wikimedia.org/T231006 (10WMDE-Fisch)
[09:41:04] Can we close this https://phabricator.wikimedia.org/T276994 or do we think there's still things to do to replicate standard mwdebug?
[09:53:36] 10serviceops, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, and 2 others: Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Clement_Goubert) I've prepared the change. Tell me when you prod on meta and I'll merge it.
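Putting _joe_'s 09:14-09:20 walkthrough together into one copy/paste-able sequence. This is a hedged sketch: the statsd port (8125/udp) and the graphite FQDN are assumptions, not taken from the chat.

    # On the deployment server: point kubectl at mw-debug in eqiad and pick a pod
    kube_env mw-debug eqiad
    kubectl get pods
    kubectl describe pod mediawiki-pinkunicorn-758f67f5c4-bvhgf | grep ^Node:
    # ...and note one of its container IDs from the same describe output.

    # On that node (kubernetes1019 in this case): map the container ID to its PID on the host
    sudo docker top 5f4842f85887c7b97b47d49464fb747b1e1dafdadb0de34ba2cb474fbd04674e

    # Enter only that PID's network namespace and send a test statsd metric.
    # Being UDP, you get no reply either way, which is why godog confirms receipt
    # on the graphite1005 side above.
    sudo nsenter -t <PID> -n -- nc -u -w1 graphite1005.eqiad.wmnet 8125 <<< 'test.k8s.connectivity:1|c'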
[10:32:31] <_joe_> claime: "engineers can deploy experimental code without impeding others"
[10:32:41] <_joe_> that's still needed :)
[10:33:52] ah, right
[10:54:17] 10serviceops, 10Maps, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban), 10User-jijiki: Disable unused services on maps nodes - https://phabricator.wikimedia.org/T298246 (10hnowlan)
[11:03:45] <_joe_> hnowlan: I think we'll have to roll-restart changeprop
[11:04:04] <_joe_> because I am adding new jobrunners and removing old ones
[11:04:08] 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, 10Platform Team Workboards (Platform Engineering Reliability): 3d2png failing in Kubernetes - https://phabricator.wikimedia.org/T323936 (10JMeybohm) > However, according to graphs memory usage is looking pretty meager - something is missing. Keep in m...
[11:06:31] _joe_: ack - when?
[11:07:21] <_joe_> hnowlan: I'd wait until the afternoon to see if things normalize naturally
[11:08:17] cool
[11:49:31] 10serviceops: Put mw14[57-98] in production - https://phabricator.wikimedia.org/T313327 (10Joe) 05In progress→03Resolved
[11:49:38] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Joe)
[11:51:24] 10serviceops, 10DC-Ops, 10SRE, 10ops-eqiad, 10Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Joe) @RLazarus you can proceed with the decommissioning steps whenever you're ready. The servers are still in rotation as of now, and will need to be depooled first. I m...
[12:07:42] hiccup with thumbor - firejail would previously at least limit memory use when doing something like an STL conversion using xvfb+3d2png, but on k8s the subprocess just tries to use a lot of memory and gets OOM-killed by the kernel on the kube node. if you give a worker a memory limit of 2GB it works, but that's not realistic and means one worker per pod
[12:13:39] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team (Priority Backlog 📥): mwdebug: people in the "deployment" group should be able to launch 'experimental' instances for testing purposes - https://phabricator.wikimedia.org/T324003 (10Joe)
[12:14:55] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team (Priority Backlog 📥): mwdebug: people in the "deployment" group should be able to launch 'experimental' instances for testing purposes - https://phabricator.wikimedia.org/T324003 (10Joe) p:05Triage→03Medium
[13:14:16] <_joe_> hnowlan: so you want to limit the memory used by a subprocess
[13:14:22] <_joe_> ugh that is indeed tricky
[13:14:46] <_joe_> jayme, akosiaris ^^ any ideas?
[13:23:50] I'm not sure I understand why this did not happen in firejail. Wouldn't the process have been killed there as well?
[13:30:25] I'm not hugely familiar with firejail myself but it doesn't seem so
[13:31:01] if we could, raising per-pod limits to allow for spikes would be the path of least work
[14:00:17] hm. maybe some application is aware of the limit in the firejail context but not in k8s, and just tries to allocate a bunch of memory because it does not see the limit
[14:00:31] although I had assumed firejail would enforce that via cgroups as well
[14:11:18] <_joe_> jayme: maybe the user we run as in k8s doesn't have the ability to find memory limits properly, or something like that
[14:12:45] maybe...or they read from /proc/meminfo or vmstat instead of cgroups memory.stat
[14:13:17] does firejail maybe provide correct values in meminfo/vmstat?
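A minimal sketch of the mismatch jayme is hypothesising: inside a container, /proc/meminfo still reports the node's total memory, while the enforced limit only shows up in the cgroup files. Which of the two cgroup paths applies depends on the node's cgroup version.

    # Run from inside the container/pod:
    grep MemTotal /proc/meminfo                        # reports the whole node's RAM, not the pod limit
    cat /sys/fs/cgroup/memory/memory.limit_in_bytes    # cgroup v1: the limit the kernel will OOM-kill at
    cat /sys/fs/cgroup/memory.max                      # cgroup v2 equivalent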
[14:13:34] <_joe_> I don't think so
[14:13:54] <_joe_> we need to run the process with strace at the very least
[14:14:07] yes
[14:14:11] <_joe_> hnowlan: I'd start looking at the syscalls this thing does
[14:15:30] sigh, sounds reasonable
[14:16:39] the kill in k8s is a legit cgroup kill, seems perfectly reasonable: "[Mon Nov 28 17:45:27 2022] memory+swap: usage 921600kB, limit 921600kB, failcnt 6756"
[14:17:07] there's also a chance that there's a hugely increased memory use in xvfb in buster for some reason
[14:23:47] <_joe_> so yeah I'd first test with firejail in buster heh :)
[14:24:08] oh ffs
[14:24:23] so thumbor doesn't have a memory limit of 900MB, imagemagick does and I just completely misread the variables
[14:24:54] we specify a MemoryLimit in the systemd service based on the memory available on the host which effectively amounts to *9GB*
[14:25:30] interestingly on metal we still have an occasional kill for gifsicle exceeding that but that's not relevant here
[14:26:11] so ehhh... might need to limit the number of instances per pod, or make some exceptions for limits maybe
[14:26:42] I had the 3d2png command succeed in a pod of a single instance running with a 2GB memory limit but that new information seems to indicate that even that would potentially be constrained
[14:27:53] 9GB on the old instances, 18GB on the newer ones (because we do `(@memorysize_mb.to_i * 0.15).round`)
[14:27:59] ouch
[14:28:30] this is per thumbor instance then, right?
[14:28:51] so per thumbor container in the pod if translated to k8s
[14:29:20] yeah
[14:29:23] <_joe_> I think we might need to revise how we run thumbor then
[14:29:39] <_joe_> it makes sense if you can share that space between runners I guess
[14:29:45] I mean you practically *can* set limits that high (with reasonable requests) but you won't be guaranteed to have that memory available
[14:30:33] the space is practically shared joe
[14:30:46] <_joe_> ah so you mean just raise limits not requests
[14:30:49] <_joe_> yeah...
[14:31:06] might still fail from time to time then ofc.
[14:31:13] 10serviceops, 10Release-Engineering-Team, 10Scap: scap fails to sync some new hosts - https://phabricator.wikimedia.org/T324023 (10Jdforrester-WMF) Probably not a scap bug but a new server config issue, then.
[14:31:44] it does kinda make me want to jump ahead to have a single instance per pod with no haproxy though
[14:32:24] that would ofc. make the distribution way easier indeed
[14:32:42] 10serviceops, 10Release-Engineering-Team, 10Scap: scap fails to sync some new hosts - https://phabricator.wikimedia.org/T324023 (10Joe) Please note these are not new hosts, just one of the servers listed is. So I would say either a scap bug or a network issue.
[14:32:56] 10serviceops, 10Release-Engineering-Team, 10Scap: scap sync failure - https://phabricator.wikimedia.org/T324023 (10Joe)
[14:33:48] hnowlan: but IIRC single instance per pod without haproxy wasn't an option because of the blocking nature of thumbor, right?
[14:36:03] could we haproxy over N single instances instead?
[14:38:16] 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, and 2 others: byte/str mismatch TypeError when converting any STL file - https://phabricator.wikimedia.org/T323781 (10hnowlan) 05Open→03Resolved
[14:38:20] 10serviceops, 10SRE, 10Thumbor, 10Thumbor Migration, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan)
[14:45:30] jayme: yeah not without some more aggressive health checking I guess.
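For reference, the MemoryLimit arithmetic hnowlan quotes at 14:27 (`(@memorysize_mb.to_i * 0.15).round`, i.e. 15% of the host's RAM) works out roughly as follows; the host sizes below are illustrative, not measured.

    # ~64 GB bare-metal host -> roughly the "9GB" limit on the old instances
    memorysize_mb=64000
    echo $(( memorysize_mb * 15 / 100 ))    # 9600 MB
    # ~128 GB host -> roughly the "18GB" limit on the newer ones
    memorysize_mb=128000
    echo $(( memorysize_mb * 15 / 100 ))    # 19200 MB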
[14:46:21] but if we keep haproxy? and just haproxy over multiple pods instead of instances in the pod?
[14:50:03] as in have the haproxy running outside of k8s?
[14:50:32] could be still inside...but we would be having a deployment of haproxy and a deployment of thumbor
[14:59:35] ahh so 1:1, yeah, that's fine and doable
[15:01:05] not sure I understand. I was thinking a deployment of N haproxy pods fronting a deployment of the same N thumbor pods
[15:02:08] this is still confusing. let me try again: I was thinking a deployment of N haproxy pods fronting a deployment of the same set of thumbor pods
[15:04:34] aha. So haproxy instances are in their own pod, and each thumbor instance is a pod, in the same namespace?
[15:04:43] yes
[15:05:46] tbh I do not remember if that was ruled out initially
[15:12:21] It wasn't afaik :)
[15:19:24] 10serviceops, 10Infrastructure-Foundations: Load IP ranges in reverse-proxy.php from Netbox/Puppet network module - https://phabricator.wikimedia.org/T324020 (10Marostegui) I think this might be more specific for these two teams.
[15:21:11] if haproxy can discover backends via DNS, this should be doable by creating a headless service for the thumbor pods (by setting clusterIP: None). DNS will then return the IPs of the actual pods backing that service instead of a service IP
[15:22:02] ofc. that needs to be re-evaluated often in case a pod dies etc.
[16:36:13] 10serviceops, 10Machine-Learning-Team, 10Patch-For-Review: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (10JMeybohm) When deploying the updated calico chart to my test cluster I realized that the fixed dependency to the CRD chart does not mean tha...
[16:38:42] 10serviceops, 10Release-Engineering-Team (Priority Backlog 📥): Upgrade jwt-authorizer on all registry hosts - https://phabricator.wikimedia.org/T324037 (10dduvall)
[16:46:35] <_joe_> jayme: I'd rather have a couple haproxies fronting like 10 thumbor pods
[16:46:48] <_joe_> haproxy can scale 1Gx more than thumbor
[16:47:01] <_joe_> not sure that would do what you want though
[16:47:07] Yeah I was wondering why the 1-1 pattern was needed
[16:47:12] that's what I said, no?
[16:47:13] <_joe_> it's not
[16:47:25] <_joe_> jayme: I understood the exact opposite
[16:47:29] And how haproxy alleviates thumbor calls being blocking
[16:47:53] Yeah, same as _joe_, I understood 1 haproxy for 1 thumbor instance
[16:47:53] "I was thinking a deployment of N haproxy pods fronting a deployment of the same set of thumbor pods"
[16:48:23] N haproxies fronting all thumbor pods
[16:48:28] <_joe_> jayme: ok, no, I was saying you need 2 haproxies for N (with N > 100) thumbors
[16:48:46] <_joe_> "the same set" in this context seems to mean "the same number as"
[16:48:59] <_joe_> but you wanted to mean "the same set as previously sized"
[16:49:01] <_joe_> right?
[16:50:08] ok. then I was not able to communicate that clearly. I wanted to say N (N being > 2) haproxy instances should front all of the replicas we have for thumbor :)
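A hedged sketch of the headless Service jayme describes at 15:21: with clusterIP: None there is no virtual service IP, so a DNS lookup of the service name returns the individual thumbor pod IPs, which haproxy-style DNS service discovery could consume. Names, namespace, and port here are illustrative, not the real chart values.

    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: Service
    metadata:
      name: thumbor-headless
    spec:
      clusterIP: None        # headless: DNS returns the pod IPs directly
      selector:
        app: thumbor         # must match the thumbor pods' labels
      ports:
        - name: http
          port: 8800         # illustrative thumbor listener port
    EOF
    # From inside the cluster, the A records now point at the backing pods
    # (and need re-resolving as pods come and go, per jayme's caveat above):
    #   dig +short thumbor-headless.<namespace>.svc.cluster.local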
[16:50:37] <_joe_> so an haproxy ingress :P
[16:51:05] I wanted to make sure that there is no relationship between a haproxy instance and a subset of thumbors anymore in that case
[16:51:08] yes, exactly
[16:51:23] <_joe_> i mean it could make sense
[16:51:37] a pretty dedicated ingress but yea
[16:51:49] <_joe_> https://haproxy-ingress.github.io/ :P
[16:51:59] I wonder why we did not discuss that option initially
[16:52:06] yes yes, I know
[16:52:21] <_joe_> i wonder if there's an option in istio to do the same as haproxy does for thumbor
[16:52:23] we could also absolutely do that with istio ingress...just to be said
[16:52:45] <_joe_> basically we need to ensure we have 1 connection per backend pod
[16:53:26] <_joe_> thumbor is critical but not high volume, but I fear there's all kinds of tunables that are not there for thumbor in our istio
[16:53:31] What does haproxy do for thumbor, actually? Hold the backend-side connection while thumbor completes the job so the caller can move on?
[16:53:51] <_joe_> and avoid queueing requests on a busy thumbor worker
[16:54:30] That screams "pull-based workflow needed" (I know it's not the discussion but still)
[16:54:32] bit of header mangling as well
[16:56:00] <_joe_> claime: you mean queueing the thumb onto a queue and have a pool of thumbors pulling from there?
[16:56:10] _joe_: yeah
[16:56:12] yep, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/thumbor/templates/_haproxy.tpl it's very simple but very particular
[16:56:18] <_joe_> there's arguments for and against it
[16:56:45] for things like the gifsicle and 3d2png jobs in particular a queue based system makes a lot of sense
[16:57:07] hnowlan: because they're so long and resource-intensive?
[16:57:40] comparatively yeah. tbh *all* jobs could fit into a queue based system imo, as larger images can be the same
[16:57:43] <_joe_> hnowlan: meh, it's part of a sync request flow
[16:57:46] and if workers block when doing work anyway
[16:58:04] <_joe_> it's very hard to get it right in rewrite.py
[16:58:28] <_joe_> basically, swift expects a sync response
[16:58:51] <_joe_> so the right way to do this is indeed a system that sends back a sync response
[16:59:10] <_joe_> so the enqueueing can happen in haproxy or in a queue, doesn't change much tbh
[16:59:27] yeah fair
[16:59:30] <_joe_> because the queue should be behind haproxy and do classical transaction-queue work
[16:59:35] <_joe_> which haproxy already does
[16:59:38] both are bandages on thumbor's execution model
[16:59:41] fair enough
[16:59:42] <_joe_> yes
[17:00:34] as regards the istio path - is doing things like header mangling something we'd want to enable or is that offering too much complexity for what we want it to accomplish?
[17:01:06] I *think* that can already be done with istio CRDs
[17:01:33] things like maxconnection (per pod) def. can, using a DestinationRule
[17:02:22] and headers can be manipulated in the virtualservice
[17:02:43] ofc. that's nothing I have implemented in the basic ingress module
[17:02:55] so it's technically possible - how do we feel about it ideologically? I'm not aware of any services that do anything really complex using ingress at the moment
[17:03:19] and we would probably have to do extensive testing to see if it behaves like haproxy does then...
[17:03:59] that's the other thing... atm we only have a handful of services behind ingress
[17:04:03] nothing in prod path
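To make jayme's 17:01 point concrete, this is roughly the shape of an Istio DestinationRule exposing the connection-pool knobs in question. Whether these settings actually reproduce haproxy's one-in-flight-request-per-backend behaviour is exactly the open question in the chat; host, namespace, and values here are illustrative.

    cat <<'EOF' | kubectl apply -f -
    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: thumbor
    spec:
      host: thumbor.<namespace>.svc.cluster.local
      trafficPolicy:
        connectionPool:
          tcp:
            maxConnections: 1              # cap concurrent connections to the destination
          http:
            http1MaxPendingRequests: 1     # avoid queueing requests on a busy worker
            maxRequestsPerConnection: 1    # one request per connection (no keep-alive reuse)
    EOF
    # Header manipulation, if needed, would live in the matching VirtualService
    # (spec.http[].headers.request.set / .remove), not in the DestinationRule.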
[17:05:30] yeah, seems like it'd need some time dedicated to it either way
[17:06:56] indeed...whereas haproxy is kind of ready (and we know how it behaves).
[17:07:09] we could also do both...
[17:07:46] as we're in the lucky situation of having a metal fallback anyways
[17:08:30] <_joe_> look, the modules/ingress directory has space for a haproxy module :P
[17:09:09] ok before we add a full-fledged haproxy ingress, we should do it via istio :D
[17:09:55] Yeah, let's maybe not do another full catalog of available proxying solutions in our k8s ingress :D
[17:11:06] for the immediate term, is increasing limits (but not requests) to something higher off the table?
[17:11:26] if you're interested hnowlan we could probably forge an istio config in a couple of hours. you have time on fridays as I understood? :p
[17:12:00] haha
[17:12:17] whisky and ingress
[17:12:52] increasing the limits should be fine as well for the time being I would say
[17:13:03] so you don't get blocked by this
[17:14:09] but with a gazillion thumbor instances, we could maneuver ourselves into a bad situation scheduling-wise
[17:14:48] especially with those super-fat mw pods
[17:15:10] yeah I think the limits are a stopgap at best
[17:15:53] based on profiling current traffic those kinds of requests are quite rare, I had to make up my own calls to get stuff not in cache, asking for stuff like STL files as GIFs
[17:16:43] Stop mediawiki-shaming, they're not fat, they just have a lot of layers. Like an ogre. Or an onion.
[17:17:12] I meant resource-wise ... let's call it rich then. mw pods are resource-rich :)
[17:17:59] ;)
[17:25:02] 10serviceops, 10Machine-Learning-Team, 10Patch-For-Review: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (10JMeybohm) CC @BTullis & @bking: This might be relevant for operators as well.
[17:35:28] 10serviceops, 10Machine-Learning-Team, 10Patch-For-Review: Fix calico, cfssl-issuer and knative-serving Helm dependencies - https://phabricator.wikimedia.org/T303279 (10BTullis) Thanks @JMeybohm - I'll definitely bear that in mind. From my work so far with the spark-operator it seems that the operator //its...
[21:11:23] 10serviceops, 10Release-Engineering-Team, 10Scap: scap sync failure - https://phabricator.wikimedia.org/T324023 (10Urbanecm) This happened again during the UTC late B&C deploy window: ` 21:10:16 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'deploy2002.codfw...
[22:52:49] 10serviceops, 10Release-Engineering-Team, 10Scap: scap sync failure - https://phabricator.wikimedia.org/T324023 (10dancy) >>! In T324023#8428647, @Jdforrester-WMF wrote: > Probably not a scap bug but a new server config issue, then. A little of both in the end.
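For completeness, the "raise limits but not requests" stopgap discussed at 17:11 would look something like the resources stanza below: the scheduler places the pod based on the (modest) request, and the container can burst toward the higher limit for the occasional heavy render. The figures and key names are illustrative, not the real thumbor chart values.

    # illustrative helm values override; the real key names in the thumbor chart may differ
    cat > thumbor-memory-stopgap.yaml <<'EOF'
    resources:
      requests:
        memory: 1Gi      # what the scheduler reserves per thumbor container
        cpu: 1
      limits:
        memory: 4Gi      # burst headroom; the cgroup OOM-kill threshold moves up to here
        cpu: 1
    EOF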