[07:16:41] inflatador: o/
[07:17:33] qq to understand - I went to /srv/deployment-charts/helmfile.d/services/cirrus-streaming-updater# on deploy2002 and with helmfile -e staging diff I see a lot of resources to be applied. Can you expand on what issue you see with 'apply'?
[07:17:40] ---
[07:18:08] Folks, I was reviewing the KubernetesAPILatency alert, since it seems that sometimes it triggers for the ml and wikikube clusters
[07:18:17] temporarily - it always auto-resolves
[07:18:41] we alert on p95, for a period of 5m
[07:20:06] what I am wondering is - what do we want to catch? temporary spikes with a lot of namespaces and resources may be ok, and chasing down all perf issues may be a daunting task. I am not suggesting we avoid checking these issues (if any), but maybe we should reduce the noise and alert on sustained high p95 latency
[07:20:19] (and in our case, high latency is > 500ms IIUC)
[07:20:32] so we could allow for a 15m time period, for example
[07:20:59] we'd leave space for temporary spikes, but alert in case something sustained appears
[07:21:11] thoughts?
[07:38:54] I think we want to catch things like this https://phabricator.wikimedia.org/T348228
[07:43:26] jayme: ack yes, maybe then 500ms is too tight?
[07:44:09] I am trying to avoid a situation where we see multiple alerts that are known to be noise and we only investigate when something else seems correlated
[07:44:18] totally plausible... I created those numbers out of thin air, basically, because I had no idea myself and was unable to find any reference for what is "healthy"
[07:46:02] feel free to change them is what I want to say :)
[07:48:18] okok! I raised this point in here to discuss a suitable target
[08:04:27] I looked for "reasonable" numbers and couldn't find them :[
[08:13:36] we could start from the assumption that 500ms (for p95) is a value that shouldn't worry us, at least over a 5m range
[08:14:33] maybe we could have a "sustained" high latency alert (say 15/20m above 500ms) and another one to catch big spikes, p95 at 1s
[08:15:16] we apply both, see how they suit us, and review later on
[10:59:52] I think we should actually start scaling up our control planes a bit more aggressively
[11:00:20] Right now (since I scaled up the wikikube masters), they're working with 2 vCPU and 3GB RAM
[11:00:36] I think it may be a little short
[11:04:03] There may be something to dig into around etcd performance as well, but it gets thorny because they're VMs and already using plain disks, so I wonder how much of an impact we can have there without rethinking our control plane architecture away from VMs
[11:04:36] are you talking about etcd or the control planes?
[11:05:03] AIUI you figured that etcd was causing latency spikes in the API
[11:05:15] I think it is, I am not sure
[11:05:26] It would seem so
[11:06:26] My line of thinking is that the only way we'd get more perf for etcd would be to move it to bare metal, but it seems a bit wasteful to have dedicated bare metal just for the k8s etcd, so a possible plan would be to colocate the k8s cp and etcd on bare-metal servers
[11:06:41] Don't know if that makes sense?
[11:07:33] yes, that would make sense. What does not make sense is me reading your messages wrong :D
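[If etcd is suspected of driving the API latency spikes, its own disk-latency histograms are a reasonable first thing to graph. A minimal sketch, assuming the stock etcd metrics are scraped; the label set and the datasource on our Prometheus/Thanos setup may differ:]

    # p99 of etcd WAL fsync duration -- the etcd docs suggest keeping this
    # under ~10ms; sustained higher values point at slow disks (or busy VMs)
    histogram_quantile(0.99,
      sum by (le, instance) (
        rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
      )
    )

    # p99 of backend commit duration, the other usual disk-latency suspect
    histogram_quantile(0.99,
      sum by (le, instance) (
        rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])
      )
    )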
[11:07:54] lol
[11:08:40] what I'm not sure about is: are we in trouble, or is this "just alerting"?
[11:09:39] From what I'm seeing, at least on wikikube, a scap deployment triggers enough activity on the k8s cp to cause sustained ~800ms latency for almost 5 minutes
[11:09:40] (I must admit that I have not read everything carefully rn and have not thought about it in detail... still going through mails and stuff from last week :/)
[11:09:59] (after the cpu bump, and the etcd cpu+ram bump)
[11:11:04] I think it is "just alerting" in the sense that it doesn't necessarily put us in a tough spot or an outage situation, but it shows a performance issue at the very least
[11:11:32] And it's the kind of thing that will only get worse as we add nodes and pods
[11:12:06] I'd be against having etcd + cp on the same nodes, to be honest
[11:12:09] There's precious little guidance on kubernetes control plane resource requirements
[11:12:14] I like keeping concerns separated
[11:12:20] yeah, okay. I wanted to understand if we need to do something rn or if we can iterate on it
[11:12:22] and we can scale them up separately
[11:13:03] claime: before scaling up we need to figure out how bad 800ms of latency is for scap. Does it mean that scap is delayed by X amount of seconds because of it?
[11:13:11] if so it would make sense to scale it up
[11:13:22] otherwise it is a matter of figuring out what acceptable latency is
[11:13:23] elukey: No, scap returns when helmfile has finished applying
[11:13:52] It's the "action tail" after all of the API calls have been made that causes the latency
[11:14:13] (at least from what I gathered the other day)
[11:14:50] it is worth investigating; my point is to understand what the consequences are of having 5 mins of "high" latency (if we assume that ~800ms etc. is not good for our use case)
[11:15:02] +1
[11:15:05] we are not talking about webrequest perf, this is why I am asking
[11:15:24] elukey: If you look at https://phabricator.wikimedia.org/T348228#9228864, the bump starts right after my scap test run has finished
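[The ~800ms figure above is a p95 of apiserver request latency; a sketch of the kind of query that surfaces it, assuming the standard apiserver_request_duration_seconds histogram - the expression actually behind the KubernetesAPILatency alert and the dashboards may slice this differently:]

    # p95 of Kubernetes API request latency, excluding long-lived requests
    # (WATCH/CONNECT durations would otherwise dominate the histogram)
    histogram_quantile(0.95,
      sum by (le) (
        rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
      )
    )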
[11:15:52] claime: yep, I don't doubt it - did you read what I wrote above? :D
[11:16:03] that's my evidence for it being an action tail following the mass delete/recreate from scap
[11:16:20] elukey: yeah yeah, I was just writing up evidence while reading your message x)
[11:16:30] okok :)
[11:16:46] I agree that we've mostly been ignoring these alerts and it hasn't caused issues
[11:16:56] And so possibly they're too tight
[11:17:09] and to be clear, there may be a slowdown that causes trouble that we are not aware of
[11:17:34] but your example shows that we should care about sustained high latency
[11:18:04] 5min spikes are probably not a big issue, unless they exceed an acceptable limit
[11:19:00] Yeah, I think it's all right to have elevated latency after having ~80 replicas deleted and recreated
[11:19:36] It's shuffling a lot of stuff around
[11:21:20] I'm not 100% sure I got the same timeframe you posted the screenshots from, but looking at https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?orgId=1&from=1696516650684&to=1696519437811&var-datasource=thanos&var-site=codfw&var-cluster=k8s I also see a spike in 504's
[11:22:50] and something that looks like a control plane restart in eqiad during that time
[11:22:51] I think that's when I rebooted the etcd nodes and the masters
[11:22:56] ah
[11:22:57] Yep
[11:23:10] I bumped resources so I had to ganeti-reboot all the control planes
[11:23:32] what day are these from then? https://phabricator.wikimedia.org/T348228#9228864
[11:23:39] if you recall
[11:23:44] Same day, just after
[11:23:48] wait not
[11:23:50] no
[11:23:52] sorry
[11:24:01] (lunch, ttl!)
[11:24:44] I tried to match them to the scap SAL log, so 2023-10-04
[11:26:14] strange. there is nothing at 16:10Z there for me :D
[11:27:36] anyways... at least there are no 504's during those spikes. that is reassuring
[11:28:12] wait no, I'm really sorry, I have umpteen tabs open on the same ticket and I'm getting confused
[11:28:24] What you linked is the graph after my scap test
[11:28:42] Which was after the bump of cpu etc.
[11:29:11] grafana should really put the timestamps in the exported images
[11:29:26] (or I should think about linking the actual dash on top of the screenshot)
[11:30:02] that breaks at some point as well, unfortunately :)
[11:30:21] found it now, thanks
[11:31:03] Yeah, but at least I'd have the dates and times in the link x)
[11:31:13] indeed
[11:48:44] From most of what I'm reading, for ~50 hosts (which is what we have on wikikube), the recommendation is 4 vCPU and 16GB RAM
[11:49:12] And basically no matter what the size of the cluster is, no less than 4GB
[11:49:58] (for cp nodes)
[12:00:34] are those official values or something floating around?
[12:01:10] Floating around
[12:01:38] I can find no official values since they removed the recommendations for aws/gce
[12:01:48] yeah
[12:01:59] * claime lunch
[12:02:07] ditto
[13:25:25] The closest I can find to official recommendations for resources is the kubeadm install doc, which sets a bare minimum of 2 CPU and 2GB RAM
[13:32:36] filed https://gerrit.wikimedia.org/r/c/operations/alerts/+/964534/ for the alert :)
[13:32:47] basically just increasing the time window from 5m to 15m
[13:32:54] (keeping the 500ms threshold)
[13:34:27] (of course CI fails, fixing)
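[The change under review only widens the alert's time window; the overall shape discussed earlier in the day - a sustained alert at >500ms plus a separate spike alert at 1s - could look roughly like the sketch below. Metric, labels and summaries are illustrative only; the real KubernetesAPILatency definition lives in operations/alerts and may be expressed differently:]

    groups:
      - name: kubernetes_api          # illustrative group name
        rules:
          # Sustained: p95 above 500ms for 15 minutes straight.
          - alert: KubernetesAPILatency
            expr: |
              histogram_quantile(0.95, sum by (le, cluster, site) (
                rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
              )) > 0.5
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Sustained high Kubernetes API latency ({{ $labels.cluster }})"
          # Spike: p95 above 1s, allowed to fire quickly.
          - alert: KubernetesAPILatencySpike
            expr: |
              histogram_quantile(0.95, sum by (le, cluster, site) (
                rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
              )) > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Kubernetes API latency spike ({{ $labels.cluster }})"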
[13:41:10] elukey: It's the test values, if you need the pointer
[13:41:16] L30 and 32, I think
[13:41:46] claime: yeah, I am trying to decrypt what I need to add in there :D
[13:41:58] elukey: x16 instead of x11
[13:41:59] not 100% straightforward
[13:42:35] claime: what is the rationale?
[13:42:46] I'm trying to find the doc, just a sec
[13:43:01] But that's basically the length of your test timeseries
[13:44:24] ok, I need to work on this a bit more, will -1 myself and set it ready when I have something tested/meaningful.. thanks for the help :)
[13:44:42] elukey: https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/#series
[13:44:49] <3
[13:45:15] basically start + step x samples
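[The series notation in those docs is 'start+step x samples'. A minimal sketch of how a widened window might be exercised in a promtool rule unit test, assuming - purely for illustration - an alert that fires when a pre-aggregated p95 series stays above 0.5 for 15m; the file name, series name, labels and expected fields are not the actual contents of the CR:]

    rule_files:
      - kubernetes_api.yaml          # illustrative rule file name
    evaluation_interval: 1m
    tests:
      - interval: 1m
        input_series:
          # '0.6+0x16' = start at 0.6, step by 0, 16 more samples:
          # 17 one-minute samples of a flat 600ms p95, enough to cover a 15m 'for'
          # window (the 5m version could get away with the shorter 'x11' series)
          - series: 'kubernetes_api_latency_p95{cluster="wikikube", site="eqiad"}'   # illustrative series
            values: '0.6+0x16'
        alert_rule_test:
          - eval_time: 16m
            alertname: KubernetesAPILatency
            exp_alerts:
              - exp_labels:
                  cluster: wikikube
                  site: eqiad
                  severity: warning

[Files of this shape are what promtool test rules, and hence CI, evaluates.]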