[07:16:41] inflatador: o/
[07:17:33] qq to understand - I went to /srv/deployment-charts/helmfile.d/services/cirrus-streaming-updater# on deploy2002 and with helmfile -e staging diff I see a lot of resources to be applied. Can you expand on what issue you see with 'apply'?
[07:17:40] ---
[07:18:08] Folks, I was reviewing the KubernetesAPILatency alert, since it seems that sometimes it triggers for the ml and wikikube clusters
[07:18:17] temporarily - it always auto-resolves
[07:18:41] we alert on p95, for a period of 5m
[07:20:06] what I am wondering is - what do we want to catch? temporary spikes with a lot of namespaces and resources may be ok, and chasing down all perf issues may be a daunting task. I am not suggesting we avoid checking these issues (if any), but maybe we should reduce the noise and alert on sustained high p95 latency
[07:20:19] (and in our case, high latency is > 500ms IIUC)
[07:20:32] so we could allow for a 15m time period, for example
[07:20:59] we'd leave space for temporary spikes, but alert in case something sustained appears
[07:21:11] thoughts?
[07:38:54] I think we want to catch things like this https://phabricator.wikimedia.org/T348228
[07:43:26] jayme: ack yes, maybe then 500ms is too tight?
[07:44:09] I am trying to avoid a situation where we see multiple alerts that are known to be noise and we only investigate when something else seems correlated
[07:44:18] totally plausible... I created those numbers out of thin air, basically, because I had no idea myself and was unable to find any reference for what is "healthy"
[07:46:02] feel free to change them is what I want to say :)
[07:48:18] okok! I raised this point in here to discuss a suitable target
[08:04:27] I looked for "reasonable" numbers and couldn't find them :[
[08:13:36] we could start from the assumption that 500ms (for p95) is a value that shouldn't worry us, at least over a 5m range
[08:14:33] maybe we could have a "sustained" high latency alert (say 15/20m above 500ms) and another one to catch big spikes, p95 at 1s
[08:15:16] we apply both, see how they suit us, and review later on
[10:59:52] I think we should actually start scaling up our control planes a bit more aggressively
[11:00:20] Right now (since I scaled up the wikikube masters), they're working with 2 vCPU and 3GB RAM
[11:00:36] I think it may be a little short
[11:04:03] There may be something to dig into around etcd performance as well, but it gets thorny because they're VMs and already using plain disks, so I wonder how much of an impact we can have there without rethinking our control plane architecture away from VMs
[11:04:36] are you talking about etcd or the control planes?
[11:05:03] AIUI you figured that etcd was causing latency spikes in the API
[11:05:15] I think it is, I am not sure
[11:05:26] It would seem so
[11:06:26] My line of thinking is that the only way we'd get more perf for etcd would be to move it to bare metal, but it seems a bit wasteful to have dedicated bare metal just for the k8s etcd, so a possible plan would be to colocate the k8s cp and etcd on bare-metal servers
[11:06:41] Don't know if that makes sense?
[11:07:33] yes, that would make sense. What does not make sense is me reading your messages wrong :D
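[If etcd is suspected of driving the API latency spikes, its own disk-latency histograms are a reasonable first thing to graph. A minimal sketch, assuming the stock etcd metrics are scraped; the label set and the datasource on our Prometheus/Thanos setup may differ:]

    # p99 of etcd WAL fsync duration -- the etcd docs suggest keeping this
    # under ~10ms; sustained higher values point at slow disks (or busy VMs)
    histogram_quantile(0.99,
      sum by (le, instance) (
        rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
      )
    )

    # p99 of backend commit duration, the other usual disk-latency suspect
    histogram_quantile(0.99,
      sum by (le, instance) (
        rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])
      )
    )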
[11:07:54] lol
[11:08:40] what I'm not sure about is: are we in trouble, or is this "just alerting"?
[11:09:39] From what I'm seeing, at least on wikikube, a scap deployment triggers enough activity on the k8s cp to cause sustained ~800ms latency for almost 5 minutes
[11:09:40] (I must admit that I have not read everything carefully rn and have not thought about it in detail... still going through mails and stuff from last week :/)
[11:09:59] (after the cpu bump, and the etcd cpu+ram bump)
[11:11:04] I think it is "just alerting" in the sense that it doesn't necessarily put us in a tough spot or an outage situation, but it shows a performance issue at the very least
[11:11:32] And it's the kind of thing that will only get worse as we add nodes and pods
[11:12:06] I'd be against having etcd + cp on the same nodes, to be honest
[11:12:09] There's precious little guidance on kubernetes control plane resource requirements
[11:12:14] I like keeping concerns separated
[11:12:20] yeah, okay. I wanted to understand if we need to do something rn or if we can iterate on it
[11:12:22] and we can scale them up separately
[11:13:03] claime: before scaling up we need to figure out how bad 800ms of latency is for scap. Does it mean that scap is delayed by X amount of seconds because of it?
[11:13:11] if so it would make sense to scale it up
[11:13:22] otherwise it is a matter of figuring out what acceptable latency is
[11:13:23] elukey: No, scap returns when helmfile has finished applying
[11:13:52] It's the "action tail" after all of the API calls have been made that causes the latency
[11:14:13] (at least from what I gathered the other day)
[11:14:50] it is worth investigating; my point is to understand what the consequences are of having 5 mins of "high" latency (if we assume that ~800ms etc. is not good for our use case)
[11:15:02] +1
[11:15:05] we are not talking about webrequest perf, this is why I am asking
[11:15:24] elukey: If you look at https://phabricator.wikimedia.org/T348228#9228864, the bump starts right after my scap test run has finished
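[The ~800ms figure above is a p95 of apiserver request latency; a sketch of the kind of query that surfaces it, assuming the standard apiserver_request_duration_seconds histogram - the expression actually behind the KubernetesAPILatency alert and the dashboards may slice this differently:]

    # p95 of Kubernetes API request latency, excluding long-lived requests
    # (WATCH/CONNECT durations would otherwise dominate the histogram)
    histogram_quantile(0.95,
      sum by (le) (
        rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
      )
    )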
[11:15:52] claime: yep, I don't doubt it - did you read what I wrote above? :D
[11:16:03] that's my evidence for it being an action tail following the mass delete/recreate from scap
[11:16:20] elukey: yeah yeah, I was just writing up evidence while reading your message x)
[11:16:30] okok :)
[11:16:46] I agree that we've mostly been ignoring these alerts and it hasn't caused issues
[11:16:56] And so possibly they're too tight
[11:17:09] and to be clear, there may be a slowdown that causes trouble that we are not aware of
[11:17:34] but your example shows that we should care about sustained high latency
[11:18:04] 5min spikes are probably not a big issue, unless they exceed an acceptable limit
[11:19:00] Yeah, I think it's all right to have elevated latency after having ~80 replicas deleted and recreated
[11:19:36] It's shuffling a lot of stuff around
[11:21:20] I'm not 100% sure I got the same timeframe you posted the screenshots from, but looking at https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?orgId=1&from=1696516650684&to=1696519437811&var-datasource=thanos&var-site=codfw&var-cluster=k8s I also see a spike in 504's
[11:22:50] and something that looks like a control plane restart in eqiad during that time
[11:22:51] I think that's when I rebooted the etcd nodes and the masters
[11:22:56] ah
[11:22:57] Yep
[11:23:10] I bumped resources so I had to ganeti-reboot all the control planes
[11:23:32] what day are these from then? https://phabricator.wikimedia.org/T348228#9228864
[11:23:39] if you recall
[11:23:44] Same day, just after
[11:23:48] wait not
[11:23:50] no
[11:23:52] sorry
[11:24:01] (lunch, ttl!)
[11:24:44] I tried to match them to the scap SAL log, so 2023-10-04
[11:26:14] strange. there is nothing at 16:10Z there for me :D
[11:27:36] anyways... at least there are no 504's during those spikes. that is reassuring
[11:28:12] wait no, I'm really sorry, I have umpteen tabs open on the same ticket and I'm getting confused
[11:28:24] What you linked is the graph after my scap test
[11:28:42] Which was after the bump of cpu etc.
[11:29:11] grafana should really put the timestamps in the exported images
[11:29:26] (or I should think about linking the actual dash on top of the screenshot)
[11:30:02] that breaks at some point as well, unfortunately :)
[11:30:21] found it now, thanks
[11:31:03] Yeah, but at least I'd have the dates and times in the link x)
[11:31:13] indeed
[11:48:44] From most of what I'm reading, for ~50 hosts (which is what we have on wikikube), the recommendation is 4 vCPU and 16GB RAM
[11:49:12] And basically no matter what the size of the cluster is, no less than 4GB
[11:49:58] (for cp nodes)
[12:00:34] are those official values or something floating around?
[12:01:10] Floating around
[12:01:38] I can find no official values since they removed the recommendations for aws/gce
[12:01:48] yeah
[12:01:59] * claime lunch
[12:02:07] ditto
[13:25:25] The closest I can find to official recommendations for resources is the kubeadm install doc, which sets a bare minimum of 2 CPU and 2GB RAM
[13:32:36] filed https://gerrit.wikimedia.org/r/c/operations/alerts/+/964534/ for the alert :)
[13:32:47] basically just increasing the time window from 5m to 15m
[13:32:54] (keeping the 500ms threshold)
[13:34:27] (of course CI fails, fixing)
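[The change under review only widens the alert's time window; the overall shape discussed earlier in the day - a sustained alert at >500ms plus a separate spike alert at 1s - could look roughly like the sketch below. Metric, labels and summaries are illustrative only; the real KubernetesAPILatency definition lives in operations/alerts and may be expressed differently:]

    groups:
      - name: kubernetes_api          # illustrative group name
        rules:
          # Sustained: p95 above 500ms for 15 minutes straight.
          - alert: KubernetesAPILatency
            expr: |
              histogram_quantile(0.95, sum by (le, cluster, site) (
                rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
              )) > 0.5
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Sustained high Kubernetes API latency ({{ $labels.cluster }})"
          # Spike: p95 above 1s, allowed to fire quickly.
          - alert: KubernetesAPILatencySpike
            expr: |
              histogram_quantile(0.95, sum by (le, cluster, site) (
                rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
              )) > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Kubernetes API latency spike ({{ $labels.cluster }})"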
[13:41:10] elukey: It's the test values, if you need the pointer
[13:41:16] L30 and 32, I think
[13:41:46] claime: yeah, I am trying to decrypt what I need to add in there :D
[13:41:58] elukey: x16 instead of x11
[13:41:59] not 100% straightforward
[13:42:35] claime: what is the rationale?
[13:42:46] I'm trying to find the doc, just a sec
[13:43:01] But that's basically the length of your test timeseries
[13:44:24] ok, I need to work on this a bit more, will -1 myself and set it ready when I have something tested/meaningful.. thanks for the help :)
[13:44:42] elukey: https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/#series
[13:44:49] <3
[13:45:15] basically start + step x samples
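[The series notation in those docs is 'start+step x samples'. A minimal sketch of how a widened window might be exercised in a promtool rule unit test, assuming - purely for illustration - an alert that fires when a pre-aggregated p95 series stays above 0.5 for 15m; the file name, series name, labels and expected fields are not the actual contents of the CR:]

    rule_files:
      - kubernetes_api.yaml          # illustrative rule file name
    evaluation_interval: 1m
    tests:
      - interval: 1m
        input_series:
          # '0.6+0x16' = start at 0.6, step by 0, 16 more samples:
          # 17 one-minute samples of a flat 600ms p95, enough to cover a 15m 'for'
          # window (the 5m version could get away with the shorter 'x11' series)
          - series: 'kubernetes_api_latency_p95{cluster="wikikube", site="eqiad"}'   # illustrative series
            values: '0.6+0x16'
        alert_rule_test:
          - eval_time: 16m
            alertname: KubernetesAPILatency
            exp_alerts:
              - exp_labels:
                  cluster: wikikube
                  site: eqiad
                  severity: warning

[Files of this shape are what promtool test rules, and hence CI, evaluates.]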