[06:47:05] (CR) Kevin Bazira: [C: +1] revscoring: fix exception handling in fetch_features [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939321 (owner: Elukey)
[06:56:59] (CR) Ilias Sarantopoulos: [C: +1] revscoring: fix exception handling in fetch_features [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939321 (owner: Elukey)
[06:57:51] hello folks :)
[06:57:57] (CR) Elukey: [C: +2] revscoring: fix exception handling in fetch_features [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939321 (owner: Elukey)
[06:58:32] o/
[07:04:22] (Merged) jenkins-bot: revscoring: fix exception handling in fetch_features [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939321 (owner: Elukey)
[07:11:22] Machine-Learning-Team, MediaWiki-extensions-ORES: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (matej_suchanek)
[07:53:16] isaranto: o/
[07:53:54] o/
[07:54:06] I was reviewing the aiohttp client session docs and IIUC, by default, when a new connection is opened the Keep-Alive http header is added
[07:54:37] in our case, we tore down the connection every time via the context manager, but I am wondering if this is causing envoy to keep some broken connections around
[07:55:01] it is a stretch but I'd test it to see if anything improves
[07:55:51] basically adding connector=aiohttp.TCPConnector(force_close=True)
[07:56:21] I'd also set use_dns_cache=False
[07:56:45] Morning!
[07:56:53] morning :)
[07:57:01] elukey: alert is firing for ml-s-1007's mgmt card :-/ Investigating
[07:57:13] ack
[08:02:33] elukey: ack
[08:02:54] Ah, DCOps already has a ticket with a whole load of hosts, all in rack F2. I suspect a switch there fell over.
[08:03:05] I'll subscribe to that ticket and keep track of it
[08:03:37] (PS1) Elukey: ores-legacy: reduce connection handling for aiohttp [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939626 (https://phabricator.wikimedia.org/T341479)
[08:03:48] klausman: nice!
[08:03:53] ms1005 has high latency, might that be related to the stuck-ness of pods yesterday?
[08:03:53] isaranto: --^
[08:04:20] since when? Some minutes or more?
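A minimal sketch of the connector change proposed at 07:55/07:56 above, assuming the session is still created and torn down per request as before; the URL is a placeholder:

```python
# Sketch only: no keep-alive and no DNS cache, so no pooled socket can go
# stale behind the envoy sidecar. The URL is a placeholder.
import asyncio
import aiohttp


async def fetch_status(url: str) -> int:
    connector = aiohttp.TCPConnector(
        force_close=True,      # close the socket after each request instead of reusing it
        use_dns_cache=False,   # re-resolve the target for every new connection
    )
    # The context manager still tears the session down afterwards, as before.
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get(url) as resp:
            return resp.status

# e.g. asyncio.run(fetch_status("https://example.org/healthz"))
```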
[08:04:34] 5-10m ago
[08:04:41] on it
[08:04:49] https://alerts.wikimedia.org/?q=alertname%3DKubeletOperationalLatency&q=team%3Dsre&q=%40receiver%3Ddefault
[08:04:52] sry I mean Luca's patch
[08:04:58] sometimes it is transient, shouldn't be related to yesterday
[08:05:12] but you can check what's wrong, maybe it is a specific thing that doesn't work
[08:05:15] :)
[08:05:42] (my suspicion is that the alerts are a little sensitive)
[08:05:46] I will do some looking around, that will surely make it go away :)
[08:07:27] * elukey bbiab
[08:15:50] (CR) Ilias Sarantopoulos: [C: +1] ores-legacy: reduce connection handling for aiohttp [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939626 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[08:19:19] (CR) Ilias Sarantopoulos: [C: +1] ores-legacy: reduce connection handling for aiohttp (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939626 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[08:22:39] I am attempting a refactoring of the inf services repo so that we can run model servers without packaging in docker, so that we can try stuff out fast
[08:23:06] I'll abandon if it requires more than 1-2 h of work
[08:28:29] (CR) Elukey: ores-legacy: reduce connection handling for aiohttp (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939626 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[08:28:31] (CR) Elukey: [C: +2] ores-legacy: reduce connection handling for aiohttp [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939626 (https://phabricator.wikimedia.org/T341479)
[08:29:23] (Merged) jenkins-bot: ores-legacy: reduce connection handling for aiohttp [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939626 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[08:30:03] I hope u don't hate me after this :)
[08:33:16] never <3
[08:37:06] elukey: so I have not found yet what makes 1005 slow, but it's not the only kubelet that has rather suddenly started slowing down starting yesterday noon UTC
[08:37:16] (possibly earlier, it's hard to see
[08:37:18] )
[08:37:27] https://grafana.wikimedia.org/goto/CDDRpwjVz?orgId=1
[08:47:20] isaranto: it seems to have improved things a lot!
[08:47:40] aha! that's nice!
[08:47:53] is it deployed? can I check?
[08:48:12] isaranto: in staging yes
[08:48:33] did u see the same issues in staging before?
[08:48:41] I did yes
[08:48:46] ok!
[08:48:52] I tested it only with goodfaith and a lot of rev ids
[08:49:04] my test call now is better
[08:49:13] not sure about the huge one in the task though, haven't tried it yet
[08:49:45] klausman: nice finding! Does it match with us deploying? Because we rolled out the new concurrency stuff yesterday IIRC for knative
[08:50:50] yeah 1005 seems the most affected one
[08:51:02] lemme check the eqiad pods
[08:53:17] klausman: ah wow on 1005 dmesg looks really messy
[08:53:19] sigh
[08:56:55] isaranto: super big queries still lead to some 503s
[08:57:18] but it seems that we are doing better afaics
[08:57:31] elukey: yeah, lots of OOM kills.
[08:57:58] Though the machine is not under memory pressure right now.
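For context, a sketch of the kind of "goodfaith and a lot of rev ids" test call mentioned around 08:48, in the ORES-style /v3/scores format that ores-legacy mimics; the hostname and revision IDs are placeholders, not the actual query from the task:

```python
# Sketch of a batch goodfaith test call; endpoint and rev IDs are assumptions.
import asyncio
import aiohttp

ORES_LEGACY = "https://ores-legacy.wikimedia.org"   # assumed endpoint
REV_IDS = [1160000000 + i for i in range(50)]        # ~50 arbitrary revision IDs


async def big_goodfaith_query() -> dict:
    params = {"models": "goodfaith", "revids": "|".join(map(str, REV_IDS))}
    async with aiohttp.ClientSession() as session:
        async with session.get(f"{ORES_LEGACY}/v3/scores/enwiki", params=params) as resp:
            return await resp.json()

# e.g. asyncio.run(big_goodfaith_query())
```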
[08:58:11] I wonder if the OOM kills got it into a wonky state
[08:58:22] yeah I'd restart the kubelet to be honest
[08:58:33] already done that, also kubeproxy, no effect
[08:59:25] let's be careful in restarting those daemons, kubelet should be safe to restart, not sure if kubeproxy needs a drain first etc..
[08:59:29] Also note the ethernet flaps
[09:00:23] kubeproxy is fine to restart as well, AIUI
[09:01:09] Also, it seems the other kubelets are starting to go up in latency as well right now
[09:01:20] there are also SATA link down events
[09:02:54] those are boot messages
[09:02:57] nothing reported in racadm's getsel
[09:03:28] The machine booted on May 19, 13:53, and the SATA link messages are from that time (dmesg -T)
[09:03:40] ahhh right my pebcak, I've read 19 and not the month
[09:04:16] NETDEV_CHANGE _may_ be caused by external factors, but I've never seen a link flap
[09:04:46] oh dear, the API latency is skyrocketing. ?1s now
[09:04:51] >1s
[09:08:03] ?
[09:08:14] https://grafana.wikimedia.org/goto/TdBU2Qj4k?orgId=1
[09:08:21] It has since recovered.
[09:08:28] But likely not completely
[09:08:38] that's not the API though :)
[09:08:56] ok, op latency
[09:09:25] ok so let's start from what we know
[09:09:43] some kubelets started to show huge latencies since yesterday around 9:20 UTC
[09:09:46] what changed?
[09:09:54] Starting yesterday around/just before noon, Kubelet op latency is much higher than before.
[09:09:55] I'm seeing some issues on the revscoring-editquality-goodfaith namespace on eqiad. There are some enwiki pods that fail with a backoff. It seems the same issue as yesterday: they are using an older revision
[09:10:51] ok let's also check the events there, maybe k8s is giving us some indication
[09:10:51] There are some pods that have ~200 restarts, and one is in crashloop backoff (6200 restarts)
[09:11:01] this is more promising, what pods?
[09:11:13] nllb-200-predictor-default-00005-deployment-6c68c75bcf-chtnc is the crashloop backoff one
[09:11:34] The rest are all enwiki-goodfaith-predictor-default
[09:11:46] one enwiki-drafttopic-predictor-default is in terminating state
[09:12:09] All pods are on diverse machines, except the nllb and one of the goodfaith that are both on 1001
[09:12:39] the most worrying ones are the istio gateway restarts in my opinion
[09:13:15] Yes, those are slower (freq.), but more impactful
[09:13:16] klausman: let's start cleaning up the goodfaith ones as isaranto suggested
[09:14:34] isaranto started the deployments at around 09:16 UTC yesterday, so the new code is definitely related https://sal.toolforge.org/log/SJ1JaIkBxE1_1c7scKjt
[09:15:21] and most of the high latencies (as seen in https://grafana.wikimedia.org/d/000000472/kubernetes-kubelets?var-cluster=eqiad%20prometheus%2Fk8s-mlserve&orgId=1&from=now-2d&to=now&refresh=30s&viewPanel=25) seem to be related to "stop container"
[09:15:59] TIL about sal.toolforge.org
[09:16:30] I've taken a quick look at the istio pod logs, but nothing immediately obvious is wrong there
[09:17:29] What's the cleanup procedure for the goodfaith pods?
[09:18:13] klausman: this is the corner case of old knative revisions getting stuck, it may be a bug in our version. What I usually do is "kubectl get revision -n $namespace" and then delete the corresponding one
[09:18:24] you can find the messy revision by checking the name of the pod
[09:19:17] I see 8, 9 and 10
[09:19:40] But only 9 and 10 have desired replicas=4
[09:19:47] 9 is the messy one in this case
[09:20:05] on ml-serve1005
[09:20:05] Jul 19 09:16:14 ml-serve1005 kubelet[3136643]: E0719 09:16:14.417212 3136643 remote_runtime.go:479] "StopContainer from runtime service failed" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" containerID="6b89b3c157d64bbec91e96e7d7b9c6c51ac47a2120719906fda0bf487b308396"
[09:20:09] that is probably the issue. 9 shouldn't have desired replicas
[09:20:48] So `kubectl delete revision -n revscoring-editquality-goodfaith enwiki-goodfaith-predictor-default-00009` ?
[09:21:01] yep
[09:21:14] Ok, it's gone
[09:21:37] It's now in state terminating
[09:22:14] One pod left
[09:24:53] The question is: how did we get into this state, and why does it wreak havoc on kubelet op latency
[09:25:27] (the latter is probably due to the DELETE being slow, and frequent crashlooping making it the dominating operation, statistically)
[09:26:12] we know from the graph above that the op latency issues are related to the stop container
[09:26:26] (last pod gone)
[09:26:53] The op latency graph _looks_ recovered, but it will take another 15m-30m to be sure.
[09:26:58] isaranto: nllb status looks weird, seems to be failing to bootstrap with the GPU, any experiment ongoing over there?
[09:28:00] need to run for an errand, bbl!
[09:30:08] I'm taking a look, don't really remember the state I left it in
[09:33:21] (false alert, back :)
[09:34:36] There are two revisions for nllb: rev 2 and rev 5
[09:34:38] yes, I had left nllb hanging while experimenting with quantization
[09:34:39] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/939637
[09:34:43] rev2 is running, rev5 is in Error
[09:34:57] I set quantization to false and also removed the gpu for now
[09:35:43] sry folks totally forgot about that one
[09:36:21] np, it is surely not the problem
[09:36:53] it started when you did all deployments yesterday, so I guess that it is either the knative change or maybe the broad deployments that set some weird bug off
[09:41:53] klausman: as you mentioned latencies seem to have recovered a bit, I still can't explain the OOMs that we found though, maybe unrelated but worrying
[09:41:58] and also all the gateway restarts
[09:44:10] I deployed the new changes in nllb
[09:44:52] (PS1) Elukey: ores-legacy: set connection limit to 0 (unlimited) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939639 (https://phabricator.wikimedia.org/T341479)
[09:45:13] it is running fine now but that also has some old revision, along with bloom-560
[09:45:25] elukey: yeah, Latency looks good. As for the OOMs, I'll see if there's a pattern to the processes killed, and if it happened on any of the other machines
[09:46:18] (PS2) Elukey: ores-legacy: set connection limit to unlimited [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939639 (https://phabricator.wikimedia.org/T341479)
[09:46:47] (CR) Ilias Sarantopoulos: [C: +1] ores-legacy: set connection limit to unlimited [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939639 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[09:47:08] (CR) CI reject: [V: -1] ores-legacy: set connection limit to unlimited [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939639 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[09:47:38] uff I ran tix
[09:47:40] *tox
[09:47:58] (PS3) Elukey: ores-legacy: set connection limit to unlimited [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939639 (https://phabricator.wikimedia.org/T341479)
[09:48:28] All machines have some OOMs in their dmesg, typically 2-3 a day, and for the majority of them the process name is envoy
[09:49:10] thanks for the review isaranto <3
[09:49:21] (CR) Elukey: [C: +2] ores-legacy: set connection limit to unlimited [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939639 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[09:49:33] A typical dmesg entry is `[Wed Jul 19 08:58:53 2023] Memory cgroup out of memory: Killed process 4057687 (envoy)`
[09:49:51] plus some memstats. Unfortunately, that doesn't tell us what specific envoy/pod it was
[09:50:12] (Merged) jenkins-bot: ores-legacy: set connection limit to unlimited [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939639 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[09:50:19] klausman: https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=istio-system&var-pod=istio-ingressgateway-c7z5j&var-container=All
[09:50:23] sigh
[09:50:32] it takes nothing to get ooms in this stage
[09:50:37] *state
[09:50:50] the limit is too low
[09:50:52] Question is: are we under-provisioning or is there a memory problem in istio
[09:51:36] Looking at the history, istio goes from start to 90+% of mem usage very quickly. But modern Go programs tend to do that.
[09:51:57] (assuming they are set up that way)
[09:52:20] elukey: do u think 503 errors will go away by setting limit to inf? I doubt that we reached the limit of 100
[09:52:43] regardless though it is a good change - to not control things on lw side
[09:52:43] elukey: should we bump the memory allowance for the ingress GWs?
[09:52:52] klausman: +1 yes
[09:52:59] I'll make a patch
[09:53:43] isaranto: so the only 503s that I get are when I issue the big request that you added in the task, and judging from the code we loop through models and rev_ids within the aiohttp Client session
[09:54:18] rev-ids * 4 models is more than 100
[09:54:33] does it make sense?
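To put numbers on the 09:53-09:54 exchange: aiohttp's TCPConnector defaults to limit=100 simultaneous connections per ClientSession, so a ~200-request fan-out (49 rev IDs x 4 models) through a single session can exhaust it, and limit=0 removes the cap, which is what the patch above does. A sketch only, with placeholder host and payload, not the actual ores-legacy code:

```python
# Illustration of the fan-out: ~200 concurrent calls through one session.
# Host, path and payload are placeholders; model names follow the enwiki-* pattern.
import asyncio
import aiohttp

MODELS = ["goodfaith", "damaging", "articlequality", "drafttopic"]
REV_IDS = list(range(1160000000, 1160000049))        # 49 arbitrary revision IDs


async def score(session: aiohttp.ClientSession, model: str, rev_id: int) -> dict:
    url = f"https://liftwing.example/v1/models/enwiki-{model}:predict"  # placeholder host
    async with session.post(url, json={"rev_id": rev_id}) as resp:
        return await resp.json()


async def fan_out() -> list:
    # limit=0 lifts the default per-session cap of 100 simultaneous connections.
    connector = aiohttp.TCPConnector(limit=0, force_close=True)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [score(session, m, r) for m in MODELS for r in REV_IDS]
        return await asyncio.gather(*tasks, return_exceptions=True)

# e.g. asyncio.run(fan_out())
```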
[09:55:32] yes, but they go to different pods/model-servers
[09:56:33] sure sure, but limit's doc says "total number simultaneous connections", IIUC this is a setting for each ClientSession
[09:56:47] so it is agnostic to the targets
[09:56:59] what it cares about is how many tcp connections we open
[09:57:06] nevermind I got confused
[09:57:20] nono please let's discuss, I am not 100% sure yet
[09:57:23] I was thinking server side (lw)
[09:57:33] ahh okok
[09:57:41] I think we should apply these settings to all model servers
[09:57:51] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/939640/
[09:58:04] I mean it makes sense all connections are from ores-legacy. I see 49 revids and 4 models ~=200 requests
[09:58:13] yeah exactly
[09:58:38] and a brutal 503 seems to fit the picture if you reach that number
[09:58:58] but! shouldn't we have the same issue when I run this on statbox?
[09:59:48] in theory no since you don't have the envoy local proxy
[09:59:59] you connect directly to lift wing, so no interference
[10:00:05] and aiohttp is free to handle its conn
[10:00:14] (the 503s come from the ores-legacy's tls proxy)
[10:01:19] ok ok, thnx for clarifying
[10:02:06] elukey: I am having trouble finding the right file/spot for the quota increase. helmfile.d/admin_ng/values/ml-serve.yaml seemed right, but I'm not sure anymore
[10:02:34] klausman: ah yes with istio it is a little bit more complex, since we use a separate manifeste
[10:02:40] *manifest
[10:02:56] it is in custom.d, and I think there should be an option to tune the memory for the gateway pods
[10:03:01] if not we'd need to check in their docs
[10:03:55] custom_deploy.d?
[10:04:19] yes exactly
[10:04:44] I am going to lunch with Filippo today, so I am going afk earlier for lunch, ttl!
[10:04:49] * elukey lunch
[10:05:00] \o do say hello to Filippo :)
[10:50:23] Aaand latency on 1005 (but only there) is up again. Investigating.
[10:53:40] Something weird is going on with enwiki-drafttopic-predictor
[10:54:13] desired replicas is 1, but there is one running (25h old) and another terminating. And a few mins ago there was yet another in init.
[10:54:27] The currently terminating one is 2m45s old.
[10:54:34] And they're all rev 9
[10:54:50] (I am focusing on that pod since it's running on 1005)
[10:57:37] aand now there are three again. Three pods, that is, not 3/3 containers.
[11:01:46] elukey: See above. Something weird is going on there. There is a 25h old pod in steady state, and then two more that are cycling through init/termination all the time. They're all rev 9, so it isn't the same stuff we had earlier.
[11:03:14] oooh, there is a desired replicas=3
[11:03:33] And now we have three running and a fourth terminating.
[11:03:41] Is this maybe autoscaling being twitchy?
[11:04:09] Yes, desired replica is going up and down a lot.
[11:11:38] I see they are all from the latest revision. desired replica shouldn't be 3 I guess
[11:11:47] https://www.irccloud.com/pastebin/tP9iv0P2/
[11:11:57] I think it might be autoscaling
[11:12:24] what happens if you edit the inference service for drafttopic and set maxReplicas to 1? can u check if that would resolve it?
[11:12:26] if it is, maybe it should keep started pods around for longer
[11:13:11] The other question is if this up/down of replicas is actually a problem (and if the op latency alert is too sensitive)
[11:13:30] I'll wait for Luca to come back before I fiddle with things. Also: lunch :)
[11:14:13] ack
[11:14:23] I think the issue may come from `autoscaling.knative.dev/target: "3"`
[11:19:04] although again this refers to the number of concurrent requests that should trigger a scaleup
[11:19:58] it is quite confusing and I forgot even what we were discussing yesterday :P
[11:22:17] concurrent requests vs replicas. So knative will create a new pod if we have more than 3 concurrent requests in 1 pod, and that will happen max until we reach 3 pods. all set by these annotations
[11:22:17] ```
[11:22:17] autoscaling.knative.dev/max-scale: 3
[11:22:17] autoscaling.knative.dev/min-scale: 1
[11:22:17] autoscaling.knative.dev/target: 3
[11:22:18] ```
[11:22:39] I'm just writing down my conclusions to make sure I'm getting it right
[11:22:48] * isaranto goes for lunch
[11:25:37] It sounds about right
[11:57:10] checking
[12:00:02] klausman: one useful source of info is "kubectl get events -n etc.."
[12:00:49] ah, good point
[12:01:02] some issues seem to be caused by the readiness probe getting a 503
[12:02:22] I made a patch for the istio memory, feel free to review and merge, I gtg to a doc appt. Should be back before the meetings
[12:02:55] ack
[12:25:03] applying your changes :)
[12:27:02] all right all good, I see the changes applied etc..
[12:27:16] so we'll keep the restarts monitored
[12:56:07] isaranto: I think I was doing wrong tests before, the situation is a little better now but the big query in staging fails with a lot of timeouts
[12:56:10] sigh
[12:56:29] * isaranto sighs as well
[12:58:02] I forgot about https://logstash.wikimedia.org/app/dashboards#/view/138271f0-40ce-11ed-bb3e-0bc9ce387d88?_g=h@c823129&_a=h@780154f, but we have access logs breakdown at the istio gateway level
[12:59:58] if I isolate the Ores legacy UA, I don't see many requests to the istio gateway pods in ml-serve-codfw (we use the tls local proxy that calls production, even from staging)
[13:00:09] and all of them are 200
[13:00:15] isaranto: hey, which ores stuff is blocked on me?
[13:00:33] (i.e. what can I do to help?)
[13:01:34] Amir1: I'm planning on rolling this out to all wikis, leaving enwiki last.
[13:01:59] I can create the patch(es) and you could deploy since I can't do that part
[13:03:56] and we need a checklist for things to check after each deployment to make sure things work. e.g. run a job on mwmaint, check database entries, recentchanges filters etc
[13:03:57] wdyt?
[13:10:21] sure, let's start with one small wiki and move forward
[13:14:20] (PS1) Ilias Sarantopoulos: refactor: allow relative imports in repo [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939683
[13:14:27] (CR) CI reject: [V: -1] refactor: allow relative imports in repo [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939683 (owner: Ilias Sarantopoulos)
[13:15:54] the above is WIP, plz disregard, I have an issue with appending to PYTHONPATH with blubber and reached out to releng
[13:18:14] Amir1: any good candidates from the table here -> https://phabricator.wikimedia.org/T342115?
[13:19:32] first eswikibooks and eswikiquote. Then hewiki and itwiki (they are group1, as early adopters), ...
[13:28:14] isaranto: trying to roll back the limit change, and see if it improves
[13:28:18] (only in deployment-charts)
[13:28:54] ok!
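Side note on the 13:15 PYTHONPATH/blubber issue: the refactor is about being able to start a model server straight from the repo, without a Docker build. A hypothetical sketch under that assumption; the model class is a stub, not the repo's actual layout, and kserve is assumed to be installed locally:

```python
# Hypothetical local entry point: put the repo root on sys.path (what the
# blubber PYTHONPATH change would do inside the image) and start a kserve
# model server directly, skipping the Docker build.
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent))  # assume we sit at the repo root

import kserve


class StubModel(kserve.Model):
    """Stand-in for a real revscoring model server."""

    def __init__(self, name: str):
        super().__init__(name)
        self.ready = True

    def predict(self, payload: dict, headers: dict = None) -> dict:
        # A real server would score payload["rev_id"] here.
        return {"predictions": []}


if __name__ == "__main__":
    kserve.ModelServer().start([StubModel("enwiki-goodfaith")])
```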
[13:57:43] https://www.irccloud.com/pastebin/UTL7uNMM/
[14:38:59] elukey: for reference, get events cmdline that works better than defaults: kubectl get events -w -o custom-columns=Time:.firstTimestamp,Component:.source.component,Type:.type,Reason:.reason,Message:.message -n revscoring-drafttopic
[14:45:52] okok thanks
[14:54:41] created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/939719 and next I'll add the scale-down delay to knative
[14:54:44] :)
[14:54:54] (already applied it manually in eqiad)
[14:55:11] latencies are still weird, but one pod is in terminating state
[15:07:50] LGTM for 939719
[15:18:15] lol I just realized from the ci's diff that we have some logging config
[15:18:24] never really applied since "config_logging" wasn't there
[15:18:39] basically
[15:18:40] /home/elukey/Wikimedia/deployment-charts/charts/knative-serving/values.yaml
[15:18:43] sorry
[15:18:46] config_logging:
[15:18:46] loglevel.controller: "warn"
[15:18:47] loglevel.autoscaler: "warn"
[15:18:47] etc..
[15:19:04] it is in the knative chart's values.yaml already
[15:19:31] it should be ok to apply, so we remove some weird logs
[15:23:08] Machine-Learning-Team: use wikiID in inference name on LW for revscoring models - https://phabricator.wikimedia.org/T342266 (isarantopoulos)
[15:27:37] elukey: nice catch
[15:28:57] klausman: nice pebcak of Luca from the past I'd say :D
[15:29:18] pretty sure I added it
[15:29:21] sigh
[15:29:26] it's all glass-half-empty vs full etc :)
[15:29:55] sure sure :)
[15:30:02] I am deploying the knative changes to all clusters
[15:31:47] :+1:
[15:32:28] aaand it fails
[15:33:07] afaics changing the log settings to all pods knocks the webhook down, so the new yaml is not validated
[15:33:11] and it is rejected
[15:33:22] * elukey cries in corner
[15:33:52] That seems like a serious design bug
[15:38:44] filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/939733
[15:38:48] as workaround
[15:38:51] (for the moment)
[15:39:31] lgtm'd
[15:49:24] danke
[16:03:49] Machine-Learning-Team: Append PYTHONPATH in blubber - https://phabricator.wikimedia.org/T342273 (isarantopoulos)
[16:04:16] o/ I added a task for a blubber improvement relevant to the refactoring I tried today
[16:11:10] nice!
[16:15:49] I've already opened the MR, but I'll just test it first https://gitlab.wikimedia.org/repos/releng/blubber/-/merge_requests/49
[16:30:07] klausman: ahhh ok the webhook thing is due to
[16:30:07] Readiness probe failed: HTTP probe failed with statuscode: 503
[16:30:15] so the new pods are probably slow to come up
[16:31:32] Liveness: http-get https://:8443/ delay=120s timeout=1s period=1s #success=1 #failure=6
[16:31:35] Readiness: http-get https://:8443/ delay=0s timeout=1s period=1s #success=1 #failure=3
[16:31:38] yeah this needs to change
[16:31:50] Ah, so we need to increase delay (and possibly timeout) for readiness?
[16:31:57] exactly yes
[16:34:07] I'm battling with helm in this patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/939744
[16:34:40] I'm logging off and will look at it with fresh eyes and mind tomorrow
[16:34:57] logging off for the day
[16:35:45] me too :)
[16:35:52] \o
[16:38:08] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/939745 should fix the readiness probe
[16:39:13] That verified+1 is a misclick, but LGTM!
[16:48:03] so weird, still doesn't work
[16:48:13] I'll try to work on it tomorrow :)
[16:48:14] o/