[08:13:11] cdanis: in case you did not figure out already: that usually means that containers do not start (crashloop) and/or defined readiness probes don't go green within time. You may check the event logs in the related namespace for details
[11:32:41] jayme: yeah, I checked the event log and didn't see anything other than notices and a few warnings. Also, the error happened after fully upgrading all the pods. But I'll try again once I'm online in an hour or so
[13:23:21] cdanis: lmk if you'd like me to take a look
[13:25:07] jayme: oddly, there's no longer a diff on codfw
[13:25:17] oopsie
[13:25:27] I don't know what happened :)
[13:25:51] does helmfile `--debug` change behavior?
[13:26:01] I guess this was about the otel collector?
[13:26:03] yes
[13:26:33] AFAIK --debug does not change behavior, it just produces a wall of text
[13:26:38] or rather, a wall of yaml :)
[13:27:05] well, it seemed to have; I didn't get a rolling update or anything
[13:27:06] but has your change been applied to codfw?
[13:28:07] I don't see a diff for eqiad either
[13:29:13] yeah, I ran --debug apply on eqiad, and uh, there is no more diff afterwards either
[13:29:17] and it completed very quickly
[13:29:21] ah, you did apply there as well. So what you're saying is it did not work without --debug but it did with --debug
[13:29:28] hmm
[13:29:34] maybe --debug disables --atomic
[13:29:45] at the same time, we don't have any traces with the fixed service names
[13:29:48] which did happen yesterday
[13:31:10] oh
[13:31:15] it's still in progress
[13:31:20] main-opentelemetry-collector-agent 187 187 186 60 186 306d
[13:31:23] yeah
[13:31:27] I just saw that as well
[13:31:42] so I'd say --debug disables --wait and/or --atomic
[13:32:08] it was --atomic causing the issue yesterday btw, after I watched all ~187 pods restart in the events
[13:37:19] so I guess the next question is: what events are sufficient for helm to have considered the release failed?
[13:37:29] 53s Warning FailedToUpdateEndpoint endpoints/main-opentelemetry-collector Failed to update endpoint opentelemetry-collector/main-opentelemetry-collector: Operation cannot be fulfilled on endpoints "main-opentelemetry-collector": the object has been modified; please apply your changes to the latest version and try again
[13:37:33] 35s Warning FailedKillPod pod/main-opentelemetry-collector-agent-slhdm error killing pod: failed to "KillContainer" for "opentelemetry-collector" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: No such container: ff2950b19c6d7cd9db2c206b9e25f76024957c4587d271ec2670318859320d25"
[13:37:39] hopefully not these? because they seem like noise
[13:39:46] meeting, will read in a bit
[13:41:23] np
[13:43:49] those are noise indeed
[13:52:22] maybe it's just that it takes too long... with --atomic/--wait comes a timeout within which helm/helmfile expects the rollout to finish
[13:53:15] that is 5min by default; maybe rolling all the pods takes too long. I did not check the updateStrategy, but if it's replace, it will do one-by-one IIRC
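For context, this is roughly where the --wait/--atomic behavior and that timeout live in helmfile.yaml. A minimal sketch, assuming the helmDefaults block is used; the values, namespace, and chart reference are illustrative, not the actual production config:

```yaml
# Minimal helmfile.yaml sketch (illustrative values, not the real config).
helmDefaults:
  wait: true     # block until the release's pods report Ready
  atomic: true   # if the release is not Ready within the timeout, roll back automatically
  timeout: 300   # seconds; helm's 5-minute default, easily exceeded by a large DaemonSet rollout

releases:
  - name: main
    namespace: opentelemetry-collector
    chart: open-telemetry/opentelemetry-collector   # assumed chart reference, for illustration only
```

Raising `timeout` (or dropping `atomic`) is the knob the conversation below converges on: a slow but otherwise healthy rollout can still be marked failed and rolled back once that context expires.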
[13:57:14] in that case you would not see any suspicious events, obviously, as everything is fine k8s-wise
[13:58:27] jayme: it's rollingUpdate but maxUnavailable is 1, so it's still one-by-one
[13:58:45] and yeah, I suspect it was that high-level timeout, which I am also guessing is only checked at the very end
[13:58:45] ah, then I'd place my bet on the timeout :)
[13:59:07] I'm also going to change maxUnavailable to be a few rather than just 1 :)
[13:59:14] how is the timeout configured?
[13:59:22] yeah - it's basically a context with a timeout. If the rollout is not done within that time, it's considered bad and the rollback happens
[13:59:31] top of helmfile.yaml
[13:59:34] ah
[13:59:36] thanks :)
[13:59:49] yw
[14:15:35] ah, that timeout is 600 IIRC and we already had to alter it for the other daemonset, calico
[14:15:55] cluster is getting bigger
[14:22:24] yeah indeed
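For reference, the updateStrategy knob discussed above sits on the DaemonSet itself (in practice set through the chart's values). A minimal sketch with hypothetical names, image, and values, not the production manifest:

```yaml
# Hypothetical DaemonSet showing the rollout knobs discussed above; names and values are illustrative.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: main-opentelemetry-collector-agent
  namespace: opentelemetry-collector
spec:
  selector:
    matchLabels:
      app: opentelemetry-collector
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 5   # was 1; with ~187 pods, one-at-a-time easily exceeds a 5-10 minute helm timeout
  template:
    metadata:
      labels:
        app: opentelemetry-collector
    spec:
      containers:
        - name: opentelemetry-collector
          image: otel/opentelemetry-collector:latest   # placeholder image/tag
```

With maxUnavailable raised, the rollout replaces several pods at a time, so the whole DaemonSet can converge within the helmfile timeout (600s per the chat) instead of tripping --atomic's rollback.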