[08:13:11] cdanis: in case you did not figure out already: that usually means that containers do not start (crashloop) and/or defined readiness probes don't go green within time. You may check the event logs in the related namespace for details
[11:32:41] jayme: yeah, I checked the event log and didn't see anything other than notices and a few warnings. Also, the error happened after fully upgrading all the pods. But I'll try again once I'm online in an hour or so
[13:23:21] cdanis: lmk if you'd like me to take a look
[13:25:07] jayme: oddly, there's no longer a diff on codfw
[13:25:17] oopsie
[13:25:27] I don't know what happened :)
[13:25:51] does helmfile `--debug` change behavior?
[13:26:01] I guess this was about the otel collector?
[13:26:03] yes
[13:26:33] AFAIK --debug does not change behavior, it just produces a wall of text
[13:26:38] or rather, a wall of yaml :)
[13:27:05] well, it seemed to have; I didn't get a rolling update or anything
[13:27:06] but has your change been applied to codfw?
[13:28:07] I don't see a diff for eqiad either
[13:29:13] yeah, I ran --debug apply on eqiad, and uh, there is no more diff afterwards either
[13:29:17] and it completed very quickly
[13:29:21] ah, you did apply there as well. So what you're saying is it did not work without --debug but it did with --debug
[13:29:28] hmm
[13:29:34] maybe --debug disables --atomic
[13:29:45] at the same time, we don't have any traces with the fixed service names
[13:29:48] which did happen yesterday
[13:31:10] oh
[13:31:15] it's still in progress
[13:31:20] main-opentelemetry-collector-agent 187 187 186 60 186 306d
[13:31:23] yeah
[13:31:27] I just saw that as well
[13:31:42] so I'd say --debug disables --wait and/or --atomic
[13:32:08] it was --atomic causing the issue yesterday btw, after I watched all ~187 pods restart in the events
[13:37:19] so I guess the next question is: what events are sufficient for helm to have considered the release failed?
[13:37:29] 53s Warning FailedToUpdateEndpoint endpoints/main-opentelemetry-collector Failed to update endpoint opentelemetry-collector/main-opentelemetry-collector: Operation cannot be fulfilled on endpoints "main-opentelemetry-collector": the object has been modified; please apply your changes to the latest version and try again
[13:37:33] 35s Warning FailedKillPod pod/main-opentelemetry-collector-agent-slhdm error killing pod: failed to "KillContainer" for "opentelemetry-collector" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: No such container: ff2950b19c6d7cd9db2c206b9e25f76024957c4587d271ec2670318859320d25"
[13:37:39] hopefully not these? because they seem like noise
[13:39:46] meeting, will read in a bit
[13:41:23] np
[13:43:49] those are noise indeed
[13:52:22] maybe it's just that it takes too long... with --atomic/--wait comes a timeout within which helm/helmfile expects the rollout to finish
[13:53:15] that is 5min by default; maybe rolling all the pods takes too long. I did not check the updateStrategy, but if it's replace, it will do one-by-one IIRC
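For context, this is roughly where the --wait/--atomic behavior and that timeout live in helmfile.yaml. A minimal sketch, assuming the helmDefaults block is used; the values, namespace, and chart reference are illustrative, not the actual production config:

```yaml
# Minimal helmfile.yaml sketch (illustrative values, not the real config).
helmDefaults:
  wait: true     # block until the release's pods report Ready
  atomic: true   # if the release is not Ready within the timeout, roll back automatically
  timeout: 300   # seconds; helm's 5-minute default, easily exceeded by a large DaemonSet rollout

releases:
  - name: main
    namespace: opentelemetry-collector
    chart: open-telemetry/opentelemetry-collector   # assumed chart reference, for illustration only
```

Raising `timeout` (or dropping `atomic`) is the knob the conversation below converges on: a slow but otherwise healthy rollout can still be marked failed and rolled back once that context expires.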
[13:57:14] in that case you would not see any suspicious events, obviously, as everything is fine k8s-wise
[13:58:27] jayme: it's rollingUpdate but maxUnavailable is 1, so it's still one-by-one
[13:58:45] and yeah, I suspect it was that high-level timeout, which I am also guessing is only checked at the very end
[13:58:45] ah, then I'd place my bet on the timeout :)
[13:59:07] I'm also going to change maxUnavailable to be a few rather than just 1 :)
[13:59:14] how is the timeout configured?
[13:59:22] yeah - it's basically a context with a timeout. If the rollout is not done within that time, it's considered bad and the rollback happens
[13:59:31] top of helmfile.yaml
[13:59:34] ah
[13:59:36] thanks :)
[13:59:49] yw
[14:15:35] ah, that timeout is 600 IIRC and we already had to alter it for the other daemonset, calico
[14:15:55] cluster is getting bigger
[14:22:24] yeah indeed
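For reference, the updateStrategy knob discussed above sits on the DaemonSet itself (in practice set through the chart's values). A minimal sketch with hypothetical names, image, and values, not the production manifest:

```yaml
# Hypothetical DaemonSet showing the rollout knobs discussed above; names and values are illustrative.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: main-opentelemetry-collector-agent
  namespace: opentelemetry-collector
spec:
  selector:
    matchLabels:
      app: opentelemetry-collector
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 5   # was 1; with ~187 pods, one-at-a-time easily exceeds a 5-10 minute helm timeout
  template:
    metadata:
      labels:
        app: opentelemetry-collector
    spec:
      containers:
        - name: opentelemetry-collector
          image: otel/opentelemetry-collector:latest   # placeholder image/tag
```

With maxUnavailable raised, the rollout replaces several pods at a time, so the whole DaemonSet can converge within the helmfile timeout (600s per the chat) instead of tripping --atomic's rollback.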