[07:58:26] gehel: would you have time today to discuss the communication message for the graph split, ideally before 6pm when we have a meeting?
[09:49:02] lunch
[10:04:51] dcausse: invite sent for 14:30
[13:14:36] o/
[13:44:12] https://gerrit.wikimedia.org/r/c/operations/alerts/+/1009359 (cirrus streaming updater alerts) is ready for review if anyone has a chance to look
[14:11:49] workout, back in ~40
[14:57:46] \o
[15:08:57] o/
[17:10:47] hmm, no taskmanager running for consumer-cloudelastic
[17:12:33] * inflatador wonders if my alerts patch would actually catch that scenario
[17:12:47] looks like oom. The last log is at 01:15, almost 16 hours ago. I was indeed thinking no alerts went off :)
[17:12:57] i wonder why nothing has tried to restart it in 16 hours though
[17:13:14] i guess i would have expected at least an hourly attempt
[17:13:18] yeah, k8s should do that itself... at least theoretically
[17:26:52] might be nice if the k8s operator logs could make it into logstash, or maybe they do and i just can't find them. I see there is a flink-operator namespace in k8s but nothing in logstash
[17:28:14] I wish they'd add the ECS K8s logs dashboard to the main logstash page
[17:33:12] For this oom... not entirely sure. taskmanagers only report about 60% memory used. So whatever blew it up quickly consumed all the memory.
[17:39:27] restarted and it seems to be running. So i guess the question is why the operator didn't do that
[17:40:18] job managers were up, but not task managers?
[17:41:05] oh, yes, that would be why the operator didn't do anything
[17:41:18] hmm, but the job manager reported failed and the operator was reporting the failed state
[17:42:38] I'm assuming kubernetes would "just take care of" container ooms... backoff logic would eventually stop it, but you'd think it would at least try
[17:43:05] I guess I don't fully understand the operator's responsibility in this scenario
[17:43:35] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/architecture/#flink-resource-lifecycle
[17:45:46] yea, this doesn't make it clear how the state of the application itself works. Maybe flink is missing some configuration to kill the jobmanager when the job fails?
[17:50:26] I guess "terminally failed" is a state that can't be recovered from? You'd think it would take more than just a simple OOM to get into that state, though
[17:55:34] i suppose this is the important bit: "When kubernetes.operator.job.restart.failed is set to true, then at the moment when the job status is set to FAILED the kubernetes operator will delete the current job and redeploy the job using the latest successful checkpoint."
[17:55:39] i can't find if we set that anywhere though
[17:56:56] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.4/docs/operations/configuration/ says it's off by default. curious
[17:57:35] yeah, that confounds my expectations
[17:58:36] docs say we can set it in spec.flinkConfiguration. I suppose why not
[17:59:21] yeah, I can get a patch up
[17:59:46] should we put it in the chart itself? Seems like a reasonable default
[17:59:58] `charts/flink-kubernetes-operator/conf/flink-conf.yaml` that is
[18:00:26] yea, i would expect that to be the default. Maybe it's not on by default since it only works well with HA or something
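
(Sketch, not from the log: what the override discussed between 17:55:34 and 18:00:26 might look like on a FlinkDeployment resource. The kubernetes.operator.job.restart.failed key and its behaviour come from the operator docs linked above; the resource name and the rest of the surrounding fields are illustrative placeholders, and the actual change goes through the deployment-charts repo rather than a hand-written CR.)

    apiVersion: flink.apache.org/v1beta1
    kind: FlinkDeployment
    metadata:
      name: flink-app-consumer-cloudelastic   # hypothetical name, for illustration only
    spec:
      flinkConfiguration:
        # When the job status reaches FAILED, have the operator delete the job and
        # redeploy it from the latest successful checkpoint (off by default per the docs).
        kubernetes.operator.job.restart.failed: "true"
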
[18:05:53] OK, not sure this is the right place, but here goes: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1017115
[18:07:09] ryankemper, gehel: late lunch, will be late to pairing
[18:07:32] kk
[18:57:44] inflatador, ryankemper: we have a RAID issue on elastic2088, could you check: https://phabricator.wikimedia.org/T361525
[18:58:22] Oh, I see you're already on it, it's just not in the current milestone yet. I'm moving it
[18:58:38] thx
[18:58:38] gehel: that one's already been turned over to DC Ops, should we take our tags off?
[18:58:58] Let's keep our tags but move it to blocked/waiting
[18:59:28] yep, exactly. This should come back to us once the H/W issue is fixed
[18:59:37] Okay, moved it. And gehel already threw on the milestone, so I think the phab state is where it should be now
[18:59:44] yep
[20:25:28] hmm, for whatever reason helmfile does not see the changes from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1017115's original patch set... moving stuff to values.yaml, let's see if that makes a difference
[20:26:03] curious
[20:32:44] I just did a `helmfile diff` on the latest patchset from my homedir on deploy1002, looks like it's recognizing a change now
[20:37:59] quick break, back in ~20
[20:59:25] back
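
(Sketch, not from the log: the kind of check described at 20:32:44. helmfile's -e/--environment flag and the diff subcommand are real; the checkout path, environment name, and service directory are assumptions, not taken from the log.)

    # from a checkout of the patched deployment-charts on the deploy host
    cd ~/deployment-charts/helmfile.d/services/cirrus-streaming-updater   # hypothetical path
    helmfile -e codfw diff   # render the patched chart and diff it against what is currently deployed
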