[19:27:16] We had an incident today where Airflow created thousands of pods in ERROR state. Does anyone know of a way we can prevent pods from being created after a certain % of pods in that ns are in ERROR state? I was hoping maybe kyverno
[19:34:37] inflatador: I think KubernetesJobOperator (instead of PodOperator) would have prevented this -- there are a lot (perhaps even too many) of different knobs on Jobs for controlling failure states
[19:35:20] https://kubernetes.io/docs/concepts/workloads/controllers/job/#handling-pod-and-container-failures
[19:43:47] ACK, https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html#difference-between-kubernetespodoperator-and-kubernetesjoboperator seems to support your theory
[19:44:15] "Users can limit how many times a Job retries execution using configuration parameters like activeDeadlineSeconds and backoffLimit"
[19:46:45] it would also move a lot of the really repetitive work from the airflow<>k8sapiserver RPC interface to within-process on the apiservers
[19:53:23] I know we've looked at using celery workers for similar reasons
[19:54:22] I'm guessing it's easier to refactor DAGs from PodOperator to KubernetesJobOperator though
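(For context on the knobs discussed above: a minimal sketch of a Kubernetes Job manifest, built as a Python dict, showing the `backoffLimit` and `activeDeadlineSeconds` fields from the quoted Airflow docs. The name, image, and values are illustrative, not from the incident.)

```python
# Sketch of a batch/v1 Job spec with the failure-handling knobs discussed
# above. With a Job, the apiserver's job controller handles pod retries
# in-process, instead of Airflow re-creating pods over the RPC interface.
example_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example-task"},  # hypothetical name
    "spec": {
        # Stop retrying after 3 failed pods (default is 6); caps runaway
        # pod creation like the ERROR-state pileup described above.
        "backoffLimit": 3,
        # Hard wall-clock limit for the whole Job, including retries.
        "activeDeadlineSeconds": 600,
        "template": {
            "spec": {
                # Pod-level restarts must be Never (or OnFailure) for Jobs.
                "restartPolicy": "Never",
                "containers": [
                    {
                        "name": "main",
                        "image": "busybox",  # illustrative image
                        "command": ["sh", "-c", "exit 0"],
                    }
                ],
            }
        },
    },
}
```

This is the manifest the Job controller would enforce; an Airflow `KubernetesJobOperator` task exposes these same fields through its operator parameters.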