[15:38:35] I need to dig into the thread, but IIRC we had airflow start a new task every ~15 minutes, itself in charge of spawning 100s of pods, and the DAG had a wrongly hardcoded image tag. So as time went by, more and more pods were being created in an error state, and that seems to have overloaded kubernetes
[15:39:37] DE is refactoring the job to run a single pod instead of 100s. They lose granularity, but the system is going to be much more nimble and safer
[15:41:55] unrelated, but I've alphabetically sorted all services defined in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/4ceca1668f6965530914dd954a247d6459f19996/hieradata/common/profile/kubernetes/deployment_server.yaml#6 to ease maintenance