[16:46:35] this is probably the wrong way around, but i have a python script that uses `helmfile apply --set ...` to deploy a special backfilling release that is not part of the normal release process. This release runs to completion, but the related custom operator (flink) only understands things that run forever, so my python script also does a helmfile destroy to clean up afterwards.
[16:47:16] I guess my question is: is there a reasonable way to ensure i'm deleting the thing i think i'm deleting? I was considering adjusting the chart so i can provide a backfill_id label with --set, and then use that id in a selector when destroying
[16:48:50] re: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1005635/2/helmfile.d/services/cirrus-streaming-updater/cirrus_reindex.py
[17:05:00] this sounds like horror tbh :)
[17:06:21] but if you define a new release in helmfile.yaml, that's the one you'll be destroying
[17:07:17] that script is pretty big ...
[17:07:51] from an ignorant point of view, how complex is that operation?
[17:08:31] in theory the operation is simple. We have a program we normally run in streaming mode. We want to run the exact same program, with 3 extra options in a config file, as one-off backfilling releases
[17:08:40] but integrating that... my inexperience with k8s results in this :P
[17:09:34] part of the complexity is that we are deploying a custom resource, which is then managed by a k8s operator, not a normal pod deployment
[17:09:41] the issue here is that the program isn't gonna run to completion (some exit status) on its own, right? It needs to be manually stopped once an operator has deemed that it's done, right?
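[ed: the backfill_id idea floated above might look roughly like this — a sketch only, assuming a hypothetical `backfillId` chart value and that the one-off release entry in helmfile.yaml also carries a matching label for `-l` to select; neither is confirmed in the chart under discussion]

```python
import re

def backfill_commands(backfill_id: str, helmfile_path: str = "helmfile.yaml"):
    """Build helmfile apply/destroy command lines for a one-off backfill.

    `backfillId` is a hypothetical chart value name. For the destroy
    selector to work, the release entry in helmfile.yaml would also need
    a `labels:` block carrying the same id, since helmfile's -l matches
    release labels, not chart values.
    """
    # Reject ids that would break a label selector (or a shell, if the
    # command is ever joined into a string).
    if not re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9_.-]*", backfill_id):
        raise ValueError(f"unsafe backfill id: {backfill_id!r}")
    apply_cmd = [
        "helmfile", "-f", helmfile_path, "apply",
        "--set", f"backfillId={backfill_id}",
    ]
    destroy_cmd = [
        "helmfile", "-f", helmfile_path,
        "-l", f"backfillId={backfill_id}",
        "destroy",
    ]
    return apply_cmd, destroy_cmd
```

Returning argument lists (for e.g. `subprocess.run`) rather than strings sidesteps shell quoting entirely.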
[17:09:58] operator == human being in this question
[17:10:14] the k8s operator will monitor the running thing and update the custom resource to say it's finished, but it doesn't shut it down, because it doesn't expect anything to ever shut down
[17:10:36] ebernhardson: if the goal is just running mwscript on k8s, can I interest you in https://phabricator.wikimedia.org/T341553 instead? I expect it to be available Soon TM
[17:10:57] rzl: it's two parts: the first half runs mwscript, the second half runs our custom helm release to backfill over the time mwscript was running
[17:11:07] ah sorry, I should have kept reading :)
[17:12:52] akosiaris: it creates a flink deployment via the flink operator to do the backfill
[17:13:14] ah, that helps, thanks for the clarification
[17:13:42] and IIUC there is no "do this and then die" flink operation
[17:14:24] indeed, flink itself has a concept of batch jobs, but the only mention i could find from the k8s-flink-operator devs was that batch mode isn't supported but should "mostly" work. I suppose this is what "mostly" means :)
[17:17:31] lol
[17:18:47] this probably, right? https://stackoverflow.com/questions/74541368/can-flink-operator-support-batch-job-with-applicationcluster
[17:19:17] jayme: ya
[17:20:47] so you're saying this basically works, but it does not clean up the resources after the job has finished? What will be left over then?
[17:21:49] jayme: If the release isn't destroyed it will continue running a jobmanager pod. That should be basically idle; it manages taskmanager pods and there aren't any. I suppose it's not the end of the world, it would be 2 pods in eqiad and 1 in codfw.
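[ed: since the operator updates the custom resource to say the job is finished but never shuts it down, the cleanup script could poll that status before destroying. A minimal sketch of the check, assuming the flink-kubernetes-operator's status layout (`.status.jobStatus.state` reaching `FINISHED`) — verify the exact path against `kubectl get flinkdeployment -o yaml` before relying on it]

```python
def backfill_finished(flinkdeployment: dict) -> bool:
    """True once the operator reports the Flink job as FINISHED.

    The .status.jobStatus.state path is an assumption about the
    FlinkDeployment status written by the flink-kubernetes-operator,
    not something confirmed in this discussion.
    """
    state = (
        flinkdeployment.get("status", {})
        .get("jobStatus", {})
        .get("state")
    )
    return state == "FINISHED"
```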
[17:22:13] i still suspect we would need some tooling around backfilling though; this process is run once per wiki, and there are 1000 wikis
[17:24:08] there is some intention to add a loop over the wikis in this script, but starting with the simpler use case
[17:40:45] hmm... I'd say you could maybe leave it around for the next backfill, but I'm not sure that would actually work (e.g. that the next one will use the same jobmanager)
[17:41:12] I gtg for today, do you have a task for this so we can maybe discuss async?
[17:41:48] ah, https://phabricator.wikimedia.org/T356303 I suppose :)
[17:52:35] jayme: yup, that's the task. thanks!
[18:29:25] and i suppose for reference, the next deploy won't use the same jobmanager pod. The flink execution graph is immutable once it starts up; we have to use a new one to change the kafka inputs (i much preferred changing the runtime kafka inputs... but that's not really possible)
[18:30:58] ebernhardson: did you reach out on the flink mailing list perchance? Dunno if they have an answer, but they've been helpful for me in the past
[18:39:07] inflatador: haven't yet, i'm somehow a bit wary of public mailing lists.
[18:58:42] not sure if it matters for a first pass, but we could potentially fetch some of the flinkdeployment info with the python kubernetes client, a la https://stackoverflow.com/questions/61594447/python-kubernetes-client-equivalent-of-kubectl-get-custom-resource
[19:00:25] I'm working on using it to get the latest Flink release job ID... haven't tried looking at custom objects yet though
[19:00:34] or custom resources, I should say
[19:26:20] Y, looks like I can get the flink info with something like `custom_objects = custom_api.list_namespaced_custom_object(namespace='rdf-streaming-updater', group='flink.apache.org', version='v1beta1', plural='flinkdeployments')`.
Dunno if that has the info we need though
[22:19:16] i did see that was possible, but the k8s api in python is so low-level it didn't feel clearly better than making a kubectl call
[22:34:10] * ebernhardson is regularly grateful that closing and re-opening a phabricator tab retains the textbox content
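[ed: for reference, the `list_namespaced_custom_object` call quoted above returns a plain dict in the same shape as `kubectl get flinkdeployments -o json`, so picking out one backfill's deployment is ordinary dict traversal. A sketch — the `backfill_id` label name is a hypothetical convention the chart would have to set, not something the flink operator provides]

```python
def find_backfill_deployment(listing: dict, backfill_id: str):
    """Pick the FlinkDeployment carrying our (hypothetical) backfill label.

    `listing` is the dict returned by
    CustomObjectsApi.list_namespaced_custom_object(...): a top-level
    "items" list of resources, each with ordinary k8s metadata.
    Returns the matching item, or None if no deployment has the label.
    """
    for item in listing.get("items", []):
        labels = item.get("metadata", {}).get("labels") or {}
        if labels.get("backfill_id") == backfill_id:
            return item
    return None
```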