[14:16:35] ryankemper no rush but if you wanna look into decomming the old cloudelastic hosts when you get in, I started T357780 for this
[14:53:01] T357780: Decommission cloudelastic1001-1004 - https://phabricator.wikimedia.org/T357780
[16:32:36] small CR for getting rid of unused certificate alt names if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/995107
[16:39:25] \o
[16:39:36] hmm, Flink uses its entire job ID in the S3 path when it writes checkpoints, but it looks like it only uses the first 5 digits for savepoints https://phabricator.wikimedia.org/P56894
[16:54:01] curious
[16:55:09] will have to look out for that in my cleanup script
[17:18:46] Speaking of, another CR to take a savepoint before I start deleting stuff from prod: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1004199 I don't foresee a problem, but it's good to have a backup handy just in case
[17:33:05] +1
[17:49:38] thx! We've got a 4-day weekend so I'll hold off merging until we get back
[17:49:51] err...3-day wkend that is
[19:09:24] lunch, back in ~40
[20:22:19] inflatador: so wrt https://phabricator.wikimedia.org/T357780 we're ready to remove 1001-4 from the cluster?
[20:27:07] ryankemper Y, they are banned and ready for decom
[20:34:52] inflatador: ack, we may as well start running the decom cookbooks then
[20:49:26] hmm, not clear what the right approach is for the backfill flink job. It's complaining about not having an available checkpoint, which makes me wonder if we should be disabling backfill checkpointing, or destroy/re-deploy the release each time to clear out the unrelated checkpoints
[21:04:19] I'd probably be OK to destroy/re-deploy since backfills are always manual?
[21:04:32] at least that's my understanding...maybe wrong about that
[21:04:55] ryankemper cool, if I can help out w/that LMK
[21:05:35] inflatador: I'll kick off the decoms in an hour (running to lunch)
[21:06:23] backfills are kinda/sorta manual. They will be kicked off by the thousands from full-cluster reindexing
[21:06:42] * ebernhardson notes that event simply waiting a minute for flink to start is going to add up here :P
[21:07:00] s/event/even/
[21:08:35] maybe i will want to look closer into the conditional releases...if we need to destroy after each backfill that should probably be part of teardown, rather than setup. But that suggests doing normal helmfile apply's shouldn't deploy the -backfill releases
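
Note on the 16:39:36 / 16:55:09 observation (full job ID in checkpoint paths, only a short prefix in savepoint paths): a rough sketch of how a cleanup script might match both kinds of S3 key to one job. This is not the actual cleanup script. The key layout, the `savepoint-` directory naming, the example job ID, and the names `belongs_to_job` and `SAVEPOINT_PREFIX_LEN` are all assumptions for illustration; the prefix length of 5 is taken from the log as-is and should be checked against real paths.

```python
from pathlib import PurePosixPath

# Assumption from the log: savepoint paths carry only the first 5 characters
# of the job ID. Verify against real bucket contents before relying on it.
SAVEPOINT_PREFIX_LEN = 5


def belongs_to_job(key: str, job_id: str, prefix_len: int = SAVEPOINT_PREFIX_LEN) -> bool:
    """Return True if an S3 key looks like it belongs to the given Flink job.

    Checkpoint keys are matched on a path component equal to the full job ID.
    Savepoint keys are matched on a component starting with
    "savepoint-<short id>" (hypothetical layout). A short prefix can collide
    between jobs, so savepoint matches should be reviewed before deleting.
    """
    short_id = job_id[:prefix_len]
    for part in PurePosixPath(key).parts:
        if part == job_id:
            return True
        if part.startswith(f"savepoint-{short_id}"):
            return True
    return False


if __name__ == "__main__":
    # Made-up job ID and keys, for illustration only.
    job = "a1b2c3d4e5f60718293a4b5c6d7e8f90"
    keys = [
        f"flink/checkpoints/{job}/chk-42/_metadata",
        "flink/savepoints/savepoint-a1b2c-0123456789ab/_metadata",
        "flink/checkpoints/ffffffffffffffffffffffffffffffff/chk-1/_metadata",
    ]
    for k in keys:
        print(belongs_to_job(k, job), k)
```

Matching on whole path components rather than substrings keeps a job ID that happens to appear inside another name from matching by accident.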