[15:05:44] working on getting our new service into production on wikikube and just curious...do any non-service-ops-owned services require special handling during a cluster upgrade? ref https://phabricator.wikimedia.org/T342149#9230869
[15:18:17] hello folks, tried to refactor the k8s high latency alert: https://gerrit.wikimedia.org/r/c/operations/alerts/+/964025
[15:19:18] (changing the dashboard link to be more precise)
[15:19:28] if you want to review it please go ahead :)
[15:20:17] inflatador: in the ML case it is just deploying our custom stack (istio/knative/etc..)
[15:21:03] istio configs are deployed on all clusters via istioctl, a binary that takes specific yaml manifests as input, and we store them in the custom.d dir in deployment-charts
[15:21:31] in the case that you pointed out we'll need to add the specific command in the upgrade procedure (probably)
[15:21:36] (not sure if it answers)
[15:22:51] I think the question is how do you bring the cluster down, do you undeploy individual services (via helmfile destroy) or simply "turn off" k8s
[15:23:38] dcausse: ah ok! So in the past, IIRC, we wiped etcd, that meant resetting the cluster
[15:23:40] https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/k8s/upgrade-cluster.py looking at the cookbook now too
[15:23:43] there is a cookbook about it
[15:24:04] (of course after depooling etc..)
[15:24:16] if the services are just depooled and not "undeployed" then I think we should be good
[15:24:19] specific helmfile destroy commands will need to be done before the upgrade, if any
[15:25:39] sounds like we'd be OK, but if there's a way to step thru this on a test cluster let us know...
[15:26:30] yes would be nice to have a way to "simulate" this so we make sure the service recovers properly after deploying
[15:27:16] mmm reading through the phab task, it seems that the flink operator is a little flaky in this regard
[15:27:42] the next upgrade is very far from now (we need to migrate away from PSP and change the container engine before that)
[15:27:49] we can probably discuss this use case with Janis
[15:28:09] (going afk for today folks, have a nice weekend!)
[15:28:39] enjoy!
[15:28:43] thanks, enjoy
[21:04:50] We just deployed a new service on wikikube-staging, cirrus-streaming-updater ( https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/964071/ ).
[21:08:13] well...we can't actually get it to deploy with helmfile. The namespace looks fine, just no resources are being created with an apply. If anyone has time next week to look at it, reach out to me, d-causse, or e-bernhardson. Thanks for your time