[11:47:25] It looks like we have an active incident on the dse-k8s cluster that is affecting the kube-api servers. No user-facing impact yet, as far as we know. https://phabricator.wikimedia.org/T389720
[11:48:03] If anyone has seen anything like `[fieldmanager.go:211] "[SHOULD NOT HAPPEN] failed to update managedFields"` before, or has any other insight, please don't hesitate to chip in.
[11:49:39] btullis: I've seen it in the past IIRC, it is not fatal. I think the issue is the kube-apiserver reloading, being overwhelmed, and not responding to probes
[11:50:22] is it still down?
[11:50:32] OK, thanks. I will check the metrics around the kube-apiserver and possibly bump up its ganeti resources.
[11:50:32] or did it respond to probes right after?
[11:50:44] yeah I think it probably needs a few more CPUs
[11:50:51] ml had the same problem a while ago IIRC
[11:51:04] there is a reason why the kube-apiservers are reloading that I don't recall
[11:51:10] No, the service is still up and running, but both controllers are logging those messages at a fairly high frequency.
[11:51:33] I mean I am not saying it is healthy that they log that thing :D
[11:51:36] It makes sense, as we have been ramping up usage quite a lot recently.
[11:51:45] but it may be an obscure bug that we'll get rid of with the k8s migration
[11:51:46] But yeah, they're not locked up.
[11:52:06] OK, thanks. Will report back here, as well as on the ticket.
[12:06:19] elukey: If you do happen to remember why the kube-apiservers were restarting on that previous occasion, that might be helpful.
[12:21:56] certificate updates maybe?
[12:22:09] I seem to remember that was one of the cases where ours restarted and caused issues
[12:25:54] claime: Ack, thanks. I will check for correlation.
[15:43:46] btullis: I am not sure when the apiserver went down, but this may be useful
[15:43:49] Mar 26 12:38:21 dse-k8s-ctrl1002 systemd[1]: Starting kube-apiserver-safe-restart.service - Restart kube-apiserver using a etcd lock...
[15:43:59] it is used by puppet for some use cases, like what claime mentioned