[10:04:22] Quick question: are there special cookbooks that people are using to reboot a) k8s control plane servers (e.g. dse-k8s-ctrl100[1-2]) and b) k8s etcd clusters (e.g. dse-k8s-etcd100[1-3])? Thanks.
[10:05:47] I don't know, but there are a bunch of sre.k8s cookbooks (that include at least one reboot one)
[10:07:05] Thanks. Yes, I've been using that for worker nodes, but I'm not sure if it's suitable for the control plane or the database hosts.
[10:08:01] if this is about the current reboots, then you can ignore the dse-k8s etcd nodes: since they don't use DRBD, they were all implicitly rebooted as part of the Ganeti reboots
[10:09:26] <_joe_> btullis: I don't know if we have an etcd reboot cookbook, but that would indeed be useful given the number of clusters we have now
[10:14:34] moritzm: Thanks. That is a good point. I'll tick those six hosts off. _joe_: Yes, I'm just making a list of missing reboot cookbooks, where I'm having to string together `seq` and the `reboot-single` cookbook at the moment. etcd is on there, although in this case it seems no action is required after all.
[10:20:08] the SREBatchBase class is perfect for this; all we need is to specify the Cumin aliases used for etcd clusters and a health check command to validate that the cluster is up before the next node is picked
[10:36:37] Thanks moritzm. I think we can also look at extending the zookeeper cookbook to support reboots with SREBatchBase. It does restarting, but not rebooting.
[10:39:15] or rather create a discrete one based on roll-restart-zookeeper, the underlying framework is the same
[10:39:35] I don't have the time for this currently, but happy to review any patches
[10:48:31] Great, thanks. I only have time to make a list, too. Maybe it'll make a nice onboarding task for our new SRE, Atsuko, joining us on the 30th.
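The pattern discussed above (pick a small batch, verify the cluster is healthy, reboot, move on) can be sketched generically. This is not the actual SREBatchBase API from the Wikimedia spicerack cookbooks — `roll_reboot`, `reboot`, and `cluster_healthy` are hypothetical names; in the real cookbook a Cumin alias would supply the host list and the health check would wrap something like `etcdctl endpoint health`:

```python
from typing import Callable, Iterable, List


def roll_reboot(
    hosts: Iterable[str],
    reboot: Callable[[str], None],
    cluster_healthy: Callable[[], bool],
    batch_size: int = 1,
) -> List[str]:
    """Reboot hosts in batches, checking cluster health before each batch.

    The reboot action and the health check are injected as callables so
    this sketch stays self-contained; a real cookbook would also wait
    and re-check health after each batch before declaring success.
    """
    todo = list(hosts)
    done: List[str] = []
    for i in range(0, len(todo), batch_size):
        batch = todo[i:i + batch_size]
        if not cluster_healthy():
            # Stop early rather than take another node out of a sick cluster.
            raise RuntimeError(f"cluster unhealthy, stopping before {batch}")
        for host in batch:
            reboot(host)
            done.append(host)
    return done
```

With `batch_size=1` (the default) this is the same "one node at a time, health check in between" flow suggested for etcd and, later in the discussion, for stacked control-plane nodes.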
[11:00:37] kinda related, there's been some chatter about adding a start timestamp to roll-reboot cookbooks to make them resumable in https://phabricator.wikimedia.org/T419967
[11:00:55] btullis: sounds good
[11:01:31] godog: the initial implementation of this is already done, Effie implemented that for the memcached cookbook. that task is mostly about making it a little more generic
[11:02:27] moritzm: yeah I saw the patch, I suggested a timestamp approach instead, but maybe I'm missing something re: --min-time usage
[11:04:19] yeah, a fixed timestamp like 2026-03-19:08:40 or so is probably the easiest to handle
[11:05:18] the basic idea is that e.g. a config patch which needs a reboot or restart was deployed by that time; all hosts where the service was (re)started after that timestamp, or whose uptime is shorter than the time since it in the reboot case, are skipped
[11:05:53] exactly
[11:11:41] it is a POC atm, but yes it has helped
[11:12:00] I added a couple more feature requests :p
[11:15:15] godog: update the description with suggestions/requests, just to have them there when someone picks it up
[11:15:30] neat, thank you effie !
[11:30:01] btullis: You should be able to use the k8s.roll-reboot cookbook for control plane servers, even though I usually don't; I just reboot them using single (but we have stacked k8s apiserver and etcd)
[11:30:46] OK, thanks claime. I might try it then, if I'm feeling bold.
[11:30:57] sre.k8s.reboot-nodes cookbook, sorry
[11:32:01] just use a batch size of 1 :P
[11:35:57] Ha! Yes, good thinking.
[13:09:11] Raine: jhathaway: in the next few hours I am going to deploy a big change on media backup storage
[13:10:15] this will have no risk for production, and backups do not p*ge
[13:10:47] but I cannot rule out that it may leave errors or regular alerts while I'm doing it
[13:10:55] just fyi
[13:22:46] thanks jynus
[13:56:00] ack, thanks, glhf :D
[14:06:27] sorry if it was announced before, I was ooo sick for a couple of days. Did bast2003.wikimedia.org change its ssh keys?
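The resumability rule described above reduces to one comparison per host: given the fixed cutoff timestamp, a host that (re)booted after it has already picked up the change and is skipped. A minimal sketch with hypothetical names (`needs_reboot`, `hosts_to_reboot`), not the memcached cookbook's actual implementation:

```python
from datetime import datetime, timedelta
from typing import Dict, List


def needs_reboot(last_boot: datetime, cutoff: datetime) -> bool:
    """A host still needs a reboot iff it last booted before the cutoff.

    `cutoff` is the fixed timestamp from the discussion, e.g. when the
    config patch requiring the reboot was deployed.
    """
    return last_boot < cutoff


def last_boot_from_uptime(now: datetime, uptime_seconds: float) -> datetime:
    """Derive the last boot time from an uptime reading, as a cookbook
    that only has `uptime` per host would do."""
    return now - timedelta(seconds=uptime_seconds)


def hosts_to_reboot(boot_times: Dict[str, datetime], cutoff: datetime) -> List[str]:
    """Filter a {host: last_boot} mapping down to hosts still pending."""
    return sorted(h for h, t in boot_times.items() if needs_reboot(t, cutoff))
```

The same predicate works for service restarts by substituting the service start time for the boot time, which is why a single timestamp flag can make both kinds of roll operations resumable.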
[14:07:52] > START - Cookbook sre.hosts.downtime for 2:00:00 on bast2003.wikimedia.org with reason: host reimage
[14:07:52] in sal,
[14:08:39] brouberol: it indeed had hardware issues T420320
[14:08:40] T420320: bast2003 boot failure - https://phabricator.wikimedia.org/T420320
[14:09:16] tx taavi
[20:09:17] when I thought git was back and did a merge, two puppet servers failed to pull. will it get fixed automatically on the next update, or how do I pull manually? the documented method on wikitech is not working
[20:10:23] this is literally my case, but it is now guarded against running on a non-primary https://wikitech.wikimedia.org/wiki/Puppet#puppet-merge_fails_to_sync_on_secondary
[20:16:53] jynus: happy to take a look, so the sync to one of the puppet servers failed?
[20:18:07] only 2 failed to pull
[20:18:21] but they are not stuck or anything, it was just a transitory network error
[20:18:24] just merge something else and repeat puppet-merge?
[20:18:43] if it works by just merging, then don't worry
[20:18:50] yeah that should work, but it would be nice to have another runbook option
[20:18:56] I will do other deploys when things are more stable
[20:19:04] not a worry now
[20:19:44] note that I had some issues with the private puppet repo's git about 2h ago, see #security. Hoping it's not related, but just FYI
[20:20:16] thanks inflatador
[20:22:55] jynus: which 2 failed to pull?
[20:23:22] 2004 and 2002
[20:23:48] nope, this was the public puppet repo
[20:23:57] thanks
[20:24:24] and they were clearly http/tcp errors due to the ongoing issues (but I had already started by the time I realized they were still ongoing)
[20:27:08] but it may be a red herring
[20:28:51] another thing I see is a lot of http.request.headers.accept_language: "zh-cn,zh;q=0.9,en;q=0.8" from those times, too
[20:29:06] wrong channel
[21:05:13] jhathaway: I "unclogged" the servers with a new merge
[21:05:51] jynus: thanks, I was going to test another option, but I'll leave that for another time
[21:06:14] oh, I didn't know
[21:35:58] not your fault at all, I should have mentioned I was poking around :)