[10:04:22] Quick question: are there special cookbooks that people are using to reboot a) k8s control plane servers (e.g. dse-k8s-ctrl100[1-2]) and b) k8s etcd clusters (e.g. dse-k8s-etcd100[1-3])? Thanks.
[10:05:47] I don't know, but there are a bunch of sre.k8s cookbooks (that include at least one reboot one)
[10:07:05] Thanks. Yes, I've been using that for worker nodes, but I'm not sure if it's suitable for the control plane or the database hosts.
[10:08:01] if this is about the current reboots, then you can ignore the dse-k8s etcd nodes: since they don't use DRBD, they were all implicitly rebooted as part of the Ganeti reboots
[10:09:26] <_joe_> btullis: I don't know if we have an etcd reboot cookbook, but that would indeed be useful given the number of clusters we have now
[10:14:34] moritzm: Thanks. That is a good point. I'll tick those six hosts off. _joe_: Yes, I'm just making a list of missing reboot cookbooks, where I'm having to string together `seq` and the `reboot-single` cookbook at the moment. etcd is on there, although in this case it seems no action is required after all.
[10:20:08] the SREBatchBase class is perfect for this; all we need is to specify the Cumin aliases used for etcd clusters and a health check command to validate that the cluster is up before the next node is picked
[10:36:37] Thanks moritzm. I think we can also look at extending the zookeeper cookbook to support reboots with SREBatchBase. It does restarting, but not rebooting.
[10:39:15] or rather create a discrete one based on roll-restart-zookeeper, the underlying framework is the same
[10:39:35] I don't have the time for this currently, but happy to review any patches
[10:48:31] Great, thanks. I only have time to make a list, too. Maybe it'll make a nice onboarding task for our new SRE, Atsuko, joining us on the 30th.
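The pattern discussed above (pick a small batch, verify the cluster is healthy, reboot, move on) can be sketched generically. This is not the actual SREBatchBase API from the Wikimedia spicerack cookbooks — `roll_reboot`, `reboot`, and `cluster_healthy` are hypothetical names; in the real cookbook a Cumin alias would supply the host list and the health check would wrap something like `etcdctl endpoint health`:

```python
from typing import Callable, Iterable, List


def roll_reboot(
    hosts: Iterable[str],
    reboot: Callable[[str], None],
    cluster_healthy: Callable[[], bool],
    batch_size: int = 1,
) -> List[str]:
    """Reboot hosts in batches, checking cluster health before each batch.

    The reboot action and the health check are injected as callables so
    this sketch stays self-contained; a real cookbook would also wait
    and re-check health after each batch before declaring success.
    """
    todo = list(hosts)
    done: List[str] = []
    for i in range(0, len(todo), batch_size):
        batch = todo[i:i + batch_size]
        if not cluster_healthy():
            # Stop early rather than take another node out of a sick cluster.
            raise RuntimeError(f"cluster unhealthy, stopping before {batch}")
        for host in batch:
            reboot(host)
            done.append(host)
    return done
```

With `batch_size=1` (the default) this is the same "one node at a time, health check in between" flow suggested for etcd and, later in the discussion, for stacked control-plane nodes.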
[11:00:37] kinda related, there's been some chatter about adding a start timestamp to roll-reboot cookbooks to make them resumable in https://phabricator.wikimedia.org/T419967
[11:00:55] btullis: sounds good
[11:01:31] godog: the initial implementation of this is already done, Effie implemented that for the memcached cookbook. that task is mostly about making it a little more generic
[11:02:27] moritzm: yeah I saw the patch, I suggested a timestamp approach instead, but maybe I'm missing something re: --min-time usage
[11:04:19] yeah, a fixed timestamp like 2026-03-19:08:40 or so is probably the easiest to handle
[11:05:18] the basic idea is that e.g. a config patch which needs a reboot or restart was deployed by that time; all hosts where the service was (re)started after that timestamp, or whose uptime is shorter than the time since it in the reboot case, are skipped
[11:05:53] exactly
[11:11:41] it is a POC atm, but yes it has helped
[11:12:00] I added a couple more feature requests :p
[11:15:15] godog: update the description with suggestions/requests, just to have them there when someone picks it up
[11:15:30] neat, thank you effie !
[11:30:01] btullis: You should be able to use the k8s.roll-reboot cookbook for control plane servers, even though I usually don't; I just reboot them using single (but we have stacked k8s apiserver and etcd)
[11:30:46] OK, thanks claime. I might try it then, if I'm feeling bold.
[11:30:57] sre.k8s.reboot-nodes cookbook, sorry
[11:32:01] just use a batch size of 1 :P
[11:35:57] Ha! Yes, good thinking.
[13:09:11] Raine: jhathaway: in the next few hours I am going to deploy a big change on media backup storage
[13:10:15] this will have no risk for production, and backups do not p*ge
[13:10:47] but I cannot rule out that it may leave errors or regular alerts while I'm doing it
[13:10:55] just fyi
[13:22:46] thanks jynus
[13:56:00] ack, thanks, glhf :D
[14:06:27] sorry if it was announced before, I was ooo sick for a couple of days. Did bast2003.wikimedia.org change its ssh keys?
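The resumability rule described above reduces to one comparison per host: given the fixed cutoff timestamp, a host that (re)booted after it has already picked up the change and is skipped. A minimal sketch with hypothetical names (`needs_reboot`, `hosts_to_reboot`), not the memcached cookbook's actual implementation:

```python
from datetime import datetime, timedelta
from typing import Dict, List


def needs_reboot(last_boot: datetime, cutoff: datetime) -> bool:
    """A host still needs a reboot iff it last booted before the cutoff.

    `cutoff` is the fixed timestamp from the discussion, e.g. when the
    config patch requiring the reboot was deployed.
    """
    return last_boot < cutoff


def last_boot_from_uptime(now: datetime, uptime_seconds: float) -> datetime:
    """Derive the last boot time from an uptime reading, as a cookbook
    that only has `uptime` per host would do."""
    return now - timedelta(seconds=uptime_seconds)


def hosts_to_reboot(boot_times: Dict[str, datetime], cutoff: datetime) -> List[str]:
    """Filter a {host: last_boot} mapping down to hosts still pending."""
    return sorted(h for h, t in boot_times.items() if needs_reboot(t, cutoff))
```

The same predicate works for service restarts by substituting the service start time for the boot time, which is why a single timestamp flag can make both kinds of roll operations resumable.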
[14:07:52] > START - Cookbook sre.hosts.downtime for 2:00:00 on bast2003.wikimedia.org with reason: host reimage
[14:07:52] in sal,
[14:08:39] brouberol: it indeed had hardware issues T420320
[14:08:40] T420320: bast2003 boot failure - https://phabricator.wikimedia.org/T420320
[14:09:16] tx taavi
[20:09:17] when I thought git was back and did a merge, two puppet servers failed to pull. will it get fixed automatically on the next update, or how do I pull manually? the documented method on wikitech is not working
[20:10:23] this is literally my case, but it is now guarded against running on a non-primary https://wikitech.wikimedia.org/wiki/Puppet#puppet-merge_fails_to_sync_on_secondary
[20:16:53] jynus: happy to take a look, so the sync to one of the puppet servers failed?
[20:18:07] only 2 failed to pull
[20:18:21] but they are not stuck or anything, it was just a transitory network error
[20:18:24] just merge something else and repeat puppet-merge?
[20:18:43] if it works by just merging, then don't worry
[20:18:50] yeah that should work, but it would be nice to have another runbook option
[20:18:56] I will do other deploys when things are more stable
[20:19:04] not a worry now
[20:19:44] note that I had some issues with the private puppet repo's git about 2h ago, see #security. Hoping it's not related, but just FYI
[20:20:16] thanks inflatador
[20:22:55] jynus: which 2 failed to pull?
[20:23:22] 2004 and 2002
[20:23:48] nope, this was the public puppet repo
[20:23:57] thanks
[20:24:24] and they were clearly http/tcp errors due to the ongoing issues (but I had already started by the time I realized they were still ongoing)
[20:27:08] but it may be a red herring
[20:28:51] another thing I see is a lot of http.request.headers.accept_language: "zh-cn,zh;q=0.9,en;q=0.8" from those times, too
[20:29:06] wrong channel
[21:05:13] jhathaway: I "unclogged" the servers with a new merge
[21:05:51] jynus: thanks, I was going to test another option, but I'll leave that for another time
[21:06:14] oh, I didn't know
[21:35:58] not your fault at all, I should have mentioned I was poking around :)