[08:31:58] good morning folks
[08:32:44] today, if you all agree, I'd like to proceed with the reimage to buster of kafka-main2003 (https://phabricator.wikimedia.org/T296641)
[08:33:22] last week the main codfw cluster was moved to the kafka fixed uid/gid, and we have a partman recipe in place to preserve /srv
[08:34:04] what should happen is that clients will recognize the broker being down and will choose other brokers (following the kafka protocol)
[08:34:30] when the reimage finishes, kafka should pick up the work nicely
[08:34:58] we already have brokers running on buster, kafka-main200[4,5] (yes, a nice mixed stretch/buster cluster :D)
[08:45:39] what I'd need is somebody that reads the above and tells me "yep it looks good" or "no Luca it looks weird"
[08:45:42] :D
[09:41:43] elukey: Unfortunately I'm in no way qualified to argue about kafka, but what you describe sounds sane to me, generically... if that helps
[09:43:38] <_joe_> elukey: what happens if preserving /srv fails?
[09:43:53] <_joe_> we have to declare the broker failed and recreate it, I guess
[09:49:03] _joe_ the broker will preserve its id in the cluster, so the worst that can happen is that it will need to pull kafka log files from the other replicas
[09:49:19] <_joe_> right
[09:49:27] it should do it automagically, in theory
[09:49:45] <_joe_> yes, we've done it in the past when kafka1001 failed
[10:52:40] <_joe_> I just made the most appropriate typo ever
[10:52:50] <_joe_> wrote "helmvile" instead of "helmfile"
[11:15:24] ahahhaha
[14:55:55] I am going to stop kafka on kafka-main2003 in a bit, and then reimage, if nobody opposes
[15:29:36] lovely, I can't reach kafka-main2003's mgmt
[15:30:23] I can ping it, but cannot get a session
[15:30:35] <_joe_> sigh
[15:30:47] <_joe_> hp or dell?
[15:32:01] dell
[15:32:19] I also found https://phabricator.wikimedia.org/T267867 while looking for outstanding issues
[15:32:28] that was opened by past Luca
[15:33:02] it should be the same use case; I think we can see if it happens again (last time it was a matter of restarting, since librdkafka was in a weird state)
[15:33:41] <_joe_> I'm still at the conference btw
[15:34:23] ack :)
[15:34:46] elukey: see https://wikitech.wikimedia.org/wiki/Management_Interfaces
[15:34:49] if it helps
[15:35:40] volans: I was looking for this page, <3
[15:35:45] elukey: the redfish API is working fine fwiw
[15:36:04] if you need something lmk
[15:37:57] ipmitool works from cumin2001
[15:38:59] volans: if you have a moment to test reaching the mgmt console of kafka-main2003 I'd be grateful; in case it doesn't work for you I can attempt a `racadm racreset`
[15:40:49] elukey: so far I'm still waiting for a failure/timeout
[15:41:00] (ssh -vvv)
[15:41:20] surely it's not healthy
[15:42:19] elukey: go on, seems pretty stuck
[15:43:30] I'll do it via ipmitool, of course the racadm doesn't work
[15:50:07] volans: works now, thanks :)
[15:50:28] I've done nothing apart from writing most of that wiki page :D
[15:50:59] moral support :D
[15:51:40] I have a meeting in a bit, after that I'll kick off the reimage process
[15:51:43] my plan is to
[15:52:14] 1) systemctl stop kafka on kafka-main2003 (so that purged gets a graceful "don't use me please")
[15:52:29] 2) wait 5-10 mins to check eventgate and purged
[15:52:34] 3) kick off the reimage
[15:56:35] <_joe_> +1
[15:56:52] <_joe_> elukey: or, chaos engineering! move fast, break things, work until 9 pm
[15:56:59] <_joe_> your choice :D
[15:59:19] ahahahha
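[editor's note: a minimal shell sketch of the three-step plan above; the broker address, port, and cookbook arguments are illustrative assumptions, not commands quoted from the log]

    # 1) on kafka-main2003: stop the broker, so clients (purged, eventgate)
    #    fail over to the remaining brokers per the kafka protocol
    sudo systemctl stop kafka

    # 2) wait 5-10 minutes, then check replication from another broker;
    #    empty output means no under-replicated partitions
    kafka-topics.sh --bootstrap-server kafka-main2004.codfw.wmnet:9092 \
        --describe --under-replicated-partitions

    # 3) from a cumin host, kick off the reimage cookbook (see T296641)
    sudo cookbook sre.hosts.reimage --os buster -t T296641 kafka-main2003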
[16:58:03] all right, proceeding :)
[17:02:05] (commenting on #operations)
[17:03:36] 10serviceops, 10GitLab (Infrastructure): Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 (10Jelto)
[17:15:51] 10serviceops, 10Patch-For-Review: Upgrade kafka-main nodes to buster - https://phabricator.wikimedia.org/T296641 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main2003.codfw.wmnet with OS buster
[17:40:22] 10serviceops, 10Patch-For-Review: Upgrade kafka-main nodes to buster - https://phabricator.wikimedia.org/T296641 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main2003.codfw.wmnet with OS buster executed with errors: - kafka-main2003 (**FAIL**) - Downt...
[17:49:51] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: admin_ng private data no longer in /etc/helmfile-defaults/private/admin/ - https://phabricator.wikimedia.org/T297417 (10JMeybohm) p:05Triage→03Medium
[18:13:58] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission thumbor1003.eqiad.wmnet - https://phabricator.wikimedia.org/T285479 (10Cmjohnson)
[18:14:16] folks, kafka-main2003 is back in service, but with a new puppet cert
[18:14:29] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission thumbor1003.eqiad.wmnet - https://phabricator.wikimedia.org/T285479 (10Cmjohnson) 05Open→03Resolved removed from rack and netbox updated.
[18:14:33] 10serviceops, 10decommission-hardware: decommission thumbor100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T273137 (10Cmjohnson)
[18:14:36] it seems that we need to upgrade the firmware :(
[18:14:41] I'll open a task
[18:14:50] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission thumbor1004.eqiad.wmnet - https://phabricator.wikimedia.org/T285480 (10Cmjohnson)
[18:15:26] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission thumbor1004.eqiad.wmnet - https://phabricator.wikimedia.org/T285480 (10Cmjohnson) 05Open→03Resolved removed from rack and updated netbox
[18:19:22] 10serviceops, 10ops-codfw: Installation issues on PowerEdge R440 Kafka main codfw servers with buster / firmware update needed - https://phabricator.wikimedia.org/T297422 (10elukey)
[18:19:57] created --^
[18:20:17] the kafka node is up, all good; I am going to log off, will set up some time with Papaul next week to upgrade its firmware
[18:20:26] if it is the root cause, we'll need to do it for all nodes :(
[23:22:25] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (Radar): Self-reported GitLab SSH host key fingerprints don't appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (10Dzahn) Brennen got that right, there are 2 separate SSH daemons on each gitlab server....
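[editor's note: for the fingerprint question in this thread, one way to check which host key a daemon actually presents and compare it against the self-reported value; the hostname is illustrative and the second address is a placeholder]

    # fetch the ed25519 key actually presented on port 22 and print its fingerprint
    ssh-keyscan -t ed25519 -p 22 gitlab.wikimedia.org 2>/dev/null | ssh-keygen -lf -

    # with two daemons on two addresses, repeat against the second one to see the other key
    ssh-keyscan -t ed25519 -p 22 <second-address> 2>/dev/null | ssh-keygen -lf -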
[23:37:11] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (Yak Shaving 🏃🪒), 10Upstream: Self-reported GitLab SSH host key fingerprints don't appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (10brennen) > I am tempted to call this a minor bug or improvement...
[23:39:04] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (Yak Shaving 🏃🪒), 10Upstream: Self-reported GitLab SSH host key fingerprints don't appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (10Dzahn) Cool! Yeah, so our case is we have 2 IP addresses on the s...
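[editor's note: the log cuts off mid-sentence here; for context, a two-daemon setup like the one described, with each sshd bound to its own address and serving its own host keys, might look roughly like this; the addresses and paths are hypothetical]

    # /etc/ssh/sshd_config (host administration daemon)
    ListenAddress 10.0.0.1
    HostKey /etc/ssh/ssh_host_ed25519_key

    # /etc/ssh-gitlab/sshd_config (gitlab service daemon on the service IP)
    ListenAddress 10.0.0.2
    HostKey /etc/ssh-gitlab/ssh_host_ed25519_key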