[08:31:58] good morning folks
[08:32:44] today, if you all agree, I'd like to proceed with the reimage to buster of kafka-main2003 (https://phabricator.wikimedia.org/T296641)
[08:33:22] last week the main codfw cluster was moved to the kafka fixed uid/gid, and we have a partman recipe in place to preserve /srv
[08:34:04] what should happen is that clients will recognize the broker being down and will choose other brokers (following the kafka protocol)
[08:34:30] when the reimage finishes, kafka should pick up the work nicely
[08:34:58] we already have brokers running on buster, kafka-main200[4,5] (yes, a nice mixed stretch/buster cluster :D)
[08:45:39] what I'd need is somebody that reads the above and tells me "yep it looks good" or "no Luca it looks weird"
[08:45:42] :D
[09:41:43] elukey: Unfortunately I'm in no way qualified to argue about kafka, but what you describe sounds sane to me, generically... if that helps
[09:43:38] <_joe_> elukey: what happens if preserving /srv fails?
[09:43:53] <_joe_> we have to declare the broker failed and recreate it, I guess
[09:49:03] _joe_ the broker will preserve its id in the cluster, so the worst that can happen is that it will need to pull kafka log files from the other replicas
[09:49:19] <_joe_> right
[09:49:27] it should do it automagically, in theory
[09:49:45] <_joe_> yes, we've done it in the past when kafka1001 failed
[10:52:40] <_joe_> I just made the most appropriate typo ever
[10:52:50] <_joe_> wrote "helmvile" instead of "helmfile"
[11:15:24] ahahhaha
[14:55:55] I am going to stop kafka on kafka-main2003 in a bit, and then reimage, if nobody opposes
[15:29:36] lovely, I can't reach kafka-main2003's mgmt
[15:30:23] I can ping it, but cannot get a session
[15:30:35] <_joe_> sigh
[15:30:47] <_joe_> hp or dell?
[15:32:01] dell
[15:32:19] I also found https://phabricator.wikimedia.org/T267867 while looking for outstanding issues
[15:32:28] that was opened by past Luca
[15:33:02] it should be the same use case; I think we can see if it happens again (last time it was a matter of restarting, since librdkafka was in a weird state)
[15:33:41] <_joe_> I'm still at the conference btw
[15:34:23] ack :)
[15:34:46] elukey: see https://wikitech.wikimedia.org/wiki/Management_Interfaces
[15:34:49] if it helps
[15:35:40] volans: I was looking for this page, <3
[15:35:45] elukey: the redfish API is working fine fwiw
[15:36:04] if you need something lmk
[15:37:57] ipmitool works from cumin2001
[15:38:59] volans: if you have a moment to test reaching the mgmt console of kafka-main2003 I'd be grateful; in case it doesn't work for you I can attempt a `racadm racreset`
[15:40:49] elukey: so far I'm still waiting for a failure/timeout
[15:41:00] (ssh -vvv)
[15:41:20] surely it's not healthy
[15:42:19] elukey: go on, seems pretty stuck
[15:43:30] I'll do it via ipmitool, of course the racadm doesn't work
[15:50:07] volans: works now, thanks :)
[15:50:28] I've done nothing apart from writing most of that wiki page :D
[15:50:59] moral support :D
[15:51:40] I have a meeting in a bit, after that I'll kick off the reimage process
[15:51:43] my plan is to
[15:52:14] 1) systemctl stop kafka on kafka-main2003 (so that purged gets a graceful "don't use me please")
[15:52:29] 2) wait 5-10 mins to check eventgate and purged
[15:52:34] 3) kick off the reimage
[15:56:35] <_joe_> +1
[15:56:52] <_joe_> elukey: or, chaos engineering! move fast, break things, work until 9 pm
[15:56:59] <_joe_> your choice :D
[15:59:19] ahahahha
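[editor's note: a minimal shell sketch of the three-step plan above; the broker address, port, and cookbook arguments are illustrative assumptions, not commands quoted from the log]

    # 1) on kafka-main2003: stop the broker, so clients (purged, eventgate)
    #    fail over to the remaining brokers per the kafka protocol
    sudo systemctl stop kafka

    # 2) wait 5-10 minutes, then check replication from another broker;
    #    empty output means no under-replicated partitions
    kafka-topics.sh --bootstrap-server kafka-main2004.codfw.wmnet:9092 \
        --describe --under-replicated-partitions

    # 3) from a cumin host, kick off the reimage cookbook (see T296641)
    sudo cookbook sre.hosts.reimage --os buster -t T296641 kafka-main2003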
[16:58:03] all right, proceeding :)
[17:02:05] (commenting on #operations)
[17:03:36] 10serviceops, 10GitLab (Infrastructure): Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 (10Jelto)
[17:15:51] 10serviceops, 10Patch-For-Review: Upgrade kafka-main nodes to buster - https://phabricator.wikimedia.org/T296641 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main2003.codfw.wmnet with OS buster
[17:40:22] 10serviceops, 10Patch-For-Review: Upgrade kafka-main nodes to buster - https://phabricator.wikimedia.org/T296641 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main2003.codfw.wmnet with OS buster executed with errors: - kafka-main2003 (**FAIL**) - Downt...
[17:49:51] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: admin_ng private data no longer in /etc/helmfile-defaults/private/admin/ - https://phabricator.wikimedia.org/T297417 (10JMeybohm) p:05Triage→03Medium
[18:13:58] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission thumbor1003.eqiad.wmnet - https://phabricator.wikimedia.org/T285479 (10Cmjohnson)
[18:14:16] folks, kafka-main2003 is back in service, but with a new puppet cert
[18:14:29] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission thumbor1003.eqiad.wmnet - https://phabricator.wikimedia.org/T285479 (10Cmjohnson) 05Open→03Resolved removed from rack and netbox updated.
[18:14:33] 10serviceops, 10decommission-hardware: decommission thumbor100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T273137 (10Cmjohnson)
[18:14:36] it seems that we need to upgrade the firmware :(
[18:14:41] I'll open a task
[18:14:50] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission thumbor1004.eqiad.wmnet - https://phabricator.wikimedia.org/T285480 (10Cmjohnson)
[18:15:26] 10serviceops, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission thumbor1004.eqiad.wmnet - https://phabricator.wikimedia.org/T285480 (10Cmjohnson) 05Open→03Resolved removed from rack and updated netbox
[18:19:22] 10serviceops, 10ops-codfw: Installation issues on PowerEdge R440 Kafka main codfw servers with buster / firmware update needed - https://phabricator.wikimedia.org/T297422 (10elukey)
[18:19:57] created --^
[18:20:17] the kafka node is up, all good; I am going to log off, will set up some time with Papaul next week to upgrade its firmware
[18:20:26] if it is the root cause, we'll need to do it for all nodes :(
[23:22:25] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (Radar): Self-reported GitLab SSH host key fingerprints don't appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (10Dzahn) Brennen got that right, there are 2 separate SSH daemons on each gitlab server....
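[editor's note: for the fingerprint question in this thread, one way to check which host key a daemon actually presents and compare it against the self-reported value; the hostname is illustrative and the second address is a placeholder]

    # fetch the ed25519 key actually presented on port 22 and print its fingerprint
    ssh-keyscan -t ed25519 -p 22 gitlab.wikimedia.org 2>/dev/null | ssh-keygen -lf -

    # with two daemons on two addresses, repeat against the second one to see the other key
    ssh-keyscan -t ed25519 -p 22 <second-address> 2>/dev/null | ssh-keygen -lf -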
[23:37:11] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (Yak Shaving 🏃🪒), 10Upstream: Self-reported GitLab SSH host key fingerprints don't appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (10brennen) > I am tempted to call this a minor bug or improvement...
[23:39:04] 10serviceops, 10GitLab (Infrastructure), 10Release-Engineering-Team (Yak Shaving 🏃🪒), 10Upstream: Self-reported GitLab SSH host key fingerprints don't appear to match actual host key fingerprints - https://phabricator.wikimedia.org/T296944 (10Dzahn) Cool! Yeah, so our case is we have 2 IP addresses on the s...
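[editor's note: the log cuts off mid-sentence here; for context, a two-daemon setup like the one described, with each sshd bound to its own address and serving its own host keys, might look roughly like this; the addresses and paths are hypothetical]

    # /etc/ssh/sshd_config (host administration daemon)
    ListenAddress 10.0.0.1
    HostKey /etc/ssh/ssh_host_ed25519_key

    # /etc/ssh-gitlab/sshd_config (gitlab service daemon on the service IP)
    ListenAddress 10.0.0.2
    HostKey /etc/ssh-gitlab/ssh_host_ed25519_key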