[07:00:31] good morning folks
[07:00:47] if everybody agrees I'd merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/831096 to move kafka-logging2001's kafka to PKI
[07:01:09] really easy to roll back if it doesn't work
[07:01:50] All kafka clients of kafka logging codfw should be already updated with the new CA bundle (puppet + PKI root CA) and we tested the setting extensively on kafka test
[07:02:29] (moving to PKI should, in theory, reduce some annoying/boring steps while bootstrapping a kafka cluster)
[07:50:03] elukey: don't ask if everyone agrees, just if someone disagrees :-P
[07:52:27] Is anyone around to review https://gerrit.wikimedia.org/r/c/operations/dns/+/831479 ??
[07:52:29] volans: sure :)
[07:52:41] I'm de-pooling esams ahead of upgrading cr3 there this morning
[07:53:34] topranks: looking
[07:54:17] thanks :)
[07:54:31] {done}
[08:09:38] kafka-logging2001 migrated to PKI, nothing on fire afaics
[08:09:52] please let me or the observability team know if you see issues (from the clients etc..)
[08:10:10] jbond: first broker (outside kafka test) in prod with PKI \o/
[08:38:36] elukey: great work that's awesome :)
[11:47:22] I'm probably missing something but https://gerrit.wikimedia.org/r/c/operations/puppet/+/831528 is failing tox -e mtail tests according to our CI
[11:47:50] the weird thing is that https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/51368/console is referring to an older version of the code
[11:50:11] and of course it's working on my local env
[11:50:12] sigh
[12:04:57] oh: are you starting to work now?
[13:01:30] probably the schedule has yet to be edited for them
[13:23:58] claime: I'm about to merge the patch for the decom cookbook, by any chance do any of the servers you decomm'ed still have the remote IPMI issue?
[13:24:03] to test it
[13:25:36] volans: Oh, sorry, I shut them all down manually
[13:25:50] but didn't fix the remote IPMI I guess
[13:25:55] No I didn't
[13:26:16] So you can still run the cookbook on them is what you mean?
[13:26:25] ok great, so I can test it in the failure scenario and then maybe try to fix one and re-run it
[13:27:07] ack
[13:49:46] claime: and ofc now IPMI works just fine for all of them LOL
[13:51:39] volans: that was to be expected following the law of Schrödinger's bug
[13:51:47] exists until you try to look at it
[13:51:54] eheheh indeed
[13:52:41] I'll run it anyway to test the almost happy path
[13:53:27] volans: would you have a bit of time to teach me a bit about netbox later on, or point me towards the right talk? I'm on the training checklist grind ;)
[13:54:44] claime: sure, I've a meeting in a few minutes but I can ping you back later
[13:56:01] volans: cool, thanks :) I have meetings starting at 1700, but if we can't do it today it can be more or less anytime in the week
[13:57:08] ok. So we don't have a full or anywhere recent session about netbox
[13:57:17] maybe could be an occasion to do one
[13:57:51] although I have to see if I can find the time this week, have a bunch of things to prepare for next week
[13:59:31] volans: If there are any recent talks on the subjects you're SME for in the on-call readiness checklist, I'm happy with any links to them as well. If you don't have time this week nbd, I'll try other SMEs on other subjects
[13:59:38] I'm not short on things to learn, that's for sure
[14:01:55] :)
[15:06:47] rzl, cwhite, my handover notes: https://docs.google.com/document/d/1YfRdiL2D0iCkkEnb12iq5IXbq8jAseMhhcXMu1G6LSA
[15:07:05] thanks!
[15:07:51] this is not required at the moment, but I hope it can help as a pilot to decide how to set that up
[15:08:11] maybe you want to keep it outdated on your turn?, idk
[15:10:01] got it, thanks
[15:10:05] should we limit it to stuff that actually generated a pag.e?
[15:10:47] I wrote about what I had a look at personally, I don't expect people to do the same
[15:11:00] that's my plan yeah, I don't intend to explicitly document and hand over nonpaging stuff unless there was notable impact
[15:12:14] anything that has notable impact or required a lot of your time as an oncaller -- "it paged" is sufficient but not necessary
[15:13:00] I think it is ok to write more, I don't expect it to be required, that is my take
[15:13:47] e.g. the wikibase thing I didn't fully understand, so I gave a heads up, someone else would just ignore it, and I think both ways are valid IMHO
[15:17:32] fwiw, the more extra information is in there, the more there is for the next person to read :) if you only include the important stuff, you can be more confident it was read and understood
[15:17:43] I wouldn't tell you to stop, but I do think it's a more effective handoff if you can be more selective
[15:18:02] ok
[15:18:45] although, from my point of view, I only wrote 2 lines 0:-)
[18:15:22] rzl: FYI httpbb is failing on the cumin hosts, I think because of the move to php 7.4
[18:15:25] X-Powered-By header: expected to match /PHP/7\.2/, got 'PHP/7.4.30'.
[18:15:35] thanks, I'll check it out
[18:15:50] those tests are specifically for the 7.4 migration actually, but something may have changed
[18:16:11] it's mw1418 AFAICT
[18:17:07] ack, all yours, thanks for looking :)
[18:17:08] yeah that's the only appserver we run the tests on
[18:17:19] rzl: see https://gerrit.wikimedia.org/r/c/operations/puppet/+/830783
[18:17:45] ah cheers
[18:18:57] I'll chat with j.oe tomorrow before doing anything, but I might delete those early if they're already failing -- or at least comment them out so they don't fail the hourly run and trip the alert
[18:23:41] or at least update the "default" test case to accept PHP/7\.[24]
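
For reference, a minimal sketch of why the check at 18:15:25 fails and what the pattern suggested at 18:23:41 would change. It uses Python's `re` module rather than httpbb itself, and it assumes the header is tested with an unanchored search; the names `OLD_PATTERN`, `NEW_PATTERN` and `header_matches` are hypothetical and only for illustration.

```python
import re

# Pattern quoted in the failing check: accepts only PHP 7.2.
OLD_PATTERN = re.compile(r"PHP/7\.2")
# Pattern suggested in the last message: accepts PHP 7.2 or 7.4.
NEW_PATTERN = re.compile(r"PHP/7\.[24]")


def header_matches(pattern: re.Pattern, value: str) -> bool:
    """Return True if the pattern occurs anywhere in the header value.

    Assumption: an unanchored search, which may not be how httpbb matches.
    """
    return pattern.search(value) is not None


observed = "PHP/7.4.30"  # header value reported in the log

print(header_matches(OLD_PATTERN, observed))  # old pattern vs. observed header
print(header_matches(NEW_PATTERN, observed))  # suggested pattern vs. observed header
```

With the suggested character class, a 7.2 or 7.4 header passes while a 7.3 header would still be rejected, so the test keeps some signal during the migration.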