[07:18:11] <XioNoX>	 I just got an email titled "[IMPORTANT] Action Required for Your Wikitech Account Migration" which says "Log in to idm.wikipedia.org [...]" but idm.wikipedia.org doesn't exist, I guess it's a typo with idm.wikimedia.org" ?
[08:03:00] <elukey>	 hello on-callers!
[08:03:18] <elukey>	 I am going to attempt to restart kafka on kafka-main2001 for https://phabricator.wikimedia.org/T370574
[08:05:25] <claime>	 elukey: ack
[08:07:40] <XioNoX>	 Hello all, we're going to upgrade Netbox. We expect it to take less than 2h, during that time Netbox will be unavailable for up to 1h (probably less). So please refrain from doing any provisioning, VM creation, etc...
[08:08:50] <elukey>	 re: kafka-main2001, I see some good sign of recovery, namely a spike in the replica fetcher thread actions.. It will take a bit, will keep it monitored
[08:24:21] <_joe_>	 elukey: nice, thanks <3
[08:25:29] <elukey>	 I think I may also need to restart kafka-main2005 now that I see, but I'll wait a bit :)
[08:46:30] <elukey>	 I think that all nodes are somehow in a weird state, I restarted kafka on 2005 and 2003 started to show some lag in fetching partitions
[08:46:36] <elukey>	 nothing really horrible in the logs
[08:46:44] <elukey>	 I think they just got into a weird state
[08:46:54] <elukey>	 due to the network partitions happened last thursday
[08:47:04] <elukey>	 so if you are ok, I'd proceed and restart the other 3 nodes
[08:54:10] <elukey>	 mmmm ok weird, from kafka topics describe 2001 seems still the only one without its Isr in good state
[08:57:08] <claime>	 godog: may want to sync with elukey if you're rebalancing kafka topics ^
[08:57:30] <godog>	 claime: thank you, I'm working on kafka-logging fwiw
[08:57:38] <godog>	 not kafka-main that is
[08:57:45] <claime>	 ah ok
[08:57:46] <claime>	 mb
[08:58:33] <godog>	 np it is good to call potential overlap out
[09:00:42] <elukey>	 ouch no I think that one partition of kafka-main2001 may be corrupted
[09:38:13] <godog>	 I'm rebalancing mediawiki.httpd.accesslog across kafka-logging eqiad brokers and I think I may have made a mistake by rebalancing all partitions at the same time, there might not be enough space to slosh everything around :|
[09:43:13] <godog>	 I've set throttle to at least slow things down, to not avail so far
[09:43:41] <godog>	 anyways if anyone knows how and/or if it possible to cancel an going rebalance any tip/suggestion is welcome
[10:02:28] <godog>	 ok halving the retention seems to have helped actually
[10:06:41] <topranks>	 godog: can't help with Kafka but I had a quick look at the network whatever is going on doesn't seem to be causing problems in terms of throughput etc 
[10:10:30] <godog>	 topranks: thank you! good to have that confirmation too
[11:48:57] <ottomata>	 elukey: ...if you delete the /srv/kafka/data/codfw.resource-purge-3 directory on kafka-main2001, will the replica just re-sync the full partition from the leader?
[11:49:14] <ottomata>	 maybe best to do that while broker offline?  
[11:49:28] <ottomata>	 i have some possibly incorrect memory of doing that before
[11:52:18] <elukey>	 ottomata: o/ I thought about the same, but I've read reports saying that you'd also need to clean up zookeeper (if any, for partition assignment) and after restarting the broker metadata may be messy
[11:52:35] <elukey>	 in theory the alternative is to way for compaction to do its magic
[11:52:52] <elukey>	 but it means other 3 days of this state
[11:53:04] <elukey>	 (if we don't modify retention for that topic temporarily)
[12:17:14] <elukey>	 (nothing relevant on zk)
[12:17:31] <elukey>	 basically I checked https://stackoverflow.com/questions/64514851/apache-kafka-kafka-common-offsetsoutoforderexception-when-reassigning-consume
[12:17:39] <elukey>	 that seems to fix
[12:17:42] <elukey>	 *fit
[12:18:56] <elukey>	 in theory testing the disk partition move to another location and restart could work, if not we re-add it in its original place
[13:04:24] <ottomata>	 aye makes sense
[14:29:32] <_joe_>	 oh btw
[14:29:48] <_joe_>	 resource purge data can be lost to little harm
[14:51:18] <elukey>	 hello on-callers, I'd like to attempt again to commit in the pvt repo on puppetserver1001
[14:51:25] <elukey>	 hopefully this time it will be better
[14:51:55] <elukey>	 for everybody - please don't do puppet pvt commits for 10 mins :)
[14:54:28] <volans>	 elukey: can't you hold the lock for puppetmaster1001 so to prevent others to run it?
[14:55:20] <elukey>	 volans: for pvt? How do you do it?
[14:56:14] <volans>	 ah sorry PVT... just touch DO_NOT_COMMIT.LUCA  and git add it
[14:56:39] <volans>	 people will find it in their list of added things and notice :D
[14:58:51] <elukey>	 make sense, I'll remember!
[14:59:23] <elukey>	 cleaned up, the post-commit hook hanged to contact puppetmaster1001, I missed some ferm rule probably
[16:01:38] <Southparkfan>	 I'm rebuilding the sessionstorage instance on deployment-prep, but puppet fails because the 'cassandra' posix account does not exist (in cassandra::instance)- where is this user created on production cassandra nodes?
[16:24:30] <XioNoX>	 we rolled back to netbox 3, you can resume using it again. We will investigate and see what went wrong in the upgrade. Stay tuned
[16:24:38] <sukhe>	 <3