[08:35:41] <jelto>	 GitLab needs a short maintenance break in one hour (5 minutes)
[08:41:26] <jynus>	 cloudcontrol2001-dev is failing its backups. Not super worried about it, but asking cloud team if it will be for long so I can hide those alerts to avoid alert spam
[08:49:47] <_joe_>	 jelto: can you please check that you won't interfere with scap deployments?
[08:54:19] <jelto>	 _joe_ thanks for the hint. I'll reschedule the gitlab upgrade to later today
[11:06:56] <jelto>	 GitLab needs a short maintenance break in one hour (for around 5 minutes)
[11:15:48] <volans>	 k
[13:53:04] <elukey>	 hello folks
[13:53:31] <elukey>	 me and Filippo are ready to replace the kafka client used by Benthos (kafka -> kafka_franz) as suggested by upstream
[13:53:42] <elukey>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/919064 (need to install the new benthos on centrallog  nodes too)
[13:54:21] <elukey>	 the idea is to test the new client with the same setting that we have now, and then in a couple of days (if nothings explodes) we reduce partitions and adjust sampling
[13:54:41] <elukey>	 so we don't pull all the webrequest_{text,upload} data but only a slice
[13:55:00] <elukey>	 ok for oncallers?
[13:55:16] <volans>	 go ahead, thx for the headsup!
[13:55:51] <_joe_>	 it's ok for me too
[13:55:56] <elukey>	 super thanks
[13:56:04] <_joe_>	 as long as you promise to help me and hnowlan with some benthos 
[13:56:26] <volans>	 lol
[13:58:35] * elukey directs Giuseppe to Filippo
[13:58:51] <elukey>	 benthos upgraded, will leave it running for a few and then apply the patch later on
[14:00:02] * elukey stares at https://grafana.wikimedia.org/d/V0TSK7O4z/benthos?orgId=1&from=now-1h&to=now
[14:00:51] <godog>	 very cool, thank you elukey !
[14:00:58] <godog>	 also yeah happy to help with Benthos
[15:41:42] <elukey>	 of course having two different kafka clients in the same consumer group didn't work, we lost some events when benthos on centrallog1002 misbehaved (meanwhile the old version kept running on 2002)
[15:42:11] <elukey>	 I have a solution in mind but it would require stopping all benthos clients, delete the consumer group in kafka and restart them
[15:43:13] <volans>	 thxfor the update
[15:43:14] <ottomata>	 Interesting, Since brokers handle balancing consumers, I would have thought it wouldn't care.  is client.id considered in group balancer?
[15:45:03] <elukey>	 ottomata: IIUC I think that the two clients don't have the same way of sharing partitions, and the latter fails when joining
[15:45:07] <elukey>	 I see this error:
[15:45:28] <elukey>	 INCONSISTENT_GROUP_PROTOCOL: The group member's supported protocols are incompatible with those of existing members or first group member tried to join with empty protocol type or empty protocol list.
[15:46:38] <elukey>	 or maybe it is related to the client's kafka protocols supported, I am reading multiple issues under the same error on various forms
[15:48:38] <elukey>	 the other alternative is just to use a different kafka consumer group
[15:48:58] <elukey>	 (so stop all benthos, change the consumer group name, start one by one)
[15:49:30] <elukey>	 I am oriented on the latter, seems simpler
[15:51:24] <elukey>	 (https://medium.com/trendyol-tech/rebalance-and-partition-assignment-strategies-for-kafka-consumers-f50573e49609 TIL)
[15:51:44] <volans>	 as this data is kept only 24 I don't think that a small window of lack of data would be a big deal
[15:51:56] <volans>	 *24h
[15:54:33] <elukey>	 I can try the new one
[15:55:19] <elukey>	 yep all good
[15:55:31] <elukey>	 now we are running only benthos on 1002 with franz
[15:55:32] <elukey>	 let's see
[15:59:19] <elukey>	 ok both restarted running franz
[16:03:48] <_joe_>	 elukey: what is franz? a new go library for kafka?
[16:04:04] <_joe_>	 oh I see
[16:04:26] <elukey>	 _joe_ exactly yes, upstream used sarama before, but it was unstable and leading to issues so they are switching to kafka franz
[16:04:30] <_joe_>	 maybe we could try to convert purged to use it, but I fear we'd have similar problems, and we can't really lose data there
[16:04:49] <_joe_>	 ah, we're using the bindings to librdkafka instead
[16:06:28] <elukey>	 not sure why benthos doesn't use it as well, let's see how this client goes
[16:11:56] <elukey>	 CI is a little backlogged afaics, I'll merge the fix later, for the moment puppet is disabled on centrallog nodes
[16:12:02] <elukey>	 (fix being https://gerrit.wikimedia.org/r/c/operations/puppet/+/919158)
[16:16:13] <elukey>	 all fixed
[16:18:00] <elukey>	 also updated the task
[16:18:19] <elukey>	 going afk, ping me if anything doesn't look right :)
[16:19:23] <volans>	 thanks a lot!
[16:20:11] <volans>	 we got a small dip in the data for text and almost nothing for upload fwiw
[16:22:28] <volans>	 https://phabricator.wikimedia.org/F36992438
[16:51:58] <ottomata>	 ah i see luca yeah that could do it, different protocol versions
[17:31:51] <hashar>	 (the CI got resolved, it was timing out / retrying due to the new Gerrit host lacking IPv6)
[17:31:58] <hashar>	 away &