[08:35:41] GitLab needs a short maintenance break in one hour (around 5 minutes)
[08:41:26] cloudcontrol2001-dev is failing its backups. Not super worried about it, but asking the cloud team if it will be for long so I can hide those alerts to avoid alert spam
[08:49:47] <_joe_> jelto: can you please check that you won't interfere with scap deployments?
[08:54:19] _joe_ thanks for the hint. I'll reschedule the gitlab upgrade to later today
[11:06:56] GitLab needs a short maintenance break in one hour (for around 5 minutes)
[11:15:48] k
[13:53:04] hello folks
[13:53:31] Filippo and I are ready to replace the kafka client used by Benthos (kafka -> kafka_franz) as suggested by upstream
[13:53:42] https://gerrit.wikimedia.org/r/c/operations/puppet/+/919064 (need to install the new benthos on centrallog nodes too)
[13:54:21] the idea is to test the new client with the same settings that we have now, and then in a couple of days (if nothing explodes) we reduce partitions and adjust sampling
[13:54:41] so we don't pull all the webrequest_{text,upload} data but only a slice
[13:55:00] ok for oncallers?
[13:55:16] go ahead, thx for the heads-up!
[13:55:51] <_joe_> it's ok for me too
[13:55:56] super, thanks
[13:56:04] <_joe_> as long as you promise to help me and hnowlan with some benthos
[13:56:26] lol
[13:58:35] * elukey directs Giuseppe to Filippo
[13:58:51] benthos upgraded, will leave it running for a bit and then apply the patch later on
[14:00:02] * elukey stares at https://grafana.wikimedia.org/d/V0TSK7O4z/benthos?orgId=1&from=now-1h&to=now
[14:00:51] very cool, thank you elukey!
[14:00:58] also yeah happy to help with Benthos
[15:41:42] of course having two different kafka clients in the same consumer group didn't work, we lost some events when benthos on centrallog1002 misbehaved (meanwhile the old version kept running on 2002)
[15:42:11] I have a solution in mind but it would require stopping all benthos clients, deleting the consumer group in kafka and restarting them
[15:43:13] thx for the update
[15:43:14] Interesting, since brokers handle balancing consumers I would have thought it wouldn't care. Is client.id considered in the group balancer?
[15:45:03] ottomata: IIUC the two clients don't have the same way of sharing partitions, and the latter fails when joining
[15:45:07] I see this error:
[15:45:28] INCONSISTENT_GROUP_PROTOCOL: The group member's supported protocols are incompatible with those of existing members or first group member tried to join with empty protocol type or empty protocol list.
[15:46:38] or maybe it is related to the kafka protocols supported by the client, I am reading multiple issues under the same error on various forums
[15:48:38] the other alternative is just to use a different kafka consumer group
[15:48:58] (so stop all benthos, change the consumer group name, start them one by one)
[15:49:30] I am leaning towards the latter, seems simpler
[15:51:24] (https://medium.com/trendyol-tech/rebalance-and-partition-assignment-strategies-for-kafka-consumers-f50573e49609 TIL)
[15:51:44] as this data is kept only 24 I don't think that a small window of lack of data would be a big deal
[15:51:56] *24h
[15:54:33] I can try the new one
[15:55:19] yep all good
[15:55:31] now we are running only benthos on 1002 with franz
[15:55:32] let's see
[15:59:19] ok both restarted running franz
[16:03:48] <_joe_> elukey: what is franz? a new go library for kafka?
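For context, kafka_franz is the Benthos input built on franz-go, a pure-Go Kafka client. Below is a minimal franz-go consumer sketch, not the actual centrallog/Benthos configuration: broker, topic, and group names are placeholders. It shows where the consumer group and the partition-assignment balancer are set; all members of a group must advertise at least one assignment strategy in common, which is the condition behind the INCONSISTENT_GROUP_PROTOCOL error above (franz-go defaults to cooperative-sticky, while Sarama defaults to range).

package main

import (
	"context"
	"fmt"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Placeholder broker, topic, and group names for illustration only.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.ConsumeTopics("webrequest_text"),
		kgo.ConsumerGroup("benthos_webrequest_sampler"),
		// Members of a consumer group must share at least one assignment
		// strategy (protocol), or the join is rejected with
		// INCONSISTENT_GROUP_PROTOCOL. Range matches Sarama's default.
		kgo.Balancers(kgo.RangeBalancer()),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	for {
		fetches := cl.PollFetches(context.Background())
		if errs := fetches.Errors(); len(errs) > 0 {
			fmt.Println("fetch errors:", errs)
			return
		}
		fetches.EachRecord(func(r *kgo.Record) {
			fmt.Printf("%s[%d]@%d: %d bytes\n", r.Topic, r.Partition, r.Offset, len(r.Value))
		})
	}
}

Pinning the new client's balancer to one the old client also supports, or simply moving to a fresh consumer group as was done here, avoids the mixed-strategy join failure.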
[16:04:04] <_joe_> oh I see
[16:04:26] _joe_ exactly yes, upstream used sarama before, but it was unstable and leading to issues, so they are switching to kafka franz
[16:04:30] <_joe_> maybe we could try to convert purged to use it, but I fear we'd have similar problems, and we can't really lose data there
[16:04:49] <_joe_> ah, we're using the bindings to librdkafka instead
[16:06:28] not sure why benthos doesn't use it as well, let's see how this client goes
[16:11:56] CI is a little backlogged afaics, I'll merge the fix later; for the moment puppet is disabled on the centrallog nodes
[16:12:02] (fix being https://gerrit.wikimedia.org/r/c/operations/puppet/+/919158)
[16:16:13] all fixed
[16:18:00] also updated the task
[16:18:19] going afk, ping me if anything doesn't look right :)
[16:19:23] thanks a lot!
[16:20:11] we got a small dip in the data for text and almost nothing for upload fwiw
[16:22:28] https://phabricator.wikimedia.org/F36992438
[16:51:58] ah I see luca, yeah that could do it, different protocol versions
[17:31:51] (the CI got resolved, it was timing out / retrying due to the new Gerrit host lacking IPv6)
[17:31:58] away &
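For contrast with the franz-go sketch above, purged consumes through the Go bindings to librdkafka (confluent-kafka-go). A minimal sketch of that style of consumer follows; broker, group, and topic names are placeholders, not purged's real settings.

package main

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	// Placeholder settings for illustration; not purged's actual config.
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092",
		"group.id":          "purged_example",
		"auto.offset.reset": "earliest",
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()

	if err := c.SubscribeTopics([]string{"resource-purge"}, nil); err != nil {
		panic(err)
	}

	for {
		// ReadMessage(-1) blocks until a message or an error arrives.
		msg, err := c.ReadMessage(-1)
		if err != nil {
			fmt.Println("consumer error:", err)
			continue
		}
		fmt.Printf("%v: %d bytes\n", msg.TopicPartition, len(msg.Value))
	}
}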