[07:54:39] Hello folks
[07:55:06] if you need Superset/Turnilo's webrequest live data, please note that we had a problem in the past couple of days: https://phabricator.wikimedia.org/T331801
[07:55:24] traffic data seems normal now, but if you look back a few hours only upload traffic is registered
[07:55:32] keep it in mind if anything looks off :)
[07:55:41] rzl: --^ :)
[07:55:58] (we'll need to add some traffic volume alerts to Benthos, or something similar)
[08:27:33] the traffic volume reported for upload/text on webrequest_live vs the 1:128 sampled dataset is still not right
[08:27:48] Benthos didn't go back to the previous traffic volume
[08:52:43] going to test a theory with https://gerrit.wikimedia.org/r/c/operations/puppet/+/896043
[08:53:42] I have seen clients (like varnishkafka) get stuck in a weird way in the past, when TCP connections were left hanging (after a hard reboot of a cp node, etc.)
[08:53:53] the consumer seemed stuck in a weird state
[08:54:22] now I am wondering if, on the Kafka broker side, the consumer group coordinator still has some partitions assigned to centrallog1001
[08:58:06] ok, 1001 is back in the consumer group
[09:24:01] very weird, traffic volume increased and we got back into the "only-upload-data" state
[09:31:28] left a note in the task about a possible alternative step, but I'll wait for some feedback before proceeding
[10:06:34] also created https://gerrit.wikimedia.org/r/c/operations/puppet/+/897063
[10:45:47] ok, I went ahead and reset the offsets as indicated in the task; the status of webrequest live was broken anyway
[11:12:49] the situation improved a little, but Benthos is now handling 1/3 of the traffic it handled before the centrallog1001 -> 1002 switch
[11:12:52] that is very weird
[11:20:30] (need to step afk, will check later)
[15:10:12] elukey: oh wow, thanks so much for looking at this on the weekend
[16:15:18] rzl: np! Sadly it's still not working as before, really weird
[17:01:07] Hi, I'm here
[17:01:21] Is there anything I could do to help?
[17:02:26] I made a failover of centrallog1001 -> centrallog1002 last week.
[17:02:27] Do you think it may be related to this issue?
[17:07:37] Sorry in advance if I broke anything.
[17:07:37] I'm digging into the issue to understand what happened.
[17:11:44] denisse: o/ it should be related to the move from 1001 to 1002, but it's Kafka weirdness, not your fault, don't worry :)
[17:11:56] I am testing a few things and reporting in the task
[17:16:19] (going afk)
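
On the theory raised at 08:54:22 (the broker side still associating partitions with centrallog1001): a quick way to check is to ask the group coordinator which hosts the current group members connect from. A minimal sketch using kafka-python; the broker address and group id are placeholders, not the real ones used here.

```python
# Sketch: list the members of a consumer group and the host each one
# connects from, to see whether the old host still appears in the group.
from kafka import KafkaAdminClient

BOOTSTRAP = "kafka-broker.example.org:9092"  # placeholder broker
GROUP_ID = "benthos-consumer-group"          # placeholder group id

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
(group,) = admin.describe_consumer_groups([GROUP_ID])

print(f"group={group.group} state={group.state}")
for member in group.members:
    # client_host shows where each member connects from (e.g. /10.x.x.x);
    # a member still reported from the old host would confirm the theory.
    print(f"  member={member.member_id} client={member.client_id} "
          f"host={member.client_host}")
admin.close()
```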
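On the offset reset at 10:45:47: the actual procedure is the one described in T331801. Purely as illustration, here is a sketch of resetting a group's committed offsets to the log end with kafka-python, equivalent in spirit to `kafka-consumer-groups.sh --reset-offsets --to-latest`. Topic, group, and broker names are placeholders, and the group's consumers should be stopped first or the reset can be overwritten.

```python
# Sketch: commit the end-of-log offset for every partition of a topic,
# under the given group id, so the group resumes from "latest".
from kafka import KafkaConsumer, TopicPartition
from kafka.structs import OffsetAndMetadata

BOOTSTRAP = "kafka-broker.example.org:9092"  # placeholder broker
GROUP_ID = "benthos-consumer-group"          # placeholder group id
TOPIC = "webrequest_text"                    # placeholder topic

consumer = KafkaConsumer(
    bootstrap_servers=BOOTSTRAP,
    group_id=GROUP_ID,
    enable_auto_commit=False,
)
parts = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
consumer.assign(parts)
consumer.seek_to_end(*parts)

# position() forces the end offset to be resolved before we commit it.
offsets = {tp: OffsetAndMetadata(consumer.position(tp), "") for tp in parts}
consumer.commit(offsets)
consumer.close()
```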
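And on the traffic volume alerts mentioned at 07:55:58: the real check would presumably live in Benthos or the alerting stack. As a rough illustration of the idea, a sketch that derives a topic's produce rate from its end offsets and flags a drop; broker, topic, and threshold are all placeholders.

```python
# Sketch: sample a topic's total end offset twice, derive msgs/sec,
# and flag when the rate falls below a baseline threshold.
import time
from kafka import KafkaConsumer, TopicPartition

BOOTSTRAP = "kafka-broker.example.org:9092"  # placeholder broker
TOPIC = "webrequest_text"                    # placeholder topic
MIN_RATE = 1000.0                            # placeholder threshold (msgs/sec)

consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)
parts = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]

def total_end_offset():
    # Sum of log-end offsets across partitions == total messages produced.
    return sum(consumer.end_offsets(parts).values())

start = total_end_offset()
time.sleep(60)
rate = (total_end_offset() - start) / 60.0

if rate < MIN_RATE:
    print(f"ALERT: {TOPIC} producing {rate:.0f} msgs/sec (< {MIN_RATE:.0f})")
consumer.close()
```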