[08:26:22] if anyone has opinions on adding fdfind and ripgrep to standard packages: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050257
[08:30:36] I use ripgrep and find it very useful, didn't know about fd-find but I'll look at it
[08:30:45] (👍 for me)
[08:49:47] moritzm: +1'd
[08:51:03] thx all, I'll wait at least a day for other comments
[08:57:11] now what do we do with atuin :-)
[09:05:30] I was thinking about the benthos issue from yesterday. One thing that gets me worried is the apparent lack of client/consumer/producer kafka metrics, preventing us from diagnosing performance issues, compared to librdkafka-based libraries (https://github.com/confluentinc/librdkafka/blob/master/STATISTICS.md)
[09:06:26] you mean on the benthos side?
[09:06:39] I'm not 100% clear whether a) the metrics exist and I can't find them b) the metrics exist and we're not collecting them c) the metrics are not reported at all
[09:06:40] yes
[09:07:52] yeah, it's only generic "input/processing/output" metrics, split by the eventual channels
[09:07:58] exactly
[09:07:59] I'll have a look at getting metrics, currently benthos isn't configured to report much detail
[09:08:02] but nothing specific to kafka
[09:08:20] I think it can be told to report more
[09:08:49] btw, if profiling (debugging) is enabled on benthos, we found that calling the http endpoint to generate a pprof file is usually helpful in diagnosing some situations
[09:09:06] that's my #1 concern, because in a world where benthos is responsible for carrying webrequest data, we'd need these metrics for diagnosis purposes
[09:09:28] fabfur: ack, that's a good source of info as well 👍
[09:09:41] https://docs.redpanda.com/redpanda-connect/components/http/about/#debug-endpoints
[09:10:50] we found this useful in the past to check where benthos was spending most of its time
[09:11:20] what I'd ideally like to see would be timeseries related to the message batch size, commit duration, etc. https://github.com/confluentinc/librdkafka/blob/master/STATISTICS.md has a list of useful metrics, but benthos seems to be using its own kafka library
[09:11:50] being made by redpanda, a kafka alternative, I understand why they wouldn't want to use the core library written by confluent.
[09:12:20] Anyway, I'll try to have a look at whether franz-go can expose prometheus metrics for kafka consumers/producers
[09:12:39] benthos can use 2 different kafka libraries, and neither of them is based on librdkafka
[09:13:20] https://github.com/twmb/franz-go mentions `Plug-in metrics support for prometheus, zap, etc.` so maybe we could have benthos register the franz-go prometheus registry as well?
[09:14:14] https://github.com/twmb/franz-go/blob/master/plugin/kprom/kprom.go
[09:20:24] oh and sarama is pulled in as a dependency as well https://github.com/redpanda-data/connect/blob/main/go.mod#L18. I think this is the client library used by the `kafka` input, while franz-go is used by the `kafka_franz` input
[09:22:13] it also exposes prometheus metrics. What I don't know is whether benthos has a way to include this registry into its own
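For context, this is roughly what the kprom wiring looks like in a standalone franz-go client — a minimal sketch only, not how Benthos is wired today; the broker, group, topic and port are placeholders.

```go
// Minimal sketch: a franz-go client with the kprom plugin registered,
// exposing Kafka client metrics (broker connects, bytes read/written,
// buffered fetch/produce records, ...) on /metrics.
package main

import (
	"net/http"

	"github.com/twmb/franz-go/pkg/kgo"
	"github.com/twmb/franz-go/plugin/kprom"
)

func main() {
	metrics := kprom.NewMetrics("benthos_kafka") // metric namespace/prefix

	client, err := kgo.NewClient(
		kgo.WithHooks(metrics),                    // kprom implements the kgo hook interfaces
		kgo.SeedBrokers("kafka-placeholder:9092"), // placeholder broker
		kgo.ConsumerGroup("placeholder-group"),    // placeholder consumer group
		kgo.ConsumeTopics("placeholder-topic"),    // placeholder topic
	)
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// kprom keeps its own Prometheus registry; Handler() serves it.
	http.Handle("/metrics", metrics.Handler())
	go http.ListenAndServe(":9090", nil)

	// ... consume as usual, e.g. client.PollFetches(ctx) in a loop ...
	select {}
}
```

Since kprom keeps its own registry, the open question above stands: Benthos would need to either serve that registry's handler somewhere or merge it into its own.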
[09:38:38] moritzm: about the only opinion I have about the fdfind/ripgrep change is that the list is not sorted alphabetically ;)
[09:42:06] arturo: What is atuin in this context, please?
[09:43:04] btullis: https://atuin.sh/ is a terminal history manager. It changed my [terminal] life
[09:43:44] basically a fuzzy finder for the terminal history
[09:45:11] Oh, thanks. I will check it out, although I think I may need more precision and less fuzz in my terminal :-)
[09:45:38] :-P
[10:03:17] <_joe_> brouberol: wait, benthos is NOT using librdkafka?
[10:03:19] <_joe_> eeek
[10:03:43] <_joe_> I did some tests back in the day on kafka library performance in go
[10:03:46] <_joe_> and well
[10:04:06] <_joe_> kafka-go was like 1 order of magnitude faster back then
[10:04:14] <_joe_> it was like 4 years ago, but still
[10:10:05] _joe_: nope, it either uses sarama, the first go/kafka library by shopify, or franz-go, which I don't have experience with
[10:10:49] yep, kafka-go is mainly a go wrapper around librdkafka. Sarama used to have pretty poor performance compared to it as well
[10:10:54] <_joe_> oh dear
[10:10:58] <_joe_> yes
[10:11:16] <_joe_> ok that could explain why benthos lags so much on that topic
[10:11:29] kamila_: another idea: could we try to leverage the kafka_franz input instead of kafka, to see whether we could squeeze better perf out of franz-go than with sarama?
[10:11:43] brouberol: it is kafka_franz actually
[10:11:49] oh it is?
[10:11:52] gotcha
[10:11:59] <_joe_> franz_kafka would've been better
[10:12:23] we could change the code. It wouldn't be too much of a metamorphosis
[10:12:34] oh god, that'd bring about some huge bugs
[10:12:50] <_joe_> just don't try to fix them with an apple
[10:13:21] that's a wrap folks, standup comedy hour it is!
[10:14:17] <_joe_> oh it's just part of my "make my father turn in his grave" routine. He definitely didn't think Franz Kafka could be used as joke material
[10:15:04] and I was thinking about what fabfur said: in the absence of metrics, we could always pprof and get a sense of where we're spending time.
[10:15:23] I'm not sure why we have an absence of metrics actually :D trying to figure that out
[10:15:31] TBH I think Kafka himself wouldn't necessarily understand why he's been trending in google search for a while
[10:15:34] or is kafka_franz really not outputting any?
[10:15:41] :D
[10:17:57] <_joe_> I have to say, as a go programmer, when you see the internals of kafka-go you cringe very hard. But OTOH it's the reference implementation under the hood, and it's well-written C
[10:18:39] <_joe_> the reason why projects like benthos try to avoid it is to avoid having to bundle librdkafka in the executable and thus not being able to create a single release per os/platform
[10:23:50] but OTOH Benthos could expose other components' metrics :)
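For reference, a sketch of the Benthos stanzas that enable richer observability — a Prometheus metrics endpoint plus the pprof debug endpoints from the docs linked above. This assumes the stock Benthos v4 schema; the address and how it maps onto our puppet-managed config are assumptions, and it still only surfaces Benthos's generic pipeline metrics, not the underlying Kafka client stats.

```yaml
# Sketch only (stock Benthos v4 schema assumed; address is a placeholder).
http:
  address: 0.0.0.0:4195
  debug_endpoints: true   # serves /debug/pprof/profile etc. on the same listener

metrics:
  prometheus: {}
```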
[11:46:32] I'm going to try resetting the CG offset (a simple restart didn't do anything), not quite sure what that's going to do with the MW metrics it's outputting, but it's not like they're correct now anyway '^^ (plus they're all counters, so it shouldn't matter much)
[11:53:55] I think Puppet just broke. At least on some hosts like cumin and deploy1002
[11:56:30] Or maybe it's just me...
[11:58:39] yeah, I don't understand why my pcc run fails :-/
[11:58:57] The error is: Error: Evaluation Error: Error while evaluating a Function Call, Failed to parse inline template: certificate verify failed [self-signed certificate in certificate chain for CN=Puppet CA: puppet] (file: /srv/jenkins/puppet-compiler/3086/change/src/modules/mediawiki/manifests/web/yaml_defs.pp, line: 26, column: 30) on node deploy1002.eqiad.wmnet
[11:58:57] klausman: which PCC run?
[11:59:03] https://puppet-compiler.wmflabs.org/output/1050328/3086/deploy1002.eqiad.wmnet/index.html
[11:59:09] that sounds like a PCC issue
[11:59:29] The prod change failing I get, I am trying to fix it. But my change doesn't even touch certs
[11:59:47] ah, my bad!
[11:59:57] I had assumed deploy1002 was on P7, but it's still P5
[12:00:20] Running with -P 5 now
[12:00:31] taavi: thanks for rubberducking :)
[12:04:27] deploy1002 is on buster, so it can't use Puppet 7
[12:27:27] Even with -P5 it still fails in the same way :-/
[12:32:16] klausman: puppet is complaining on a few hosts
[12:32:33] due to 3941eccfd4 apparently
[12:32:43] Yes.
[12:33:19] That change was missing https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050328
[12:33:26] But the new change fails as well, and I am puzzled.
[12:33:37] I will revert the old change and investigate
[12:38:48] Ok, the revert now lets Puppet complete on deploy1002 and cumin2002, I presume on the other hosts as well
[12:57:17] This is puzzling. Running r-p-a on deploy1002 and deploy2002 works fine. But if I do a pcc run (with -P 5), they fail with the above cert error, both for the "prod" and "change" runs.
[12:58:07] It's almost as if -P 5 is ignored :-/
[12:58:56] that sounds like a PCC issue
[12:59:01] yep
[12:59:04] as in, pcc itself being broken
[12:59:05] [puppet7-compiler-node] $ /bin/bash -xe /tmp/jenkins11234484152952402995.sh
[12:59:19] Though I am not sure how much proof that log line is
[13:05:28] Yep, that flag does nothing.
[13:07:22] Sooo, how does one do a PCC run with P5, then?
[13:09:06] klausman: from https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/, select only P5?
[13:10:03] Ah, thanks
[13:11:29] but my changes don't show up on the p5-compiler-node, at all.
[13:12:32] ah, I can trigger it from Gerrit.
[13:24:50] sukhe: thanks for that tip, you saved what's left of my sanity today
[13:25:02] klausman: hth! and sanity++
[13:27:41] brouberol: re benthos weirdness, so I reset the CG offset and got https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=1719491225093&orgId=1&to=1719494825093&var-cluster=logging-eqiad&var-consumer_group=benthos-mw-accesslog-metrics&var-datasource=eqiad+prometheus%2Fops&var-topic=All :D
[13:27:52] I don't know how to interpret this other than "wat"
[13:28:33] Did you reset it to earliest or latest?
[13:28:38] earliest
[13:28:51] It looks like earliest, meaning that each member needs to re-read all data
[13:28:51] is that indicative of bad data somehow?
[13:29:45] it's difficult to say, as we don't have read rate metrics, and the lag does not seem to decrease
[13:29:54] okay, so now I get to wait? sorry, I'm not experienced with kafka at all
[13:30:00] so, it could be doing nothing, or could be doing its reading very slowly
[13:30:05] yeah
[13:30:08] I'd say, yep, let it simmer for a while
[13:30:13] ack, thank you
[13:30:14] see how it evolves
[13:30:16] np
[13:31:14] it does seem to be making some progress, if you hover the mouse over the line over the span of the last 10 min
[13:31:18] it's just slow
[13:31:23] yes, it's going down a little
[13:31:37] but it did that a few times already, for a while
[13:31:48] I wonder if we could tweak the batch size/timeout to fetch more messages from the topic
[13:32:04] yes, I don't think it's doing any batching at all atm
[13:32:09] (again, it's a shot in the dark without concrete knowledge of what it's doing)
[13:32:23] I can do that, it's just weird that it suddenly started happening
[13:32:40] totally
[13:32:59] I'm seeing
[13:32:59] ---
[13:32:59] batching:
[13:32:59]   count: 2048
[13:32:59]   period: 100ms
[13:33:00] ---
[13:33:00] in the config
[13:33:11] ah, yeah, whatever the defaults are
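To make the "fetch more per batch" idea concrete, here is a hedged sketch of a more aggressive batching policy than the values quoted above — the numbers are illustrative guesses, and whether this helps depends on where the bottleneck actually is (which is exactly what the missing client metrics would tell us):

```yaml
# Sketch only: larger batches than the current count: 2048 / period: 100ms.
# Values are untested guesses; byte_size bounds the memory used per batch.
batching:
  count: 8192
  byte_size: 33554432   # ~32 MiB
  period: 500ms
```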
[13:33:54] (it's been a while, I haven't touched it for a long time since it was happy, and apparently I forgot everything '^^)
[13:34:15] I dream of the day I'd be able to say that about Computers
[13:34:32] yeaah :D
[13:35:38] Idea for an SRE-oriented horror story: all over the world, all shell and browser histories disappear overnight
[13:36:09] chatgpt would surely be able to help you out
[13:36:29] It would help you to a cliff you can throw yourself off of :)
[13:36:58] load-bearing shell histories
[13:47:13] Lucas_WMDE: b.black has opinions about this :)
[14:38:42] you can have my shell history when you pry it from my cold dead hands!
[14:40:20] ^
[14:54:57] * urandom probably should have put a smiley face on the end of that...
[18:00:18] FYI, I've deployed a change to mw-on-k8s that renames the "local_service" envoy cluster to the service-specific "LOCAL_{release-name}" (e.g., LOCAL_mediawiki-canary).
[18:00:18] I believe I've fixed all dashboards and alerts that explicitly use "local_service" when querying mw-related envoy metrics. However, if you encounter a graph that looks suspiciously empty after 17:45 UTC today, please let me know :)
[21:22:18] I've concluded my brain cannot remember puppet-disable vs disable-puppet, I get it wrong more than 50% of the time
[21:24:46] jhathaway: just add an alias to your dotfiles in prod ;)
[21:28:13] then I'll also have to remember not to use sudo \o/, tempted to go on a rant about how I dislike sudo and its interaction with your shell
[21:28:17] for me it's always "Ctrl+R dis"
[21:31:49] much shorter ;)
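A tiny sketch of the alias suggestion, assuming disable-puppet/enable-puppet are the canonical wrapper names on prod hosts; baking sudo into the alias also sidesteps the "sudo doesn't expand aliases" annoyance mentioned above.

```sh
# Hypothetical ~/.bashrc snippet: either spelling now works, sudo included.
alias puppet-disable='sudo disable-puppet'
alias puppet-enable='sudo enable-puppet'
# Arguments are appended as usual, e.g.: puppet-disable "reason for disabling"
```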