[15:17:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[15:22:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[16:16:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[17:03:28] hi o11y, following the deploy of wmf.11, I've just turned on mediawiki-native traces for group1 wikis
[17:04:36] I am going to turn the sampling rates down a bit, as I think we're gathering way more than is useful -- but we'll still be sending more trace data into opensearch anyway, so let me know if that starts causing any issues
[17:07:34] ack, thank you cdanis !
[17:34:14] question: I have a service whose logs don't seem to be getting parsed correctly — https://w.wiki/CfuE
[17:34:38] It's (supposed to be) ECS, but the log object isn't being decoded
[17:35:30] are we/did we do something wrong here?
[17:35:48] I'd like to be able to filter on log level
[17:36:03] Yeah, this doesn't seem right: {"level":"INFO"}
[17:36:37] see also: https://www.elastic.co/guide/en/ecs/current/ecs-log.html#field-log-level
[17:36:48] urandom: Could you share the Gerrit link to the code emitting those logs?
[17:37:35] denisse: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/servicelib-golang/+/refs/heads/master/logger/logger.go
[17:38:03] Also, looking at the JSON object received in Logstash, it seems to have two levels:
[17:38:07] https://www.irccloud.com/pastebin/tkEQxDAq/
[17:38:16] yes, exactly
[17:38:24] same as the service object
[17:38:31] (which *is* being parsed)
[17:39:12] denisse: this is my understanding of the ECS here, it's based on this: https://www.elastic.co/guide/en/ecs/current/ecs-log.html#field-log-level
[17:40:08] urandom: just to double-check, do the logs you're generating have an ecs.version field? https://doc.wikimedia.org/ecs/#field-ecs-version
[17:40:57] cwhite: they do not...
[17:41:13] "ECS version this event conforms to. ecs.version is a required field and must exist in all events."
[17:41:31] ahh, yeah you'll want that. It's how we detect ECS events :)
[17:42:04] Ok, if I change that service library... what happens?
[17:42:23] anything that uses it will have its logs parsed differently? routed elsewhere?
[17:42:37] I'm trying to establish what is going to break :)
[17:43:39] Adding the field means they'll be routed to the ECS indexes. If there are dashboards that assume these logs are legacy, then those will break. ECS and legacy are very much incompatible.
[17:43:55] awesome.
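(For reference, a minimal sketch of what an ECS-shaped log event with the required ecs.version field could look like when emitted from a Go service. This is illustrative only, not the actual servicelib-golang logger; the field placement follows the ECS docs linked above, and the service name is hypothetical.)

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// ecsEvent is an illustrative, minimal ECS-shaped log event. It is not the
// servicelib-golang logger implementation; it only shows where the required
// ecs.version field and the nested log.level field sit in the JSON output.
type ecsEvent struct {
	Timestamp string `json:"@timestamp"`
	Message   string `json:"message"`
	Ecs       struct {
		Version string `json:"version"`
	} `json:"ecs"`
	Log struct {
		Level string `json:"level"`
	} `json:"log"`
	Service struct {
		Name string `json:"name"`
	} `json:"service"`
}

func main() {
	var e ecsEvent
	e.Timestamp = time.Now().UTC().Format(time.RFC3339)
	e.Message = "request handled"
	// Without ecs.version, the pipeline treats the event as a legacy log.
	// The supported version ceiling is discussed further below.
	e.Ecs.Version = "1.11.0"
	// log.level is nested under "log" per the ECS spec linked above,
	// rather than being a top-level "level" field.
	e.Log.Level = "INFO"
	e.Service.Name = "example-service" // hypothetical service name
	json.NewEncoder(os.Stdout).Encode(e)
}
```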
[17:45:09] legit awesome in that I have my answer now (thanks for that), sardonically awesome in that I have stumbled into a larger problem than I realized. :)
[17:45:40] :(
[17:51:07] On the other hand, correcting it will make it work correctly :)
[17:51:48] indeed. :)
[17:52:35] cwhite: any constraints on the version of the ecs to use?
[17:52:48] s/the ecs/ecs/
[17:53:50] as in, at least version X, and nothing newer than Y...?
[17:54:27] We support 1.11.0 and below (https://doc.wikimedia.org/ecs/#_overview). The patch version is our overlaid modifications.
[17:56:13] IOW, omit the `-7`, as it's not needed in the value for ecs.version.
[17:56:51] perfect, thanks!
[20:16:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
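(Putting the version guidance above together: ecs.version should be a plain semver no newer than 1.11.0, with the Wikimedia-specific "-7" patch suffix omitted. A hedged sketch of how a logger might pin that; the package and constant names are hypothetical, not taken from servicelib-golang.)

```go
package logger

// ECSVersion is the ECS schema version declared in every emitted event.
// Illustrative sketch only: per the discussion above, the pipeline supports
// ECS 1.11.0 and below, and the "-7" patch suffix in the Wikimedia ECS docs
// is a local overlay that should be omitted from the ecs.version value.
const ECSVersion = "1.11.0"
```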