[06:29:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:39:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:04:04] Hello! Do you know if it's possible to have some alerting for when this happens? https://phabricator.wikimedia.org/T388641#10720712
[11:22:43] XioNoX: mmhh, in theory absent(...), though "it depends": does the whole metric disappear, or is it a partial gnmi result where some interfaces are there and some aren't? Or it might make sense to alert on gnmi error counters, if that's a thing
[11:24:53] godog: that's what we're trying to figure out iiuc; from a prometheus point of view the metric is missing (not exposed by the exporter) for one export cycle (topranks correct me if I'm wrong)
[11:31:08] ack ok
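[ed. note: a minimal sketch of the absent()-style check suggested above, not a deployed rule; the metric name, instance, and job label are borrowed from the count_over_time() example later in this log. As the caveat above implies, absent() only fires when no series matches the selector at all, so it would not catch a partial gNMI result where only some interfaces are missing.]

    # hypothetical: returns 1 (and could alert) when this series vanishes entirely for one target
    absent(gnmi_interfaces_interface_state_counters_in_octets{job="gnmi", instance="cloudsw1-d5-eqiad:9804"})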
[11:31:11] topranks: in the thanos query, how do you know that those times are gaps?
[11:33:08] looking at the host's metrics https://grafana.wikimedia.org/d/000000377/host-overview?var-server=netflow1002&orgId=1&from=now-3h&to=now or the gnmic metrics https://grafana-rw.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?orgId=1&var-site=eqiad&from=now-3h&to=now there is nothing that stands out
[11:44:41] there is some kind of signal here for "prometheus isn't really getting all the samples from gnmi"
[11:44:44] https://w.wiki/DkeN
[11:44:54] I've put eqiad and codfw to show the difference
[11:45:17] some swings are normal, but not in the 100s I'd say
[11:45:51] not sustained anyways, you get the idea
[11:47:06] godog: nice, what's the unit? why can it be negative?
[11:47:48] "number of samples in the last scrape"
[11:48:02] the delta over 1h, so it can be negative
[11:48:15] right
[11:49:01] it may need some tweaks, but you get the idea: basically, alert on the prometheus end of the scrape
[11:50:22] yeah, now to figure out the tweaks :)
[11:50:49] it's interesting that it only happens in eqiad, as codfw has many more targets (39 vs. 25)
[11:51:01] https://grafana.wikimedia.org/goto/HBAoqoANR?orgId=1
[11:52:48] it is interesting indeed
[11:53:57] ohh, but more verbose hosts in eqiad?
[11:55:01] from https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic it's a bit more than 275 messages/s vs. ~350 for eqiad
[12:04:32] XioNoX, godog: I may be wrong so please correct
[12:04:35] *me
[12:04:57] I noticed gaps in the graphs, and there are several minutes where there is no measurement returned in the thanos query
[12:05:01] topranks: unfortunately I don't think you are :)
[12:05:08] i.e. here
[12:05:08] https://phabricator.wikimedia.org/T388641#10720712
[12:05:36] I assume if I look at "metric{}[30m]" in the thanos web gui I should see all the samples for that metric for the past 30 mins?
[12:06:39] it's only quite minor, I've spotted two gaps in the past week. But in weeks prior to that I didn't see any, so I guess my worry is that we're at or close to some limitation on the gnmic processing side (symptoms basically the same as previously)
[12:06:51] yeah agreed
[12:07:02] better iron them out before adding more metrics
[12:07:58] yeah exactly, figured it was best to address it sooner rather than later
[12:08:06] last time increasing the thread count worked, and it may do again
[12:08:08] topranks: one thing is that I'm surprised by the number of metrics for cr1/2-eqiad https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?orgId=1&var-site=All&from=now-30m&to=now
[12:08:18] though it'd be good to understand what limitation we are hitting and have some measurement of it
[12:08:23] it might be because we have lots of 10G linecards
[12:08:37] I think it's mainly the BGP stats for all the IX peers
[12:08:51] certainly we've a lot more than for codfw, and that was a part of the problem before
[12:09:10] ok, makes sense
[12:16:33] I think for a given metric we can find how many samples we have for a given period
[12:16:35] count_over_time(gnmi_interfaces_interface_state_counters_in_octets{instance="cloudsw1-d5-eqiad:9804", interface_name="em0", job="gnmi"}[1h])
[12:17:02] ^^ for instance this returns 50, but unless I'm missing something we should be closer to 60 for an hour's worth of data
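[ed. note: assuming a 60s scrape interval (hence the ~60 samples per hour expected above), the same count_over_time() can be turned into a sweep across all series for this metric to surface which targets/interfaces are dropping samples; the 55 cut-off below is an arbitrary illustration, not an agreed threshold.]

    # hypothetical sweep: series with noticeably fewer than ~60 samples over the last hour
    count_over_time(gnmi_interfaces_interface_state_counters_in_octets{job="gnmi"}[1h]) < 55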
[12:17:16] I was also looking at https://grafana.wikimedia.org/d/CgCw8jKZz/go-metrics?orgId=1&var-job=gnmic&var-instance=netflow1002:7890&from=now-1h&to=now usage is indeed higher, but is it too high?
[12:18:16] I had a quick look at those and had the same question, there doesn't seem to be anything clearly showing "this is at a problem level"
[12:18:23] I think one of the issues with bumping the CPU or RAM is that I don't see any indicator that we're maxing out on either
[12:18:27] but that may just be our ignorance of what problem levels might be
[12:18:31] true
[12:18:46] same with https://grafana.wikimedia.org/d/000000377/host-overview?var-server=netflow1002
[12:18:48] I guess the CPU + threads increase fixed it before, which is the only reason it might work
[12:19:17] but even if it does, not being able to measure how far we are away from problems is an issue
[12:20:38] potentially just increasing the threads, but not the CPU core count, may help.
[12:22:29] I spent some time when we last had problems staring at the gnmic threads in htop trying to see if all were being used
[12:22:31] i.e.
[12:22:35] https://usercontent.irccloud-cdn.com/file/ISO1IPKj/image.png
[12:23:00] and mostly it looks like that: at all times there are a few threads with 0% cpu usage, which would seem to suggest there is no point adding more, it's not making use of all it has now
[12:23:22] the screenshot actually caught a good time - mostly no thread is near 100%, but there we see one with 'R' and usage 101.2%
[12:23:35] I don't really know what that means, but perhaps it's relevant
[12:24:39] Ok, 'R' means running, 'S' means sleeping
[12:26:40] I think the 101% can possibly be due to VM scheduling and how htop is calculating cpu ticks
[12:27:01] but either way occasionally individual threads are at 100% on one core, again unsure if that's relevant
[12:27:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[12:32:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[12:58:07] topranks: I added a few more metrics to https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic but still no smoking gun
[12:59:24] XioNoX: nice
[12:59:34] but yeah, nothing jumping out there that I can see
[13:00:19] definitely not an emergency I think, gaps seem rare
[13:00:36] agreed, but a blocker to move forward
[13:01:16] it would be interesting to know if there are more frequent dropped measurements than we notice
[13:01:43] like for instance if we have regular gaps but graphs/alerts smooth them out so we don't see them
[13:02:22] getting "50" from the count_over_time() query above sort of suggests that
[13:03:18] topranks: there is also that graph shared earlier by godog https://w.wiki/DkeN
[13:04:54] I was struggling to properly understand that one
[13:05:28] it's showing the difference in the number of metrics scraped between eqiad/codfw
[13:06:02] eh, I asked the same question earlier in the chan
[13:06:21] XioNoX> godog: nice, what's the unit? why can it be negative?
[13:06:21] "number of samples in the last scrape"
[13:06:21] the delta over 1h, so it can be negative
[13:08:02] if you zoom out, the issue started on April 3rd
[13:08:15] but afaik we didn't start collecting new metrics that day
[13:08:31] so basically it's the change in the number of metrics scraped now vs 1 hour ago?
[13:08:48] that's how I understand it, yeah
[13:09:01] Ok right, so it's useful
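[ed. note: the query behind https://w.wiki/DkeN isn't expanded in this log; from the description above ("number of samples in the last scrape", delta over 1h, negative when fewer samples) it is presumably something along these lines. The exact metric name scrape_samples_scraped is an assumption.]

    # change in the number of samples returned by each gnmi scrape vs. one hour ago;
    # negative values mean the target is exposing fewer samples now
    delta(scrape_samples_scraped{job="gnmi"}[1h])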
[13:09:29] I noticed it Thursday when doing the WMCS work first, April 3rd
[13:09:29] the latest change was made on the 1st
[13:09:56] Ok. And puppet restarts the service nowadays, doesn’t it?
[13:10:01] yeah
[13:11:25] also should mention - what I noticed before
[13:11:45] where we have multiple Prometheus instances we can have different “gaps” in each
[13:12:47] it can manifest, if using the Prometheus data source (behind lvs afaik), as gaps suddenly appearing then disappearing on a graph when it refreshes, depending on which prom instance it hits
[13:14:00] must be something to do with when it polls and what stage gnmic is at with processing
[13:14:24] one event that barely matches is the cr2-eqord upgrade: https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?orgId=1&from=1743638400000&to=1743724799000&var-site=eqiad
[13:14:56] maybe more metrics were available in the new version and it brought it above the limit?
[13:15:46] probably a question for the gNMIc devs, I'll need to think about how exactly to ask
[13:17:50] ah yes… could we have just been nudged over some limit
[13:19:32] though the “rate of responses” for cr2-eqord doesn’t seem to have changed
[13:20:05] but yeah, I think we might need to ask the gnmic devs
[13:24:30] XioNoX: also on April 3rd I added the new BGP sessions to the cloudsw
[13:24:51] in total that’s quite a low number, but a small increase in metrics from that anyway
[13:27:30] ok, yeah
[13:31:08] yeah, in terms of detecting problems I think the expression above can be used as a proxy/alert
[13:32:26] maybe sth like this https://w.wiki/Dkgc
[13:33:05] not sure about the exact threshold actually, maybe 100
[13:35:36] yeah it’s a good one alright. The numbers will fluctuate (as interfaces get enabled, bgp peerings are added/removed etc) but we can probably get a good threshold for it
[13:37:04] yes, 100 + "for 15m" or sth similar seems to work well
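[ed. note: a sketch of what the alerting rule discussed above might look like, combining the delta-style expression, the 100 threshold, and "for 15m" into the standard Prometheus alerting-rule format. The real expression is behind https://w.wiki/Dkgc; the alert/group names, the severity, the metric name, and the use of abs() are all assumptions made for illustration.]

    groups:
      - name: gnmi_scrape_health          # hypothetical group name
        rules:
          - alert: GnmiScrapeSamplesDrop  # hypothetical alert name
            # assumes the w.wiki/Dkgc query is a 1h delta of scrape_samples_scraped;
            # abs() so both large drops and large jumps in sample count trip the alert
            expr: abs(delta(scrape_samples_scraped{job="gnmi"}[1h])) > 100
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "gNMI scrape on {{ $labels.instance }} returns a very different sample count than 1h ago"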
[14:08:27] topranks: https://github.com/openconfig/gnmic/issues/640 let's see what comes back
[14:08:51] and let me know if I can improve it
[14:09:17] good stuff!
[14:10:31] I think we might want to add that the likely culprit is the event-value-tag-v2 output processor
[14:10:47] even though we don't have hard evidence, that seemed to be the main driver when we had these issues before
[14:13:16] probably yeah, not sure if it's worth adding though
[14:15:32] well if we know what is driving the performance issues it's surely worth mentioning?
[14:16:00] I'm trying to find if we have any data from the last time we had this to back up that conjecture
[14:16:03] this was the task
[14:16:03] https://phabricator.wikimedia.org/T386807
[14:16:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[14:16:40] I think I'm overthinking it by trying not to risk putting them on a false track
[14:16:57] yeah, we want to avoid that
[14:17:45] yeah, the previous time I tested it was with the old event-value-tag, not the v2 version
[14:17:55] so that remains a suspicion, we've nothing clear to back it up
[14:18:13] added a line to the ticket
[14:18:16] let me add a comment linking the above thread and referencing the previous performance optimisation thread we had with them anyway
[14:18:18] ah ok, cool
[14:18:27] feel free to add more of course
[14:18:32] that'll point them in that direction anyway, which seems a sensible place to start
[14:21:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[18:14:49] FIRING: ThanosQueryHttpRequestQueryErrorRateHigh: Thanos Query Frontend is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryErrorRateHigh
[18:19:49] RESOLVED: ThanosQueryHttpRequestQueryErrorRateHigh: Thanos Query Frontend is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryErrorRateHigh
[18:36:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[18:41:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[18:55:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[19:00:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[21:20:36] I just deployed a patch that should significantly reduce the indexing errors we've seen lately. Will keep watch 👀
[21:21:02] cwhite: Nice!! Thank you!! ^^
[22:41:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:51:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag