[06:29:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:39:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:04:04] Hello! Do you know if it's possible to have some alerting for when this happens? https://phabricator.wikimedia.org/T388641#10720712
[11:22:43] XioNoX: mmhh, in theory absent(...), though "it depends": does the whole metric disappear, or is it a partial gnmi result where some interfaces are there and some aren't? Or it might make sense to alert on gnmi error counters, if that's a thing
[11:24:53] godog: that's what we're trying to figure out iiuc; from a prometheus point of view the metric is missing (not exposed by the exporter) for one export cycle (topranks correct me if I'm wrong)
[11:31:08] ack ok
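[ed. note: a minimal sketch of the absent()-style check suggested above, not a deployed rule; the metric name, instance, and job label are borrowed from the count_over_time() example later in this log. As the caveat above implies, absent() only fires when no series matches the selector at all, so it would not catch a partial gNMI result where only some interfaces are missing.]

    # hypothetical: returns 1 (and could alert) when this series vanishes entirely for one target
    absent(gnmi_interfaces_interface_state_counters_in_octets{job="gnmi", instance="cloudsw1-d5-eqiad:9804"})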
[11:31:11] topranks: in the thanos query, how do you know that those times are gaps?
[11:33:08] looking at the host's metrics https://grafana.wikimedia.org/d/000000377/host-overview?var-server=netflow1002&orgId=1&from=now-3h&to=now or the gnmic metrics https://grafana-rw.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?orgId=1&var-site=eqiad&from=now-3h&to=now there is nothing that stands out
[11:44:41] there is some kind of signal here for "prometheus isn't really getting all the samples from gnmi"
[11:44:44] https://w.wiki/DkeN
[11:44:54] I've put eqiad and codfw to show the difference
[11:45:17] some swings are normal, but not in the 100s I'd say
[11:45:51] not sustained anyways, you get the idea
[11:47:06] godog: nice, what's the unit? why can it be negative?
[11:47:48] "number of samples in the last scrape"
[11:48:02] the delta over 1h, so it can be negative
[11:48:15] right
[11:49:01] it may need some tweaks, but you get the idea: basically, alert on the prometheus end of the scrape
[11:50:22] yeah, now to figure out the tweaks :)
[11:50:49] it's interesting that it only happens in eqiad, as codfw has many more targets (39 vs. 25)
[11:51:01] https://grafana.wikimedia.org/goto/HBAoqoANR?orgId=1
[11:52:48] it is interesting indeed
[11:53:57] ohh, but more verbose hosts in eqiad?
[11:55:01] from https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic it's a bit more than 275 messages/s vs. ~350 for eqiad
[12:04:32] XioNoX, godog: I may be wrong so please correct
[12:04:35] *me
[12:04:57] I noticed gaps in the graphs, and there are several minutes where there is no measurement returned in the thanos query
[12:05:01] topranks: unfortunately I don't think you are :)
[12:05:08] i.e. here
[12:05:08] https://phabricator.wikimedia.org/T388641#10720712
[12:05:36] I assume if I look at "metric{}[30m]" in the thanos web gui I should see all the samples for that metric for the past 30 mins?
[12:06:39] it's only quite minor, I've spotted two gaps in the past week. But in weeks prior to that I didn't see any, so I guess my worry is that we're at or close to some limitation on the gnmic processing side (symptoms basically the same as previously)
[12:06:51] yeah agreed
[12:07:02] better iron them out before adding more metrics
[12:07:58] yeah exactly, figured it was best to address it sooner rather than later
[12:08:06] last time increasing the thread count worked, and it may do again
[12:08:08] topranks: one thing is that I'm surprised by the number of metrics for cr1/2-eqiad https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?orgId=1&var-site=All&from=now-30m&to=now
[12:08:18] though it'd be good to understand what limitation we are hitting and have some measurement of it
[12:08:23] it might be because we have lots of 10G linecards
[12:08:37] I think it's mainly the BGP stats for all the IX peers
[12:08:51] certainly we've a lot more than for codfw, and that was a part of the problem before
[12:09:10] ok, makes sense
[12:16:33] I think for a given metric we can find how many samples we have for a given period
[12:16:35] count_over_time(gnmi_interfaces_interface_state_counters_in_octets{instance="cloudsw1-d5-eqiad:9804", interface_name="em0", job="gnmi"}[1h])
[12:17:02] ^^ for instance this returns 50, but unless I'm missing something we should be closer to 60 for an hour's worth of data
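[ed. note: assuming a 60s scrape interval (hence the ~60 samples per hour expected above), the same count_over_time() can be turned into a sweep across all series for this metric to surface which targets/interfaces are dropping samples; the 55 cut-off below is an arbitrary illustration, not an agreed threshold.]

    # hypothetical sweep: series with noticeably fewer than ~60 samples over the last hour
    count_over_time(gnmi_interfaces_interface_state_counters_in_octets{job="gnmi"}[1h]) < 55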
[12:17:16] I was also looking at https://grafana.wikimedia.org/d/CgCw8jKZz/go-metrics?orgId=1&var-job=gnmic&var-instance=netflow1002:7890&from=now-1h&to=now usage is indeed higher, but is it too high?
[12:18:16] I had a quick look at those and had the same question, there doesn't seem to be anything clearly showing "this is at a problem level"
[12:18:23] I think one of the issues with bumping the CPU or RAM is that I don't see any indicator that we're maxing out on either
[12:18:27] but that may just be our ignorance of what problem levels might be
[12:18:31] true
[12:18:46] same with https://grafana.wikimedia.org/d/000000377/host-overview?var-server=netflow1002
[12:18:48] I guess the CPU + threads increase fixed it before, which is the only reason it might work
[12:19:17] but even if it does, not being able to measure how far we are away from problems is an issue
[12:20:38] potentially just increasing the threads, but not the CPU core count, may help.
[12:22:29] I spent some time when we last had problems staring at the gnmic threads in htop trying to see if all were being used
[12:22:31] i.e.
[12:22:35] https://usercontent.irccloud-cdn.com/file/ISO1IPKj/image.png
[12:23:00] and mostly it looks like that: at all times there are a few threads with 0% cpu usage, which would seem to suggest there is no point adding more, it's not making use of all it has now
[12:23:22] the screenshot actually caught a good time - mostly no thread is near 100%, but there we see one with 'R' and usage 101.2%
[12:23:35] I don't really know what that means, but perhaps it's relevant
[12:24:39] Ok, 'R' means running, 'S' means sleeping
[12:26:40] I think the 101% can possibly be due to VM scheduling and how htop is calculating cpu ticks
[12:27:01] but either way occasionally individual threads are at 100% on one core, again unsure if that's relevant
[12:27:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[12:32:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[12:58:07] topranks: I added a few more metrics to https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic but still no smoking gun
[12:59:24] XioNoX: nice
[12:59:34] but yeah, nothing jumping out there that I can see
[13:00:19] definitely not an emergency I think, gaps seem rare
[13:00:36] agreed, but a blocker to move forward
[13:01:16] it would be interesting to know if there are more frequent dropped measurements than we notice
[13:01:43] like for instance if we have regular gaps but graphs/alerts smooth them out so we don't see them
[13:02:22] getting "50" from the count_over_time() query above sort of suggests that
[13:03:18] topranks: there is also that graph shared earlier by godog https://w.wiki/DkeN
[13:04:54] I was struggling to properly understand that one
[13:05:28] it's showing the difference in the number of metrics scraped between eqiad/codfw
[13:06:02] eh, I asked the same question earlier in the chan
[13:06:21] XioNoX> godog: nice, what's the unit? why can it be negative?
[13:06:21] "number of samples in the last scrape"
[13:06:21] the delta over 1h, so it can be negative
[13:08:02] if you zoom out, the issue started on April 3rd
[13:08:15] but afaik we didn't start collecting new metrics that day
[13:08:31] so basically it's the change in the number of metrics scraped now vs 1 hour ago?
[13:08:48] that's how I understand it, yeah
[13:09:01] Ok right, so it's useful
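[ed. note: the query behind https://w.wiki/DkeN isn't expanded in this log; from the description above ("number of samples in the last scrape", delta over 1h, negative when fewer samples) it is presumably something along these lines. The exact metric name scrape_samples_scraped is an assumption.]

    # change in the number of samples returned by each gnmi scrape vs. one hour ago;
    # negative values mean the target is exposing fewer samples now
    delta(scrape_samples_scraped{job="gnmi"}[1h])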
[13:09:29] I noticed it Thursday when doing the WMCS work first, April 3rd
[13:09:29] the latest change was made on the 1st
[13:09:56] Ok. And puppet restarts the service nowadays, doesn’t it?
[13:10:01] yeah
[13:11:25] also should mention - what I noticed before
[13:11:45] where we have multiple Prometheus instances we can have different “gaps” in each
[13:12:47] it can manifest, if using the Prometheus data source (behind lvs afaik), as gaps suddenly appearing then disappearing on a graph when it refreshes, depending on which prom instance it hits
[13:14:00] must be something to do with when it polls and what stage gnmic is at with processing
[13:14:24] one event that barely matches is the cr2-eqord upgrade: https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?orgId=1&from=1743638400000&to=1743724799000&var-site=eqiad
[13:14:56] maybe more metrics were available in the new version and it brought it above the limit?
[13:15:46] probably a question for the gNMIc devs, I'll need to think about how exactly to ask
[13:17:50] ah yes… could we have just been nudged over some limit
[13:19:32] though the “rate of responses” for cr2-eqord doesn’t seem to have changed
[13:20:05] but yeah, I think we might need to ask the gnmic devs
[13:24:30] XioNoX: also on April 3rd I added the new BGP sessions to the cloudsw
[13:24:51] in total that’s quite a low number, but a small increase in metrics from that anyway
[13:27:30] ok, yeah
[13:31:08] yeah, in terms of detecting problems I think the expression above can be used as a proxy/alert
[13:32:26] maybe sth like this https://w.wiki/Dkgc
[13:33:05] not sure about the exact threshold actually, maybe 100
[13:35:36] yeah it’s a good one alright. The numbers will fluctuate (as interfaces get enabled, bgp peerings are added/removed etc) but we can probably get a good threshold for it
[13:37:04] yes, 100 + "for 15m" or sth similar seems to work well
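[ed. note: a sketch of what the alerting rule discussed above might look like, combining the delta-style expression, the 100 threshold, and "for 15m" into the standard Prometheus alerting-rule format. The real expression is behind https://w.wiki/Dkgc; the alert/group names, the severity, the metric name, and the use of abs() are all assumptions made for illustration.]

    groups:
      - name: gnmi_scrape_health          # hypothetical group name
        rules:
          - alert: GnmiScrapeSamplesDrop  # hypothetical alert name
            # assumes the w.wiki/Dkgc query is a 1h delta of scrape_samples_scraped;
            # abs() so both large drops and large jumps in sample count trip the alert
            expr: abs(delta(scrape_samples_scraped{job="gnmi"}[1h])) > 100
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "gNMI scrape on {{ $labels.instance }} returns a very different sample count than 1h ago"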
[14:08:27] topranks: https://github.com/openconfig/gnmic/issues/640 let's see what comes back
[14:08:51] and let me know if I can improve it
[14:09:17] good stuff!
[14:10:31] I think we might want to add that the likely culprit is the event-value-tag-v2 output processor
[14:10:47] even though we don't have hard evidence, that seemed to be the main driver when we had these issues before
[14:13:16] probably yeah, not sure if it's worth adding though
[14:15:32] well if we know what is driving the performance issues it's surely worth mentioning?
[14:16:00] I'm trying to find if we have any data from the last time we had this to back up that conjecture
[14:16:03] this was the task
[14:16:03] https://phabricator.wikimedia.org/T386807
[14:16:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[14:16:40] I think I'm overthinking it by trying not to risk putting them on a false track
[14:16:57] yeah, we want to avoid that
[14:17:45] yeah, the previous time I tested it was with the old event-value-tag, not the v2 version
[14:17:55] so that remains a suspicion, we've nothing clear to back it up
[14:18:13] added a line to the ticket
[14:18:16] let me add a comment linking the above thread and referencing the previous performance optimisation thread we had with them anyway
[14:18:18] ah ok, cool
[14:18:27] feel free to add more of course
[14:18:32] that'll point them in that direction anyway, which seems a sensible place to start
[14:21:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[18:14:49] FIRING: ThanosQueryHttpRequestQueryErrorRateHigh: Thanos Query Frontend is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryErrorRateHigh
[18:19:49] RESOLVED: ThanosQueryHttpRequestQueryErrorRateHigh: Thanos Query Frontend is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryErrorRateHigh
[18:36:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[18:41:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[18:55:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[19:00:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[21:20:36] I just deployed a patch that should significantly reduce the indexing errors we've seen lately. Will keep watch 👀
[21:21:02] cwhite: Nice!! Thank you!! ^^
[22:41:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:51:40] RESOLVED: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag