[14:39:32] hey folks!
[14:39:36] if anybody has time https://gerrit.wikimedia.org/r/c/operations/puppet/+/1112223
[14:39:52] puppet on kafkamon1003 is broken, I think due to a previous commit :(
[14:41:43] elukey: LGTM
[14:42:47] thanks!
[14:42:48] merging
[14:43:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[14:45:28] worked :)
[14:48:26] Is it just me, or is Grafana not having a good time right now?
[14:48:35] I'm seeing lots of `Firefox can’t establish a connection to the server at wss://grafana-rw.wikimedia.org/api/live/ws.` and the equivalent in Chromium
[14:48:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[14:50:10] this also recovered --^
[14:58:37] MichaelG_WMF: yes, I see the websocket connection failures, though IIRC I've seen those before; the live feature doesn't work but the rest does
[14:58:57] MichaelG_WMF: is there something else in grafana that's currently not working?
[15:00:35] no, it is probably mostly those live features. I was hoping for them to help me understand Prometheus, but maybe that part just plain doesn't work
[15:04:39] godog: right now I'm trying to figure out how I'm supposed to deal with getting my data with many different "instance" labels and all such meaningless boilerplate
[15:06:10] godog: that being said, on Chromium the board https://grafana-rw.wikimedia.org/d/ff15559c-b4a2-4363-94c8-190a086b3315/michael-s-playground?forceLogin&forceLogin=true&from=now-7d&orgId=1&to=now is straight up not working, with `handleAnnotationQueryRunnerError TypeError: d[E] is not iterable
[15:06:10] at Object.i [as getUrlSearchParams] (url.ts:123:28)
[15:06:10] at Object.next (queryAnalytics.ts:14:20)` errors
[15:06:54] MichaelG_WMF: ok I'm taking a look
[15:06:59] (but it works well enough on Firefox, which is my main browser anyway, so that does not matter much for me)
[15:07:28] Chromium Version 132.0.6834.83 (Official Build) snap (64-bit) on Ubuntu
[15:07:32] in case that is relevant
[15:10:41] thank you, I could definitely reproduce the problem on chrome here too
[15:11:20] bizarre that removing =true from "forcelogin=true" does remove the error for me; what linked to that, MichaelG_WMF ?
[15:11:52] FYI, prometheus1006 went down, there was an alert in -operations
[15:11:55] godog: The sign-in button 🤷
[15:12:05] nothing in SEL, but the serial console is dead
[15:12:28] MichaelG_WMF: ok, will take a deeper look in a bit
[15:12:34] moritzm: ack, thank you, will check
[15:12:42] shall I powercycle?
[15:12:46] but good to know that this is something I could try wiggling to see if things get better
[15:12:51] godog thanks!
[15:12:55] moritzm: yes please
[15:13:21] done
[15:15:22] took an eternity to POST, but it's booting now
[15:16:15] it must be getting out of bed
[15:16:54] looks like we're back
[15:18:56] no visible sign in the system logs of why it went down, though
[15:19:14] indeed
[15:20:21] I'd say if it happens again, let's have DC ops update the firmware; otherwise ignore it as a random one-off
[15:21:11] moritzm: will do!
[15:22:41] FIRING: [263x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[15:22:52] FIRING: [2x] ThanosRuleHighRuleEvaluationFailures: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[15:23:43] moritzm: I see prometheus came back up and running, thanks
[15:25:28] MichaelG_WMF: from https://grafana.wikimedia.org/d/ff15559c-b4a2-4363-94c8-190a086b3315/michael-s-playground?orgId=1 the "sign in" link reads https://grafana.wikimedia.org/d/ff15559c-b4a2-4363-94c8-190a086b3315/michael-s-playground?orgId=1&forceLogin=true which ultimately redirects me to
[15:25:33] https://grafana-rw.wikimedia.org/d/ff15559c-b4a2-4363-94c8-190a086b3315/michael-s-playground?orgId=1&forceLogin=true though I haven't been able to reproduce the link above (on chrome)
[15:27:41] RESOLVED: [296x] PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures (instance titan1001:17902) - https://wikitech.wikimedia.org/wiki/Prometheus - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRuleEvaluationFailures
[15:27:52] RESOLVED: [2x] ThanosRuleHighRuleEvaluationFailures: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[16:27:46] Do we have infrastructure in place to generate emails based on specific log messages? I am looking for ways to replicate our secret repo commit emails with OpenBao's logging pipeline
[16:35:37] We don't have anything like that now, but rsyslog does have an `ommail` module.
[16:35:57] https://www.rsyslog.com/doc/configuration/modules/ommail.html
[16:42:03] cwhite: thanks, my preference would be to trigger the emails / alerts after the audit logs have been sent to logstash. Are there any hooks on that side?
[16:43:53] Logstash has an email output module: https://www.elastic.co/guide/en/logstash/7.17/plugins-outputs-email.html
[16:44:45] Probably easier to work with too :)
[16:45:45] godog: Ah, as far as I can tell, that double `forceLogin` might come from getting logged out (timeout) while working on Grafana -> logging in again -> the login link adds another `forceLogin` -> all the problems
[16:48:04] cwhite: thanks
[20:24:18] * cwhite looking
[20:29:06] cdanis: the first message being eaten being `2025-01-16T20:58:55.04940135Z stderr F (node:1) Warning: "version" is a reserved word.` ?
[20:29:27] sorry no, the first message being eaten is 2025-01-16T20:58:55.598110972Z stdout F {"name":"chart-service","hostname":"chart-renderer-production-d98f949c7-dsvst","pid":1,"level":30,"msg":"Ready to serve","server":{"port":6284},"levelPath":"info","time":"2025-01-16T20:58:55.597Z","v":0}
[20:29:39] Got it
[20:29:53] that message wasn't being eaten previously, but is now (I did multiple deploys yesterday)
[20:30:11] That message went to the DLQ because we've exceeded the total number of fields allowed by the index (2048)
[20:31:42] how can you tell? because you know that about that index? or is the DLQ usually enabled now?
[20:31:46] cdanis: https://logstash.wikimedia.org/app/discover#/doc/19d32430-85d8-11eb-8ab2-63c7f3b019fc/dlq-default-1-1.0.0-1-2025.01.16?id=iRnpcJQB60NnMROHkQZL
[20:31:51] ahh
[20:32:29] is the answer to switch to ECS 😅
[20:32:43] haha
[20:33:06] (btw, I'm guessing https://wikitech.wikimedia.org/wiki/Logstash#Using_the_dead_letter_queue needs an update)
[20:33:40] ohhh, yeah that def needs an update
[20:33:47] we have it enabled all the time now
[20:34:05] that's awesome, I saw that yesterday and was like "oh, too bad"
[20:34:27] and now I can look at https://logstash.wikimedia.org/goto/2e6e1c1e25fac032fe1f16bd6f392aad and... well, there it is
[20:35:58] ah okay, and there are a few other messages that got caught there (but it's nondeterministic on each daily index creation, I'm guessing?): https://logstash.wikimedia.org/goto/1eff8ed9e499a1e3e3b58a3601f8e50a
[20:36:37] That's correct.
[20:37:09] Events that try to create fields that exceed the maximum field limit will be unceremoniously DLQ'd once encountered.
[20:39:07] so. the answer is indeed "move to ECS" ?
[20:41:29] I'd recommend that. ECS is a well-defined schema and has mitigations in place for this very issue.
[20:42:01] do we have any `service-runner` services emitting ECS?
[20:44:40] Good question. I don't know. The Wikifunctions and abstract wiki team has been moving some things over, but I don't know if they are based on service-runner.
[20:45:43] I'd love to hear the maintainership status of service-runner these days.
[20:46:15] bad news: basically unchanged
[20:46:30] good news: tchin and ottomata have been doing some nice work on a potential replacement for many use cases
[20:47:07] Nice!
[20:47:32] https://gitlab.wikimedia.org/repos/data-engineering/service-utils/
[20:48:51] amazing, it does ECS and prometheus ootb! <3
[20:50:24] cwhite: so what's the migration process? I assume I can't just add ecs.version without changing anything else; is there a quick way to test a set of logs for ECS compliance, or a staging pipeline, or?
[20:53:09] Technically you could just add ecs.version and rely only on the overlapping fields. There is no test for ECS compliance other than the presence of ecs.version. ECS messages will tell you which fields get explicitly dropped by the pipeline.
[20:53:51] deployment-prep runs the production logstash configuration for beta-logs
[20:55:19] The field reference is the source of truth for which fields you can use: https://doc.wikimedia.org/ecs/
[20:55:46] Note that upstream's documentation doesn't include the fields we've added for our own purposes. I'd rely on ^^
[20:57:18] ack
[20:57:20] so, our logstash filtering prunes non-ECS fields at the top level, but not at lower levels, is that right?
[20:58:08] That's correct. Deep inspection may come someday, but we've not seen the need yet.
[22:17:35] cwhite: ... how can I apply 15-filter-node.conf to a k8s service?
[22:19:07] cdanis: that filter is gated by "program", which historically was provided by rsyslog
[22:19:25] right
[22:19:41] I think it would work as-is on many service-runner services, since it was designed for that originally
[22:19:59] in k8s, that field is... something else. input-file-kubernetes I think?
[22:20:01] (although I haven't looked deeply at the overlap with the similar lines in 20-filter_syslog)
[22:21:55] There isn't much overlap there in 20-filter-syslog. 15-filter-node does the ECS conversion
[22:22:09] I have a potentially-interesting idea
[22:22:13] (20-filter-syslog predates 15-filter-node)
[22:35:57] cwhite: so, I need to actually test all this, but I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/1112295 could be quite interesting... k8s apps could just add that label to their pods and the captured logs would get auto-converted. although hm, we'd really need that on a per-container basis...
[22:37:15] let me know what you think of the overall idea :)
[22:51:32] a label or annotation on the pod could have the container name as a value
[22:56:08] I'll have to circle back with you. Gotta relocate now. Sorry!
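For reference, a minimal sketch of cwhite's suggestion above (just add ecs.version and rely on the overlapping fields) for a Node-style service that logs one JSON document per line to stdout, like the chart-service event quoted earlier. This is not the service-runner or service-utils implementation; the ecs.version value, the service name, and the use of dotted keys are illustrative assumptions to check against https://doc.wikimedia.org/ecs/.

```typescript
// Hypothetical sketch, assuming a plain Node/TypeScript service (not service-runner):
// emit one ECS-shaped JSON document per line on stdout. Per the discussion above,
// the logstash pipeline keys its ECS handling on the presence of ecs.version.
interface EcsLogEvent {
  '@timestamp': string;
  'ecs.version': string;   // the field the pipeline checks for ECS handling
  'log.level': string;
  message: string;
  'service.name': string;
  [extra: string]: unknown;
}

function logEcs(level: string, message: string, extra: Record<string, unknown> = {}): void {
  const event: EcsLogEvent = {
    '@timestamp': new Date().toISOString(),
    'ecs.version': '1.11.0',          // illustrative version string
    'log.level': level,
    message,
    'service.name': 'chart-service',  // illustrative service name
    ...extra,
  };
  // One JSON document per line on stdout, as captured from the container.
  process.stdout.write(JSON.stringify(event) + '\n');
}

// Example: roughly the "Ready to serve" event from the transcript, reshaped.
logEcs('info', 'Ready to serve', { 'server.port': 6284 });
```

Keeping custom data inside a small, fixed set of fields is what avoids the per-index field-count blowup (2048 fields) that sent the original chart-service message to the DLQ.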