[09:33:40] hi folks... o11y question for you. I got a fairly simple python script that fetches the xlab experiments config from CDN servers. Usually I'd use the prometheus client library to write some metrics and let the node exporter expose them, but in this case the script runs once per minute, so I'm afraid the scraping frequency won't be enough to fetch the data points. How could I expose some metrics about this without running a daemon that exposes the usual metrics endpoint?
[09:43:37] Hi @vgutierrez, the scrape interval for the node job uses the default scrape_interval, which is 1 minute
[09:44:24] tappof: yeah.. so I could lose some data points
[09:45:10] Yes... it might happen
[09:45:50] the pushgateway job also has the same scrape interval
[09:46:49] so I don't think there's another way other than a dedicated job for your goal
[09:57:55] mmmh vgutierrez, you could lose some data points if the scraping job is overloaded. Otherwise, I think you'll get all the data you need, maybe just "59 seconds old"
[09:59:22] https://w.wiki/EJQA the scraping interval seems to be fairly consistent over time (I took a random node)
[10:04:41] yeah... I'll stop obsessing about the perfect solution, add some metrics, and complain later if I see some missing data :)
[10:07:24] Yeah, in that case, I think you'll need a dedicated daemon with its own scraping job, just like you said...
[13:05:11] Quick question: in https://gerrit.wikimedia.org/r/c/operations/alerts/+/1151200 should I just replace codfw with eqiad to make the tests pass, or is there something smarter?
[13:05:36] it fails because my example metric is codfw but there is an external_labels site: eqiad
[13:06:26] I'll take a look XioNoX
[13:07:37] <3
[13:25:55] XioNoX: AFAIK, the standard label always wins over the external label in case of overlap, so it will always be site="codfw" even if you define an external label site="eqiad" in your tests. I think you can replace externalLabels.site with labels.site in your dashboard link to make the tests pass, but please double-check that this solution fits your scenario...
[13:26:27] cool, yeah
[13:27:22] yeah, it makes sense to me
[13:27:36] hey o11y, I thought I started routing Elasticsearch shard alerts to data-platform only in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148402, but looking at https://alerts.wikimedia.org/?q=team%3Dsre it seems SRE is still getting these. Any suggestions?
[13:32:49] inflatador: I'll take a look
[13:35:38] Thanks, the answer might be in my comment: "I would include OpenSearch alerts as well, but I'm trying to stop the bleeding and I need to check with Observability first, since they use the same alerts"? Not sure
[13:38:29] for alerts imported from icinga to prometheus, the team= label is controlled by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/prometheus/icinga_exporter.yaml
[13:44:21] Thanks taavi, it sounds like I need to add elastic.* and cirrussearch.* to the data-platform stanza. Will get a patch started for that
[13:51:29] yes inflatador, thanks taavi. Another option could be to migrate those alerts to Prometheus/Alertmanager. We already have something in place, but the alert destinations are computed dynamically using the role_owner metric, so in this case the destination would be search-platform. You can take a look at team-sre/opensearch.yaml if you'd like.
[13:55:03] tappof: cool, I'll do that as well!
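For the once-per-minute script discussed at the top of the log, the pattern vgutierrez describes (prometheus client library plus node exporter) usually means the node_exporter textfile collector: the script writes a .prom file on each run and node_exporter keeps serving whatever it last wrote, so a 1-minute scrape still sees every run, at worst "59 seconds old". A minimal sketch, assuming hypothetical metric names and a textfile directory of /var/lib/prometheus/node.d (the real path depends on the host's --collector.textfile.directory setting):

```python
# Minimal sketch: publish metrics from a short-lived, once-per-minute script via
# node_exporter's textfile collector instead of running a dedicated daemon.
# Metric names and TEXTFILE_DIR are assumptions for illustration.
from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

TEXTFILE_DIR = "/var/lib/prometheus/node.d"  # assumed textfile collector directory

registry = CollectorRegistry()
last_success = Gauge(
    "xlab_experiments_fetch_last_success_timestamp_seconds",
    "Unix timestamp of the last successful config fetch",
    registry=registry,
)
fetch_duration = Gauge(
    "xlab_experiments_fetch_duration_seconds",
    "How long the last fetch took",
    registry=registry,
)

def record_fetch(elapsed_seconds: float) -> None:
    """Record the outcome of one fetch run and publish the metrics."""
    fetch_duration.set(elapsed_seconds)
    last_success.set_to_current_time()
    # write_to_textfile() writes to a temporary file and renames it into place,
    # so node_exporter never reads a half-written file.
    write_to_textfile(f"{TEXTFILE_DIR}/xlab_experiments.prom", registry)
```

Because the values persist in the file between runs, nothing is lost if a scrape and a script run don't line up; the scrape simply picks up the most recently written values.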
[15:40:08] hello o11y friends, question for you - I have a patch that changes scrape configs for the k8s prometheus instances (https://gerrit.wikimedia.org/r/1149505). was planning to:
[15:40:08] * disable puppet on the 4 associated prometheus hosts
[15:40:08] * pilot on a single host to verify prom does not sneeze at it
[15:40:08] * apply on the other host in the same DC and verify collection looks like what I expect on prometheus-{dc}.wm.o.
[15:40:09] * apply to the other 4 hosts
[15:40:24] err, s/4 hosts/2 hosts/
[15:40:55] my question: does this sound reasonable to you, or would you prefer I go about this in a different way? :)
[16:45:44] swfrench-wmf: That plan SGTM.
[16:49:07] denisse: great, thank you! I'll get started in a bit and flag when I'm done
[17:07:01] Hello 👋
[17:07:57] I had some questions around creating an SLO dashboard with Pyrra; who might be a good point of contact here? :)
[17:20:02] Hi ecarg, I think the best approach would be to ask in this channel and/or to create a Phabricator task with your questions and share it in here so the team can take a look.
[17:20:24] Thank you! Here is the task: https://phabricator.wikimedia.org/T394057
[17:20:53] lmk if I should tag differently; I just added sre-observability
[17:47:07] FYI, all done with my changes. no issues encountered, but I also flagged in #wikimedia-sre in case there are surprises. thanks again, d.enisse!
[17:49:03] swfrench-wmf: ACK, thank you!!
[17:49:54] ecarg: Hi Grace, I took a look at the task but I'm failing to understand what is required from the o11y side. Could you please point me in the right direction? Preferably in the task so others in the team can look at it too.
[17:51:27] Additionally, the task mentions a "recent update to SLO standards in the org (moving away from Grafana and to Pyrra)". I think it would be nice to point to a source for this standards change, as I was unable to find info about the update on Wikitech and it's important for fully understanding the context of the task.
[18:08:18] Ok, updated! Pls lmk if that is still unclear, and also if I've tagged the wrong team... The 'recent update to SLO standards' is something that was mentioned during meetings with external team members, so I'm not sure where that is in writing atm
[19:04:14] Hi! centrallog2002 is running out of disk space. Does someone have that on their radar? It looks like /srv's syslogs are taking up the majority of the space
[19:18:53] brett: I don't think so, there's no task open for it. I added it to our team meeting tomorrow, thanks for the heads-up!