[09:10:34] FWIW I agree with Cole, maybe this can be integrated into the blazegraph exporter; in general sometimes there's a straight answer and in other cases we'll have to do an evaluation on a case-by-case basis like the above
[09:20:06] jayme: I'm looking at your PR, is there a host where I can see the new version in action? if not that's fine too, I can use an sretest host I suppose
[09:36:17] godog: no. I just dumped some impstats data into a file and used that for testing
[09:36:37] godog: mw1640:/tmp/impstat*tee.log
[10:00:43] jayme: ack, thank you, I'll take a look at that too
[10:02:27] I also added this CR https://gerrit.wikimedia.org/r/c/operations/puppet/+/1005449 - do you think it would be fine to collect those metrics fleet-wide or should I add some toggle?
[10:03:02] should be fine yeah jayme, I'll have a more definite answer shortly
[10:04:32] ok. pcc is still running, as the selector comes up with >270 hosts 🙈
[10:09:22] ack, PR looks good to me, did a quick test on sretest1001
[10:23:05] godog: how do you gbp import new upstream versions in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/prometheus-rsyslog-exporter/ ? Is "change_upstream" the actual upstream branch that should be used?
[10:26:58] jayme: I'm looking at the history and I think the way to do it is to add the upstream remote, merge into master, and tag the upstream release with upstream/; then gbp should take care of the rest
[10:27:10] :-o
[10:28:17] longing for the day we have one, max two, ways of doing this stuff across all internal debian packaging
[10:33:28] actually the gbp workflow is fine when I choose change_upstream as the remote branch
[10:35:10] ah, no - it doesn't work, as change_upstream contains ./debian as well
[10:35:12] bummer
[11:00:30] yeah I think that branch shouldn't be there
[11:00:54] godog: I would "fix" this by branching gbp's upstream off of upstream/0.0.0+git20201008 (which does not have the debian stuff) and then gbp import-orig -u v1.0.0-8522c38 /tmp/prometheus-community-rsyslog_exporter-v1.0.0-4-g8522c38.tar.gz
[11:01:33] will also add a debian/README with instructions for future versions of us
[11:06:01] jayme: we haven't been importing upstream tarballs but rather have upstream history in 'master'; either way I'm fine since the whole thing is a mess anyways
[11:24:22] ofc I can't create branches because of some LOCK_FAILURE in gerrit...
[11:28:28] sigh
[13:09:46] godog: finally :) https://gerrit.wikimedia.org/r/c/operations/debs/prometheus-rsyslog-exporter/+/1005508
[13:11:30] (I did not delete the change_upstream branch)
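
The two gbp flavours discussed between 10:23 and 11:06, written out as a sketch. The upstream remote URL and the release tag are assumptions pieced together from the chat, so double-check them against the repo before running anything; the tarball import is essentially the command jayme quotes at 11:00:54.

    # option A - keep upstream history merged into master, as the repo
    # history suggests: add the upstream remote, merge the release tag,
    # then tag it in the upstream/<version> format gbp expects
    git remote add upstream https://github.com/prometheus-community/rsyslog_exporter.git
    git fetch upstream --tags
    git checkout master
    git merge v1.0.0                  # hypothetical upstream release tag
    git tag upstream/1.0.0 v1.0.0

    # option B - classic tarball import: point gbp's upstream branch at
    # the last debian-free tag, then import the new upstream tarball
    git branch upstream upstream/0.0.0+git20201008
    gbp import-orig -u v1.0.0-8522c38 \
        /tmp/prometheus-community-rsyslog_exporter-v1.0.0-4-g8522c38.tar.gz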
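
For the record, the file-based test mentioned at 09:36 can be reproduced roughly like this - a sketch assuming the exporter reads impstats lines on stdin (as in the usual omprog setup) and that :9104 is its listen address, both worth verifying against the version under test:

    # replay a previously captured impstats dump into the exporter; the
    # trailing "-" keeps stdin open so the process doesn't exit on EOF
    cat /tmp/impstats-tee.log - | ./rsyslog_exporter &

    # scrape the metrics endpoint once and eyeball the counters
    curl -s http://localhost:9104/metrics | grep '^rsyslog_'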
[13:23:36] (LogstashKafkaConsumerLag) firing: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:26:08] jayme: sweet! I'll take a look shortly
[13:26:22] let's see why the above is unhappy
[13:33:09] godog: I can imagine rolling out the rsyslog config patch restarted a bunch of rsyslogd processes that were actually in a failure state
[13:33:29] maybe that led to flushing a bunch of stuff when they came back
[13:33:36] (LogstashKafkaConsumerLag) firing: (2) Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:35:05] ah yeah of course that makes sense jayme, ok let's see if the backlog subsides
[13:35:19] in the short term very likely not, that is
[13:48:36] (LogstashKafkaConsumerLag) firing: (3) Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:53:36] (LogstashKafkaConsumerLag) firing: (3) Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:58:36] (LogstashKafkaConsumerLag) firing: (3) Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:02:37] should be recovering ~soon
[14:03:36] (LogstashKafkaConsumerLag) resolved: (3) Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:22:30] godog: fixed the version number, updated the README accordingly, and made CI happy - if you have another minute ;)
[15:24:48] jayme: sure! looking
[15:45:27] Thanks cwhite and godog. Will discuss with my team re: integrating into the existing exporter
[15:58:56] godog: thanks. I'll build and roll out the package on a couple of k8s nodes today, fleet-wide update tomorrow if all goes well
[16:00:02] jayme: \o/ great, thank you
[18:50:23] hi folks. I was wondering if we have a defined policy that new Icinga checks are not allowed and everything should be in Prometheus?
[18:50:52] I am asking because there is one WIP patch and some other existing Icinga checks, and I wanted to know what the timeline/deadline is for moving those
[19:00:21] the patch in question is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1005140 but we have tons of Icinga checks all over Traffic, and I want to know if this should be prioritized in case we missed an email :>
[19:35:08] Hi sukhe, while not a defined policy _per se_, we do plan on gradually reducing the number of Icinga checks and using Alertmanager as the place where all alerts are sent, as specified in our alerting infrastructure roadmap (https://upload.wikimedia.org/wikipedia/labs/0/0a/Alerting_Infrastructure_design_document_%26_roadmap.pdf).
[19:36:11] thanks for this denisse, I will read
[19:36:18] For this reason creating new checks using Prometheus and Alertmanager is preferred.
[19:36:18] I'll attach the documentation for Alertmanager onboarding. Don't hesitate to ask if there's anything else we can help you with. https://wikitech.wikimedia.org/wiki/Alertmanager#I'm_part_of_a_new_team_that_needs_onboarding_to_Alertmanager,_what_do_I_need_to_do?
[19:36:25] thanks!
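
On the consumer-lag episode between 13:23 and 14:03, one way to spot-check lag by hand while waiting for a backlog to drain. The server URL and metric name below are guesses based on the alert's labels, not the actual rule definition, so match them against whatever LogstashKafkaConsumerLag really queries:

    # instant query against the (assumed) codfw ops Prometheus; adjust the
    # metric to the one the LogstashKafkaConsumerLag rule actually uses
    promtool query instant http://prometheus.svc.codfw.wmnet/ops \
        'sum(kafka_burrow_partition_lag{group="logstash7-codfw"})'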
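
Since the preference above is to create new checks with Prometheus and Alertmanager rather than Icinga, a minimal sketch of what such a rule looks like, validated with promtool. Every name, label, and threshold here is a placeholder, not an existing rule:

    # a throwaway alerting rule of the kind that replaces an Icinga check
    cat > example_alerts.yaml <<'EOF'
    groups:
      - name: traffic_example
        rules:
          - alert: ExampleServiceDown
            expr: up{job="example"} == 0
            for: 5m
            labels:
              team: traffic      # assumed to drive Alertmanager routing
              severity: critical
            annotations:
              summary: '{{ $labels.instance }} is down'
    EOF

    # syntax-check the rule file before pushing it through CI
    promtool check rules example_alerts.yaml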