[05:13:48] (PuppetZeroResources) firing: Puppet has failed generate resources on alert2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:22:25] (SystemdUnitFailed) firing: (2) arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:28:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on alert2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:22:25] (SystemdUnitFailed) resolved: (2) arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:17:48] (PuppetZeroResources) firing: Puppet has failed generate resources on alert1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[08:58:37] (LogstashKafkaConsumerLag) firing: (3) Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:02:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on alert1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:08:37] (LogstashKafkaConsumerLag) firing: (3) Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:09:35] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[09:14:35] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: (2) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[09:23:37] (LogstashKafkaConsumerLag) resolved: (3) Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:32:12] I would assume this is me rolling out the new rsyslog exporter to wikikube nodes - I refrained from doing a fleet-wide upgrade as of now because I'm a bit afraid that restarting all rsyslog daemons at more or less the same time will set something on fire down the line...
[09:33:13] thank you jayme, yeah I would assume so too
[09:33:35] no idea how to do a slow rollout with debdeploy, though
[09:34:56] good question and me neither, though I think doing the same with cumin would work too
[09:35:12] godog: I was wondering: Do you know what happens if the rsyslog-exporter process is killed during rsyslogd runtime?
[09:35:35] maybe it would be enough to just do that instead of restarting rsyslog on package upgrades
[09:36:16] good question, I think/hope rsyslog would restart the process, worth a try on sretest1001
[09:37:29] I'll try
[09:38:07] jayme: yeah rsyslog restarts the exporter
[09:38:41] hmm...sounds viable to just do that then...wdyt?
[09:38:54] sure SGTM jayme
[09:53:46] jayme: saw your PR got merged, very nice
[09:54:03] yeah \o/
[09:54:39] FYI although not ideal in this case, debdeploy can also be used with -Q/--query QUERY that can be any cumin query, so you could have a bash loop that calls debdeploy multiple times with one or few hosts at a time
[09:55:09] or send a patch to add support for the batch size, it's using cumin underneath ;)
[09:57:02] that'd be https://gerrit.wikimedia.org/r/c/operations/debs/debdeploy/+/719470 for anyone interested in reviving the patch
[10:01:11] ah right, sorry I totally forgot about that, why abandoned? the idea seems reasonable to me :)
[10:05:28] I was winter cleaning my gerrit queue
[10:15:48] uuh :)
[10:16:04] _allegedly_ spring is coming, then?
[10:16:27] I've some long-standing debdeploy CR as well https://gerrit.wikimedia.org/r/c/operations/debs/debdeploy/+/643912 :D
[10:18:17] I feel there is an SCCS joke in here somewhere.
[14:57:48] godog: https://grafana-rw.wikimedia.org/d/KimNkFTIk/jayme-omkafka?orgId=1&from=1708591594185&to=1708593825943&var-datasource=thanos&var-site=codfw&var-cluster=kubernetes
[14:58:30] jayme: very cool! thank you for putting that together
[14:58:36] might be what a mw deployment looks like, have not checked - but impressive numbers :o
[14:59:00] lol indeed
[14:59:18] I've added the basic data to the host rsyslog overview dashboard as well
[14:59:29] <3 <3
[14:59:47] the one above is just a playground for me to maybe spot the issues in the k8s clusters
[15:00:33] yeah makes sense
[15:14:16] unfortunately all timing metrics are always 0 in our fleet - maybe that's something that does not yet work with our rsyslog version...
[15:15:09] rsyslog 8.28.0 also adds an option to resubmit logs to kafka in case of (recoverable) failures - which might be interesting
[15:17:09] ah, kafka stats need to be explicitly enabled in librdkafka parameters...damn
[15:17:19] definitely, I'm confused on the version though as rsyslog switched to 8.YYMM scheme
[15:17:58] dunno, that's what the docs say :D https://www.rsyslog.com/doc/configuration/modules/omkafka.html#resubmitonfailure
[15:18:21] but I always get confused by rsyslog versions...so maybe they do as well :D
[15:18:39] ah got it, nevermind, for some reason I thought that was added in one of the latest versions we're not running yet
[15:18:42] glad that's not the case
[15:51:08] Of amypme
[15:59:48] sorry! Meant to say, could anyone help me understand why this is failing CI? https://gerrit.wikimedia.org/r/c/operations/alerts/+/1005791
[16:04:55] inflatador: I checked the CI logs and the alert isn't firing anymore, the test value is now off by one so either that +1 or switch from > to >= in the alert
[16:08:07] also you should be able to switch to sth like 100 * 2^30 both in the alert and in the test, I'm not sure about the latter tho
[16:19:24] godog ACK, thanks for taking a look
[16:23:46] sure np inflatador
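
Editor's note on the CI failure discussed at 16:04:55 and 16:08:07: the off-by-one can be illustrated with a minimal Python sketch. It assumes the alert compares a byte count against a 100 GiB threshold; the actual expression in the Gerrit change is not shown in the log, so `THRESHOLD_BYTES` and both function names are hypothetical. Note that `^` is the power operator in PromQL, while Python spells it `**` (`^` is XOR in Python).

```python
# Hypothetical threshold: 100 GiB in bytes, i.e. "100 * 2^30" in PromQL notation.
# Python uses ** for exponentiation; ^ would be bitwise XOR here.
THRESHOLD_BYTES = 100 * 2**30


def fires_strict(value: int) -> bool:
    """Alert condition written with '>' (assumed form of the alert)."""
    return value > THRESHOLD_BYTES


def fires_inclusive(value: int) -> bool:
    """Alert condition rewritten with '>='."""
    return value >= THRESHOLD_BYTES


if __name__ == "__main__":
    sample = THRESHOLD_BYTES  # a test value sitting exactly on the threshold
    print(THRESHOLD_BYTES)             # 107374182400
    print(fires_strict(sample))        # False: '>' does not fire on equality
    print(fires_strict(sample + 1))    # True: bump the test value by one
    print(fires_inclusive(sample))     # True: or relax the comparison
```

Either fix makes a test sample that sits exactly on the threshold trigger the alert. Which one is appropriate depends on whether the boundary value itself should count as alerting.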