[05:13:48] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on alert2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:22:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:28:48] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on alert2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[07:22:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) arclamp_compress_logs.service on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:17:48] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on alert1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[08:58:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (3) Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:02:48] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on alert1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:08:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (3) Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:09:35] <jinxer-wm>	 (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[09:14:35] <jinxer-wm>	 (ThanosSidecarNoConnectionToStartedPrometheus) resolved: (2) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[09:23:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (3) Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:32:12] <jayme>	 I would assume this is me rolling out the new rsyslog exporter to wikikube nodes - I refrained from doing a fleet-wide upgrade as of now because I'm a bit afraid that restarting all rsyslog daemons at more or less the same time will set something on fire down the line...
[09:33:13] <godog>	 thank you jayme, yeah I would assume so too
[09:33:35] <jayme>	 no idea how to do a slow rollout with debdeploy, though
[09:34:56] <godog>	 good question and me neither, though I think doing the same with cumin would work too
[09:35:12] <jayme>	 godog: I was wondering: Do you know what happens if the rsyslog-exporter process is killed during rsyslogd runtime?
[09:35:35] <jayme>	 maybe it would be enough to just do that instead of restarting rsyslog on package upgrades
[09:36:16] <godog>	 good question, I think/hope rsyslog would restart the process, worth a try on sretest1001
[09:37:29] <godog>	 I'll try
[09:38:07] <godog>	 jayme: yeah rsyslog restarts the exporter
[09:38:41] <jayme>	 hmm...sounds viable to just do that then...wdyt?
[09:38:54] <godog>	 sure SGTM jayme 
[09:53:46] <godog>	 jayme: saw your PR got merged, very nice
[09:54:03] <jayme>	 yeah \o/
[09:54:39] <volans>	 FYI although not ideal in this case, debdeploy can also be used with -Q/--query QUERY, which can be any cumin query, so you could have a bash loop that calls debdeploy multiple times with one or a few hosts at a time
[09:55:09] <volans>	 or send a patch to add support for the batch size, it's using cumin underneath ;)
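The batching volans describes can be sketched roughly as below. This is only an illustration under stated assumptions, not debdeploy's own behavior: the helper names, host names, and batch size are hypothetical, and the only flag taken from the discussion is `-Q`. The rest of the debdeploy invocation is left to the caller as `base_args`, and the sketch assumes a comma-separated host list is an acceptable cumin query.

```python
from typing import Iterator, List, Sequence


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `hosts` with at most `size` entries each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def build_commands(
    hosts: Sequence[str], size: int, base_args: Sequence[str]
) -> List[List[str]]:
    """Build one argv per batch, passing the batch to -Q as a host list.

    `base_args` is whatever debdeploy invocation the operator would normally
    run; it is supplied by the caller rather than guessed here (assumption).
    """
    return [
        list(base_args) + ["-Q", ",".join(batch)]
        for batch in batches(hosts, size)
    ]


if __name__ == "__main__":
    # Hypothetical host names, purely for illustration.
    example_hosts = ["host1001", "host1002", "host1003"]
    for argv in build_commands(example_hosts, 2, ["debdeploy"]):
        # Printing instead of executing keeps the sketch side-effect free;
        # a real loop would run each command and pause between batches.
        print(" ".join(argv))
```

Building the commands separately from running them makes it easy to review the batches before anything is restarted.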
[09:57:02] <godog>	 that'd be https://gerrit.wikimedia.org/r/c/operations/debs/debdeploy/+/719470 for anyone interested in reviving the patch
[10:01:11] <volans>	 ah right, sorry I totally forgot about that, why abandoned? the idea seems reasonable to me :)
[10:05:28] <godog>	 I was winter cleaning my gerrit queue
[10:15:48] <jayme>	 uuh :)
[10:16:04] <klausman>	 _allegedly_ spring is coming, then?
[10:16:27] <jayme>	 I've some long-standing debdeploy CR as well https://gerrit.wikimedia.org/r/c/operations/debs/debdeploy/+/643912  :D
[10:18:17] <klausman>	 I feel there is an SCCS joke in here somewhere.
[14:57:48] <jayme>	 godog: https://grafana-rw.wikimedia.org/d/KimNkFTIk/jayme-omkafka?orgId=1&from=1708591594185&to=1708593825943&var-datasource=thanos&var-site=codfw&var-cluster=kubernetes 
[14:58:30] <godog>	 jayme: very cool! thank you for putting that together
[14:58:36] <jayme>	 might be what a mw deployment looks like, have not checked - but impressive numbers :o
[14:59:00] <godog>	 lol indeed
[14:59:18] <jayme>	 I've added the basic data to the host rsyslog overview dashboard as well
[14:59:29] <godog>	 <3 <3
[14:59:47] <jayme>	 the one above is just as playground for me to maybe spot the issues in the k8s clusters
[15:00:33] <godog>	 yeah makes sense
[15:14:16] <jayme>	 unfortunately all timing metrics are always 0 in our fleet - maybe that's something that does not yet work with our rsyslog version...
[15:15:09] <jayme>	 rsyslog 8.28.0 also adds an option to resubmit logs to kafka in case of (recoverable) failures - which might be interesting
[15:17:09] <jayme>	 ah, kafka stats need to be explicitly enabled in librdkafka parameters...damn
[15:17:19] <godog>	 definitely, I'm confused on the version though as rsyslog switched to 8.YYMM scheme
[15:17:58] <jayme>	 dunno, that's what the docs say :D https://www.rsyslog.com/doc/configuration/modules/omkafka.html#resubmitonfailure
[15:18:21] <jayme>	 but I always get confused by rsyslog versions...so maybe they do as well :D
[15:18:39] <godog>	 ah got it, nevermind for some reason I thought that was added on one of the latest versions we're not running yet
[15:18:42] <godog>	 glad that's not the case
[15:51:08] <inflatador>	 Of amypme
[15:59:48] <inflatador>	 sorry! Meant to say, could anyone help me understand why this is failing CI? https://gerrit.wikimedia.org/r/c/operations/alerts/+/1005791
[16:04:55] <godog>	 inflatador: I checked the CI logs and the alert isn't firing anymore, the test value is now off by one so either that +1 or switch from > to >= in the alert
[16:08:07] <godog>	 also you should be able to switch to sth like 100 * 2^30 both in the alert and in the test, I'm not sure about the latter tho
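The off-by-one godog points out can be illustrated with a small sketch. It does not reproduce the actual alert expression or its test, which are not shown in this log; the function names are hypothetical, and the threshold simply uses the `100 * 2^30` form suggested above.

```python
# 100 GiB expressed as 100 * 2^30 bytes, as suggested in the discussion.
THRESHOLD_BYTES = 100 * 2**30


def fires_strict(value: int) -> bool:
    """Alert condition using a strict comparison (>)."""
    return value > THRESHOLD_BYTES


def fires_inclusive(value: int) -> bool:
    """Alert condition using an inclusive comparison (>=)."""
    return value >= THRESHOLD_BYTES


if __name__ == "__main__":
    # A test value exactly at the threshold only fires with >=.
    at_threshold = THRESHOLD_BYTES
    print(fires_strict(at_threshold), fires_inclusive(at_threshold))
    # Bumping the test value by one makes the strict form fire as well.
    print(fires_strict(at_threshold + 1))
```

Either fix from the discussion resolves the mismatch: raise the test value by one, or switch the comparison to `>=`.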
[16:19:24] <inflatador>	 godog ACK, thanks for taking a look
[16:23:46] <godog>	 sure np inflatador