[08:49:59] jayme: objections to just restarting rsyslog on kubestage?
[08:50:30] godog: jelto (ping) is looking at this currently AFAIK
[08:51:20] ah ok, thanks jayme / jelto, I'll leave it alone for now; ping me / LMK how it goes
[08:54:53] godog: restarting rsyslog on kubestage was also something I thought about. However, if this is not time critical/blocking I would like to get some more insight into why this is happening
[08:58:07] jelto: agreed, I don't think it is super critical, but I'm not a kubestage user either
[08:58:40] at any rate, let me know how it goes and I can help
[08:58:52] s/and I/and if I/
[09:24:21] jelto: as it seems both nodes are equally broken, you could restart rsyslog on one (to see if it helps) and keep the other one for further debugging
[09:28:01] yes, that's true, kubernetes logs from kubestage1001 and kubestage1002 are both missing. I can try to restart rsyslog on kubestage1001. However, there is also a systemd timer which restarts rsyslog every day, and it should have restarted it 21h ago
[09:39:21] oh
[09:39:26] godog: I restarted rsyslog on kubestage1001 and logs are appearing in logstash again. I captured the state of rsyslogd. I can do some more debugging on kubestage1002 later in the day. But I think it makes more sense to check why the restart systemd timer for rsyslogd doesn't work (as this is a known problem: https://wikitech.wikimedia.org/wiki/Rsyslog)
[09:40:25] jelto: might be wise to create a phab ticket when you're back to have this documented
[09:45:26] jayme: I will do that when I'm back! If missing logs from kubestage1002 are urgent, feel free to restart rsyslogd there as well
[09:46:40] I think it's fine to leave it in that state for a bit. I'll quickly cordon 1002, so new stuff will be scheduled on 1001 and will have logs
[09:49:32] ack on leaving 1002 as is
[09:49:55] jelto: got it, the rsyslog restart timer is only active/needed on centrallog hosts though, not the whole fleet
[09:50:34] as in, rsyslog regularly getting stuck has been observed only on those hosts receiving all logs via TLS; normally rsyslog isn't restarted
[09:51:13] +1 also on capturing more of its state, a bummer of course that a restart "fixes" things
[10:17:17] ello! there's an icinga check that's currently failing erroneously (for lvs2009 pybal backends) that I would like to ack. If I ack it and the error changes in the future, will the ack expire?
[10:17:30] I'm concerned that acking the check will potentially hide real errors in the future
[10:26:02] hnowlan: good question, IIRC acks don't expire on status message change by default, but on recovery
[10:27:13] I have to run to lunch, bbl
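
A minimal sketch of the rsyslog steps discussed above (capture state first, then restart, and check whether a daily restart timer is actually installed on the host); the grep pattern is a guess, since the actual timer unit name on these hosts isn't given in the conversation:

    # capture state before restarting (read-only)
    systemctl status rsyslog
    journalctl -u rsyslog --since "24 hours ago" > /tmp/rsyslog-journal.txt
    rsyslogd -N1                        # validate the config without touching the running daemon

    # check whether a daily restart timer exists and when it last/next fires
    systemctl list-timers | grep -i rsyslog

    # restart once the state has been captured
    sudo systemctl restart rsyslog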
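
Cordoning the still-broken node (09:46:40) marks it unschedulable so new pods land on the node that has working logging; a sketch assuming kubectl access to the staging cluster, with the node name as reported by `kubectl get nodes`:

    kubectl cordon kubestage1002        # node name may be a FQDN depending on the cluster
    kubectl get nodes                   # cordoned node shows SchedulingDisabled
    # once logging on 1002 is debugged and fixed:
    kubectl uncordon kubestage1002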
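
On the acknowledgement question (10:17:17): in classic Icinga/Nagios an acknowledgement is cleared when the service recovers to OK, not when the plugin output text changes; a non-sticky ack is additionally cleared on any state change (e.g. CRITICAL to WARNING). A sketch using the external-command file; the command-file path and the service description are placeholders and may differ on the actual Icinga host:

    # ACKNOWLEDGE_SVC_PROBLEM;<host>;<service>;<sticky>;<notify>;<persistent>;<author>;<comment>
    # sticky=2 keeps the ack until the service returns to OK; sticky=0 drops it on any state change
    printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;lvs2009;PyBal backends health check;2;1;0;hnowlan;known false positive, see phab task\n' \
      "$(date +%s)" > /var/lib/icinga/rw/icinga.cmd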