[08:49:59] jayme: objections to just restarting rsyslog on kubestage?
[08:50:30] godog: jelto (ping) is looking at this currently AFAIK
[08:51:20] ah ok, thanks jayme / jelto, I'll leave it alone for now; ping me / LMK how it goes
[08:54:53] godog: restarting rsyslog on kubestage was also something I thought about. However, if this is not time critical/blocking I would like to get some more insight into why this is happening
[08:58:07] jelto: agreed, I don't think it is super critical, but I'm not a kubestage user either
[08:58:40] at any rate, let me know how it goes and I can help
[08:58:52] s/and I/and if I/
[09:24:21] jelto: as it seems both nodes are equally broken, you could restart rsyslog on one (to see if it helps) and keep the other one for further debugging
[09:28:01] yes, that's true, kubernetes logs from kubestage1001 and kubestage1002 are both missing. I can try to restart rsyslog on kubestage1001. However, there is also a systemd timer which restarts rsyslog every day, and it should have restarted it 21h ago
[09:39:21] oh
[09:39:26] godog: I restarted rsyslog on kubestage1001 and logs are appearing in logstash again. I captured the state of rsyslogd. I can do some more debugging on kubestage1002 later in the day. But I think it makes more sense to check why the restart systemd timer for rsyslogd doesn't work (as this is a known problem: https://wikitech.wikimedia.org/wiki/Rsyslog)
[09:40:25] jelto: might be wise to create a phab ticket when you're back to have this documented
[09:45:26] jayme: I will do that when I'm back! If missing logs from kubestage1002 are urgent, feel free to restart rsyslogd there as well
[09:46:40] I think it's fine to leave it in that state for a bit. I'll quickly cordon 1002, so new stuff will be scheduled on 1001 and will have logs
[09:49:32] ack on leaving 1002 as is
[09:49:55] jelto: got it, the rsyslog restart timer is only active/needed on centrallog hosts though, not the whole fleet
[09:50:34] as in, rsyslog regularly getting stuck has been observed only on those hosts receiving all logs via TLS; normally rsyslog isn't restarted
[09:51:13] +1 also on capturing more of its state, a bummer of course that a restart "fixes" things
[10:17:17] ello! there's an icinga check that's currently failing erroneously (for lvs2009 pybal backends) that I would like to ack. If I ack it and the error changes in the future, will the ack expire?
[10:17:30] I'm concerned that acking the check will potentially hide real errors in the future
[10:26:02] hnowlan: good question, IIRC acks don't expire on status message change by default, but on recovery
[10:27:13] I have to run to lunch, bbl
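
A minimal sketch of the rsyslog steps discussed above (capture state first, then restart, and check whether a daily restart timer is actually installed on the host); the grep pattern is a guess, since the actual timer unit name on these hosts isn't given in the conversation:

    # capture state before restarting (read-only)
    systemctl status rsyslog
    journalctl -u rsyslog --since "24 hours ago" > /tmp/rsyslog-journal.txt
    rsyslogd -N1                        # validate the config without touching the running daemon

    # check whether a daily restart timer exists and when it last/next fires
    systemctl list-timers | grep -i rsyslog

    # restart once the state has been captured
    sudo systemctl restart rsyslog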
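
Cordoning the still-broken node (09:46:40) marks it unschedulable so new pods land on the node that has working logging; a sketch assuming kubectl access to the staging cluster, with the node name as reported by `kubectl get nodes`:

    kubectl cordon kubestage1002        # node name may be a FQDN depending on the cluster
    kubectl get nodes                   # cordoned node shows SchedulingDisabled
    # once logging on 1002 is debugged and fixed:
    kubectl uncordon kubestage1002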
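
On the acknowledgement question (10:17:17): in classic Icinga/Nagios an acknowledgement is cleared when the service recovers to OK, not when the plugin output text changes; a non-sticky ack is additionally cleared on any state change (e.g. CRITICAL to WARNING). A sketch using the external-command file; the command-file path and the service description are placeholders and may differ on the actual Icinga host:

    # ACKNOWLEDGE_SVC_PROBLEM;<host>;<service>;<sticky>;<notify>;<persistent>;<author>;<comment>
    # sticky=2 keeps the ack until the service returns to OK; sticky=0 drops it on any state change
    printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;lvs2009;PyBal backends health check;2;1;0;hnowlan;known false positive, see phab task\n' \
      "$(date +%s)" > /var/lib/icinga/rw/icinga.cmd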