[13:20:34] dhinus: I have a (daily) question about kernel error alerts :) The alerting rule is
[13:21:07] oh wait, I misread the regex, I just answered my own question
[13:21:16] but I still don't quite understand why 1041 is alerting
[13:22:01] where does the 'category' in expr: max_over_time(kernel_messages{category=~"keyword_panic|priority_(emerg|alert|crit|err)"}[24h]) > 0 come from?
[13:26:21] seems like it's alerting because the log rotated, that doesn't seem right
[13:29:14] andrewbogott: that comes from the shell script that writes the .prom file
[13:29:33] it filters messages checking some keywords and the message priority
[13:29:59] oooh ok so it's not really intrinsic to the log message, it's a judgement call made by the collector
[13:30:25] kind of, so in alertmanager you can see the alert for 1041 has category "priority_warning"
[13:30:46] that means the collector found at least 1 message with priority=warning
[13:31:00] you can see those with "journalctl -k -pwarning..warning"
[13:31:14] you can also check the script itself to see how all categories are assigned
[13:31:30] but yes, the "category" is a concept introduced by the collector
[13:32:25] I named the categories with prefixes priority_ and keyword_ to make it easier to understand where the categorization comes from
[13:33:09] ok, I'm following now
[13:33:14] so the only message it really cares about is
[13:33:15] hrtimer: interrupt took 14641 ns
[13:33:25] also, arturo was suggesting we should just ignore "priority_warning", although I think that category did find something useful yesterday :)
[13:33:41] maybe we could introduce a new "keyword_error" category?
[13:33:58] [447573.805206] systemd-journald[1001]: /var/log/journal/e683928cdf724c658192ea076ca65dff/system.journal: Journal header limits reached or header out-of-date, rotating.
[13:33:58] [448291.901776] Process accounting resumed
[13:33:58] [472264.187358] hrtimer: interrupt took 14641 ns
[13:34:24] Seems worth ignoring
[13:34:53] was the alert on 1047 priority_warning?
[13:35:13] I think so, but it contained the string "Error", unfortunately the priorities are not always assigned in a good way :/
[13:35:37] yeah, seems like if the log is actually saying 'Error' we want to know about it
[13:36:09] of course that could trigger all sorts of false positives too :D
[13:36:18] but I think it's worth a try
[13:36:32] yeah. and we can't scan for 'err' because the log says lots of things like 'preferred'
[13:39:10] would ignoring warnings solve the alert-after-reboot thing?
[13:40:43] after reboot we have both priority_warning and priority_error, but the priority_error ones are fewer and are now filtered out
[13:41:10] so on the latest reboot only the "priority_warning" alerts fired, and if we remove those, no alerts should fire
[13:41:39] the filtering is in /etc/prometheus/kernel-messages-ignore-regex.txt
[13:41:59] but the warning messages are so varied it's hard to filter them in the same way
[13:43:40] dang
[13:44:11] The alerts on reboot seem so bad, they train us to ignore those alerts. So I think maybe ignoring all warnings is worth it if it solves that.
[13:44:26] but as you said, some warnings might be of interest
[13:44:47] I assume that if 1047 hadn't been able to correct the issue it would've errored instead of warning, but by that time we'd probably be too late
[13:53:05] I think it's worth a shot adding keyword_error, and removing the alert on priority_warning
[13:53:09] I'm sending a patch
[13:53:23] hopefully that will get us to zero alerts on reboots
[13:54:25] sounds good!
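(editor's note) A rough sketch of the categorization idea discussed above, to make the keyword_error proposal concrete. The real collector is a shell script that writes a .prom file and is not shown in this log, so everything below is an assumption made for illustration: the function names categorize and should_alert, the exact keyword regexes, and the sample messages in the usage note are all hypothetical. Only the category prefixes priority_ and keyword_ and the alert expression come from the conversation.

```python
import re
from typing import Set

# Hypothetical keyword rules; the real collector script may differ.
# Matching the whole word "error" avoids the problem mentioned above,
# where scanning for "err" would also hit words like "preferred".
PANIC_RE = re.compile(r"\bpanic\b", re.IGNORECASE)
ERROR_RE = re.compile(r"\berror\b", re.IGNORECASE)

# Proposed alert filter: the category regex from the alerting rule,
# extended with keyword_error (priority_warning is not included).
ALERT_RE = re.compile(r"keyword_panic|keyword_error|priority_(emerg|alert|crit|err)")


def categorize(priority: str, message: str) -> Set[str]:
    """Return the categories a collector might assign to one kernel message.

    `priority` is a journal priority name such as "warning" or "err".
    """
    categories = {"priority_" + priority}
    if PANIC_RE.search(message):
        categories.add("keyword_panic")
    if ERROR_RE.search(message):
        categories.add("keyword_error")
    return categories


def should_alert(categories: Set[str]) -> bool:
    """True if any category fully matches the alert regex.

    fullmatch mirrors Prometheus, where =~ matchers are anchored.
    """
    return any(ALERT_RE.fullmatch(c) for c in categories)
```

With these assumed rules, the hrtimer warning pasted above would get only priority_warning and would not alert, while a warning-priority message containing the word "Error" (for example a made-up "device reported Error, retrying") would also get keyword_error and would alert.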
[14:06:29] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1119123
[14:42:00] follow-up: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1118547
[14:42:32] arturo: that's your patch from yesterday, I fixed the tests
[16:09:22] thanks
[19:35:25] On lima-kilo localhost:30003 is accessible from the host system and gives a little message. Making a similar nodeport service doesn't seem to immediately act the same way. Is there something that can be run or modified to add another service that is accessible on localhost from the host system?
[20:04:43] Rook: that's defined in the kind config file iirc, so it's not dynamic no, you might be able to add things there if you need
[20:05:04] configured by ansible: playbooks/vars/kind.yaml
[20:05:23] but you can manually tweak it for testing from within the lima-kilo vm
[20:06:01] I'll look at the playbook. How does one manually tweak it while running?
[20:07:40] I think it lives in the home dir, you might need to recreate the kind container :/
[20:08:27] https://stackoverflow.com/questions/68257934/edit-extraportmappings-in-kind-cluster
[20:08:56] so maybe the "easiest" (but slow) might be to change your ansible values locally, and recreate lima-kilo :/, not great
[20:10:53] hmm... might be possible to tweak the container directly somehow and just add the port there
[20:11:12] https://www.irccloud.com/pastebin/iUbklUM9/
[20:12:30] Sounds good. I'll see what I can do. Thanks!
[20:14:10] if you find a nice way might be good to add it to the readme :)