[07:29:12] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:16:16] Hi folks, could I kindly request two quick reviews?
[10:16:18] - https://gerrit.wikimedia.org/r/c/operations/puppet/+/1196939/comments/fc8060b2_48004837
[10:16:20] - https://gerrit.wikimedia.org/r/c/operations/alerts/+/1184039
[10:16:22] Thank you!
[10:23:12] where can we see the checks for the haproxy?
[10:24:42] and slave_status_seconds_behind_master should never be used as a check for replication, ever
[10:27:32] maybe it is time to remove that check
[10:27:37] > where can we see the checks for the haproxy?
[10:27:39] https://w.wiki/FkyT
[10:27:50] thanks
[10:28:44] > and slave_status_seconds_behind_master should never be used as a check for replication, ever
[10:28:50] This one is a one-to-one port of the Icinga check.
[10:29:07] but the icinga check uses prometheus
[10:29:23] is this just moving it from nagios to alertmanager?
[10:29:26] yes
[10:29:46] that's the plan
[10:30:14] ok then with the technical move (the part I can review), but I'd suggest the dbas drop it rather than port it
[10:31:40] the other can go now
[10:32:12] We should not use slave_status_seconds_behind_master
[10:32:31] it is the sustained alert, which is why I am not sure it is useful
[10:32:55] at least not with the current setup, maybe it can be made better in the long term
[10:33:09] but leaving that decision to you
[10:33:24] Maybe I can move the check to Prometheus/Alertmanager, and then once you’ve discussed it, you can remove (or reshape) the checks. Does that work for you?
[10:37:53] jynus: Quick question regarding https://gerrit.wikimedia.org/r/c/operations/puppet/+/1196939: should the alert notifications still be routed to SRE, or would data-persistence be a better option?
[10:38:35] good question, which I cannot answer
[10:38:51] because mostly that alert is for users in general
[10:39:10] but I guess that can go to d-p
[10:39:26] thx jynus
[11:41:17] I am in the middle of the transfer.py refactor and I may not look at IRC often, feel free to ping me if you need my attention
[12:54:12] FIRING: SystemdUnitFailed: puppet-agent-timer.service on backupmon1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:55:56] jynus: ^-- expected?
[12:56:12] not at all
[12:58:23] the only thing related to puppet I can think of is tappof's patch
[12:58:54] yeah, I'm going to check
[13:00:05] but puppet runs ok there
[13:00:35] puppet-agent-timer.service: A process of this unit has been killed by the OOM killer.
[13:00:46] puppet now requiring a lot of memory?
[13:01:18] but it is not particularly loaded
[13:03:33] how does the exporter work, will it create 1 job per check?
[13:03:42] yes
[13:03:48] that won't scale
[13:03:58] there are like 30 checks on that small vm
[13:04:12] RESOLVED: SystemdUnitFailed: puppet-agent-timer.service on backupmon1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:05:03] that's not how nrpe worked - jobs were collected and aggregated
[13:05:59] it is ok for now, but that needs rethinking: https://phabricator.wikimedia.org/P84201
[13:06:40] yeah, I've seen it; in this case there are a lot of timers configured ...
[13:07:46] let me check the load
[13:09:22] yeah, it is also a bit too much, all timers run at the same time
[13:10:05] although I think it was just the initial setup
[13:10:09] we can survive
[13:10:34] but precisely those checks we don't need every minute
[13:10:44] I believe they used to be every 1 hour or so
[13:11:04] yeah, this time we ran into trouble with the puppet-agent-timer managing 60 new resources... but as you said, we’ll definitely keep an eye on the load..
[13:11:12] maybe we can run them every hour and cache them?
[13:11:20] let me check
[13:11:44] backup alerts don't require fast actionables
[13:12:43] yeah, they were originally set every 30 minutes
[13:12:47] with 3 retries
[13:14:05] so it took 32 minutes on Icinga to trigger an alert..
[13:15:10] yeah, we don't care about that, it is more of "to be looked at the following day"
[13:16:08] but if it has been red for a week, it is really bad, because we lost all backups since it triggered
[13:19:37] Ok, I think in this case I can tune the timers to run once an hour.. I should also look for a way to "serialize" them but I'm not sure if that's possible..
[13:19:51] it's ok
[13:20:05] as long as it works and doesn't mess with puppet runs
[13:20:22] we should work on making it more native in the future
[13:20:43] better than trying to make it better
[13:21:15] I worry that it could overload if 60 connections run at the same time
[13:21:32] can we add some randomization to the timing?
[13:21:53] overload the db
[13:21:53] RandomizedDelaySec=
[13:21:58] Yes
[13:22:05] I think that's the best option
[13:22:13] to avoid overloads
[13:22:16] and the easiest
[13:22:24] yes ... I'll send you the patch ... thanks
[13:22:38] let me check m1
[13:22:45] we will be able to see the overload there
[13:24:53] systemd::timer::job supports fixed_random_delay and splay as parameters, so it can be done quickly..
[13:32:33] tappof: I need to go back to python, but feel free to keep me updated/ask for reviews asynchronously here or on tickets/gerrit, etc.
[13:32:52] thanks jynus
[13:33:20] m1 seems fine, but prometheus may not have the granularity to see 30 simultaneous connections to the db
[13:33:36] so it would be nice to have some randomness to that
[13:44:16] jynus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197655
[13:48:09] looks fine to me, but I don't know enough about the timers puppetization or systemd to approve, or whether it could negatively affect other checks
[13:52:29] yeah jynus, I've just got a +1 from Keith. Anyway, those checks are not really in production yet.. The rules exist both in Icinga and Prometheus, and the latter sends alerts only to Observability with info priority. It’s kind of a WIP.
[13:57:35] merging... a new burst of resource changes will happen on backupmon soon...
[14:00:06] ack
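
Editor's note on the 10:28-10:33 exchange: a one-to-one Prometheus port of the Icinga replication-lag check would roughly take the shape below in an operations/alerts rules file. This is only a sketch, not the rule under review in change 1184039; the alert name, threshold, duration and labels are illustrative assumptions. mysql_slave_status_seconds_behind_master is the metric exported by prometheus-mysqld-exporter, and the "for:" clause is the "sustained" behaviour questioned at 10:32.

    groups:
      - name: mariadb_replication
        rules:
          - alert: MysqlReplicationLag            # hypothetical alert name
            # 300s and 10m are placeholder values, not the production thresholds
            expr: mysql_slave_status_seconds_behind_master > 300
            for: 10m                              # the "sustained" part of the alert
            labels:
              severity: warning
              team: data-persistence               # routing per the 10:39 discussion
            annotations:
              summary: "MariaDB replication lag on {{ $labels.instance }}"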
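
Editor's note on the timer randomization agreed around 13:21-13:24: a minimal Puppet sketch of what the follow-up patch could look like. Only the fixed_random_delay and splay parameter names come from the chat; the resource title, command, user and exact interval syntax are hypothetical assumptions about the local systemd::timer::job wrapper, and splay presumably translates to RandomizedDelaySec= on the generated timer unit.

    systemd::timer::job { 'check_backup_freshness':      # hypothetical check name
        description        => 'Export backup freshness state for Prometheus',
        command            => '/usr/local/bin/check-backup-freshness',  # placeholder script
        user               => 'prometheus',
        # run hourly instead of every minute, as agreed at 13:19
        interval           => {'start' => 'OnCalendar', 'interval' => '*-*-* *:00:00'},
        # spread the ~60 timers over up to 15 minutes so they do not all
        # hit the database at once
        splay              => 900,
        # keep each unit's random delay stable across activations (FixedRandomDelay=)
        fixed_random_delay => true,
    }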