[09:11:00] I found an interesting flaw in our backup metrics monitoring - stats for prometheus: https://grafana.wikimedia.org/d/413r2vbWk/bacula
[09:11:57] I send metrics there for the latest backup, as they are time-based metrics, so we can get a history of all backups produced
[09:12:48] the problem: full backups sometimes take so much time that, by the time they finish, an incremental backup is scheduled right after, shadowing the "last backup" metrics
[09:13:19] and because the incremental runs in less than 1 minute (the scraping interval), it hides all metrics for the full backup
[09:13:40] while I love prometheus, it is not the right model for non-time-based metrics
[09:13:58] so I will have to work around it by sending the latest metrics for each level of backup
[09:14:18] incrementals and fulls separately
[09:15:23] I'm not sure I see the issue - are you sending metrics timestamped with the start time of the backup, or the end time?
[09:16:03] scraping happens every minute
[09:16:22] so I send the state and data I have for the latest backup, finished or not
[09:17:20] Ah
[09:18:09] could you arrange to always send a metric when a backup finishes?
[09:18:10] so what happens is: scrape (nothing new), full finishes (with the final size, files backed up, errors if any, etc.), incremental runs in 0 seconds, scrape (sending data about the incremental only)
[09:18:51] because of the 1 minute granularity, what I sent was not enough
[09:19:10] and the stats for the full were more interesting than the incrementals
[09:19:47] Sure; I think arranging to send a metric when a backup finishes might be the answer? [I'm assuming this can be done with prometheus...]
[09:19:59] Emperor: that's not possible
[09:20:34] I mean, I have those metrics through other means (bacula client, logs, etc.)
[09:20:52] but our prometheus setup only allows for 1 minute metrics
[09:21:12] so I have to bend to that model
[09:21:50] in a push model I could send metrics only at the end, but prometheus is a pull-only model
[09:22:51] alternatively, I can send logs to opensearch and let grafana use only the logs, not prometheus
[09:23:11] logs are more flexible, and can do what you suggest
[09:24:37] not too worried, because in general monitoring options are plentiful; it's just that the nice graphs are not reliable until I fix how I send metrics
[09:32:24] there is a pushgateway - https://prometheus.io/docs/instrumenting/pushing/
[09:32:43] [which I guess would give you a separate set of metrics about completed jobs only, rather than about the running ones]
[09:34:56] that's setting up infra I don't want to maintain, it is just easier to split the metrics
[09:39:11] in any case, not a big deal, my point was more that prometheus is great, but doesn't fit well for all monitoring needs
[09:48:29] fair enough
[19:47:47] (SessionStoreOnNonDedicatedHost) firing: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[23:47:47] (SessionStoreOnNonDedicatedHost) firing: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
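
[Editor's note] As an illustration of the workaround discussed above (exposing the latest stats per backup level so a fast incremental does not overwrite the preceding full's series between scrapes), here is a minimal sketch using the Python prometheus_client library. The metric names, labels, port, and record_job helper are hypothetical examples, not the actual Wikimedia Bacula exporter.

    from prometheus_client import Gauge, start_http_server
    import time

    # One time series per client and backup level ("full" / "incremental"),
    # so a later incremental only updates its own labels and the full's
    # last-run stats stay visible to the next scrape.
    last_backup_bytes = Gauge(
        'bacula_last_backup_bytes',
        'Bytes written by the most recent completed job, per client and level',
        ['client', 'level'])
    last_backup_end = Gauge(
        'bacula_last_backup_end_timestamp_seconds',
        'Unix time at which the most recent job of this level finished',
        ['client', 'level'])

    def record_job(client, level, job_bytes, end_ts):
        # Called when a job completes; the level label keeps fulls and
        # incrementals as separate series.
        last_backup_bytes.labels(client=client, level=level).set(job_bytes)
        last_backup_end.labels(client=client, level=level).set(end_ts)

    if __name__ == '__main__':
        start_http_server(9133)  # hypothetical exporter port
        # A real exporter would poll the Bacula director/catalog instead
        # of hard-coding a sample job.
        record_job('db1001', 'full', 120 * 1024**3, time.time())
        while True:
            time.sleep(60)

With per-level series, a Grafana panel can select level="full" for the history of full backups no matter how quickly incrementals follow. The pushgateway mentioned at 09:32 would be the push-model alternative (pushing job stats on completion), at the cost of running and maintaining extra infrastructure.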