[09:10:24] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10fgiunchedi) The code for silencing itself is merged now, I'd imagine there are other followu...
[11:06:50] 10Puppet, 10Infrastructure-Foundations, 10SRE: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (10fgiunchedi)
[11:21:24] 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: Spicerack: add support for Alertmanager - https://phabricator.wikimedia.org/T293209 (10Volans) >>! In T293209#7759313, @fgiunchedi wrote: > The code for silencing itself is merged...
[12:06:06] godog: I've deployed the new spicerack, testing it on cumin1001, all looks good so far
[12:06:14] I noticed a weird thing in the alertmanager UI though
[12:06:58] if I click on the silence icon to create a new one, and then click browse, I can see my just created silence, but the "View in Alertmanager" link is wrong:
[12:07:01] http://localhost:9093/#/silences/7304e981-d71e-49f1-b4b7-728b9ed1ac36
[12:09:47] the other small bit is that trying to delete an expired downtime gives 500 :/
[12:11:24] it returns a quoted string with: silence 7304e981-d71e-49f1-b4b7-728b9ed1ac36 already expired
[12:11:31] and status code 500
[13:19:48] volans: thanks for the release! appreciate it
[13:20:27] the non-working link is expected, alertmanager in debian doesn't ship the UI by default, though we could fix that
[13:21:01] a bummer though for the 500 on expired silences, I can see how that would be a problem with downtimed() for example
[13:28:22] godog: yes, I was thinking that maybe we could add a check that if it's a 500 and the returned json ends with "already expired" we just log it and go on without raising an exception
[13:28:26] thoughts?
[13:28:59] I think it's unusual that downtimed() is used with a duration that is shorter than the actual run, but it might totally happen
[13:29:32] volans: yeah special casing the expired downtime SGTM
[13:30:13] not sure if it's worth a bug upstream, I don't see how 500 can be a correct http status code
[13:30:43] agreed, I'd say more of a 400 than a 500
[13:35:00] godog: so, upon testing, the DELETE on the silence ID actually makes it expire immediately, and I guess that's ok so it stays in the history
[13:36:15] makes sense yeah
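A minimal sketch of the special-casing discussed above, assuming a plain Python `requests` client against Alertmanager's v2 HTTP API; the function name, URL constant and overall shape are illustrative and not Spicerack's actual interface:

```python
import logging

import requests

logger = logging.getLogger(__name__)

ALERTMANAGER_URL = 'http://localhost:9093'  # same instance as the UI link above


def delete_silence(silence_id: str) -> None:
    """Expire a silence, tolerating the case where it has already expired.

    Alertmanager answers DELETE /api/v2/silence/{id} for an already expired
    silence with HTTP 500 and a quoted-string body like
    '"silence <id> already expired"' plus a trailing newline, so the message
    text is matched here instead of the status code alone.
    """
    response = requests.delete(f'{ALERTMANAGER_URL}/api/v2/silence/{silence_id}')

    if response.status_code == 500 and 'already expired' in response.text:
        # Nothing to do: the silence is already out of the way.
        logger.info('silence %s already expired, skipping', silence_id)
        return

    # Any other error (including the 500 with "silence not found") still raises.
    response.raise_for_status()
```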
[13:36:25] but once deleted, it can't be re-enabled anymore. the UI has a "recreate" button that goes to the create form pre-filled, but it will be a new silence
[13:36:41] so I would expect 410 as the http code for the DELETE of an already expired silence
[13:36:51] 410 Gone (that is)
[13:37:53] agreed, I'd expect the same
[13:38:27] if I try to delete a random non-existent ID, I get the same 500 with: silence not found
[13:38:39] the response is again JSON with just a string, not a JSON object
[13:39:13] response.text => '"silence not found"\n' # to be clear
[13:39:42] so I guess for now we could match the end of the message and not alert on already deleted ones and raise in all other cases
[13:40:00] and ideally get upstream to improve their API responses
[13:40:43] +1 yeah, I can't find a specific issue already on https://github.com/prometheus/alertmanager/issues
[13:40:58] * volans was looking too
[13:42:28] godog: https://github.com/prometheus/alertmanager/blob/main/api/v2/openapi.yaml#L127
[13:46:26] lol, POST to /silences if it has an ID updates an existing silence, and gives back 404 if not found, so kind of them :)
[13:47:14] also GET /silence/$ID gives 404
[13:49:27] heheh
[13:49:58] from that description I'm guessing clients are supposed to check first
[13:50:49] that's an option too, GET first, if it doesn't exist log and skip, if it exists DELETE, but you still have the race condition of the silence that expires in between the GET and the DELETE :D
[13:51:30] no strong preference between the 2 approaches
[13:52:36] indeed, not great either way but IMHO we might as well match the 500's text and that's it, it isn't supposed to happen very often anyways
[13:53:03] yep
[13:56:47] volans: are you looking into catching the 500? I can, but starting mid next week after I'm back from OOO
[13:57:10] ah, if you're out sure, I can take care of it
[13:58:43] thanks! yeah I'll be out wed -> tue
[14:00:38] ack, no prob
[14:00:42] enjoy!
[14:10:00] ehehh cheers
[15:42:17] godog: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/769063
[15:43:51] volans: ack, taking a look
[15:59:34] godog: thanks for the quick review. I've also sent https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/769067 that should cover the downtime cookbook (the one used by reimage and the primary cause of alert noise)
[15:59:52] I will send other patches for all the other usages of icinga_hosts() later on
[16:02:51] the sre.hosts.remove-downtime one instead will need to wait for some additional feature in the module and some refactoring. TBD if it should require downtime IDs to delete or search for them based on the hostnames, basically filtering /silences
[16:11:30] volans: ack (in meeting, will take a look later)
[16:11:48] no prob
[16:29:52] nice re: downtime cookbook
[16:40:46] 10netops, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Milimetric) > Perhaps a way forward would be to find a way to serve those use cases by design instead of by accident....
[16:52:42] logging off, talk to you next week
[16:53:32] enjoy the time off!
[20:16:57] 10netops, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, 10User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10jbond) brandon also just pointed me to `git grep netmapper` and https://gerrit.wikimedia.org/g/operations/softw...
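For the hostname-based lookup mentioned for the sre.hosts.remove-downtime cookbook, "filtering /silences" would roughly mean listing GET /api/v2/silences and keeping the active ones whose matchers reference the host. A rough sketch under the assumption that downtime silences carry an `instance` matcher for the host (the real matcher names and values may differ):

```python
from typing import Any

import requests

ALERTMANAGER_URL = 'http://localhost:9093'  # illustrative, same as above


def find_host_silences(hostname: str) -> list[dict[str, Any]]:
    """Return the active silences whose matchers reference the given host."""
    response = requests.get(f'{ALERTMANAGER_URL}/api/v2/silences')
    response.raise_for_status()

    matching = []
    for silence in response.json():
        # Skip silences that are already expired or not yet active.
        if silence['status']['state'] != 'active':
            continue
        for matcher in silence['matchers']:
            # Assumption: the downtime is keyed on an 'instance' matcher.
            if matcher['name'] == 'instance' and hostname in matcher['value']:
                matching.append(silence)
                break

    return matching
```

Each returned silence carries its `id`, so the cookbook could then hand those IDs to the delete path above instead of requiring the caller to pass downtime IDs explicitly.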