[09:44:14] good morning, as some SRE teams are starting to use alertmanager-based alerts I will need to add support for it to spicerack at some point, at least for the basic add/remove silences functionality.
[09:45:40] I'd like to know which API I should look at (I see they have both a v1 and a v2, with the v2 being at version 0.0.1), which endpoint I should hit (are silences replicated across DCs?), and whether I need anything wrt authn/z
[09:53:20] volans: the v2 API is the one to use; yes, silences are replicated. You can use alertmanager-<site>.wikimedia.org:9093, though auth is host-based ATM so we'll need to add the cumin hosts
[09:53:53] to give you an idea: https://www.robustperception.io/creating-alertmanager-silences-from-python
[09:54:46] yes, I had already found that blog post :)
[09:55:19] re: allowing new hosts, the value to change is profile::alertmanager::api::rw
[09:58:13] I misspoke, you can talk to apache directly on port 80, not port 9093
[09:58:23] ack, thanks. And what about selecting those? For the current use case, the reimage of a cp host, I see that the only related tag is instance: cp4021:9331
[09:58:43] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=cp4021
[09:59:25] yeah, silencing instance=~cp4021:.* will likely do the right thing
[10:00:14] I've been pondering though whether it makes more sense to strip the port from instance when sending out alerts, so that e.g. grouping is easier
[10:02:50] if the port might be useful maybe we could keep it as a separate tag, but yeah, it seems sane to make it easy to select all alerts pertaining to a certain host
[10:04:27] indeed, I'll file a task for the port stripping and document the API chat we just had
[10:04:45] thanks
[10:05:08] not sure if we should limit API access in RW mode to an allowlist
[10:08:13] I'm sure we should
[10:09:49] +1
[10:11:42] although if I curl http://alertmanager-eqiad.wikimedia.org/api/v2/ from cumin2002 I get a 403
[10:11:53] so maybe I misread what you said above about port 80
[10:14:07] ah yeah, the allowlist is still enforced by apache
[10:14:13] the ro/rw split, that is
[10:14:28] ack, good to know
[10:18:26] ok, minimal docs at https://wikitech.wikimedia.org/wiki/Alertmanager#How_do_I_access_the_API? feel free to change/integrate at will
[10:20:11] ack. And can I talk to either of the 2 endpoints with just one being ok, or is there a way to get a "master"?
[10:20:46] not really, no; you can talk to either, they are clustered underneath
[10:20:52] ack
[10:20:56] I'll add it to the doc
[10:21:10] thanks!
[10:22:40] {done}
[18:35:00] hi o11y friends, just thought I would leave you a nice bowl of code pasta you might enjoy: https://phabricator.wikimedia.org/P17330
[19:10:59] cdanis: interesting! what is the reasoning/benefit of using count_over_time and average latency in this case, as opposed to, say, bucketed latency metrics?
[19:11:30] herron: expediency :)
[19:11:34] something like (labels removed for readability) 100 * sum(mediawiki_http_requests_duration_bucket{le="10"}) / sum(mediawiki_http_requests_duration_count)
[19:11:40] yeah, I think that would be better
[19:12:02] I mostly wanted to figure out how to compute SLOs at all, independent of the 'underlying' metric
[19:12:22] (and without requiring recording rules up-front, so you can experiment with an SLO choice before encoding it that way)
[19:13:11] nice, makes sense
[19:13:13] it took some trial and error but I did figure out the right combination of subqueries and count_over_time and such
[19:16:49] ah yeah, I can totally see that. I like the in-query comments too
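[editor's note] The bucketed-latency variant suggested at 19:11:34 could look like the sketch below when computed over a reporting window. The metric names are from the chat; the 7d window and the `rate()` wrapping are my additions (needed to turn the raw counters into a ratio over a period), and label matchers are elided as in the original.

```promql
# Percentage of mediawiki requests completing within 10s over the last 7 days
100 * sum(rate(mediawiki_http_requests_duration_bucket{le="10"}[7d]))
    / sum(rate(mediawiki_http_requests_duration_count[7d]))
```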
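[editor's note] For reference, a minimal sketch of the silence workflow discussed above, in the spirit of the linked Robust Perception post: POST a v2 silence whose `instance` regex matcher catches every alert for a host (since `instance` currently includes a port, e.g. `cp4021:9331`). The payload fields are the standard Alertmanager v2 silence schema; the helper names, author string, and 4h duration are illustrative placeholders, not spicerack's actual API.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

def build_silence(host, duration_hours=4, author="spicerack", comment="host reimage"):
    """Build an Alertmanager v2 silence payload matching all alerts for a host.

    The instance label currently includes a port (e.g. cp4021:9331), so a
    regex matcher "<host>:.*" is used to catch every exporter on the host.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    now = datetime.now(timezone.utc)
    return {
        "matchers": [{"name": "instance", "value": f"{host}:.*", "isRegex": True}],
        "startsAt": now.strftime(fmt),
        "endsAt": (now + timedelta(hours=duration_hours)).strftime(fmt),
        "createdBy": author,
        "comment": comment,
    }

def create_silence(host, api="http://alertmanager-eqiad.wikimedia.org/api/v2"):
    # Either site's endpoint works: the alertmanagers are clustered and
    # silences replicate. Access is host-based (the apache rw allowlist,
    # profile::alertmanager::api::rw), so this must run from an allowed
    # host such as the cumin servers.
    body = json.dumps(build_silence(host)).encode()
    req = urllib.request.Request(
        f"{api}/silences",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["silenceID"]
```

Deleting the returned `silenceID` via `DELETE /api/v2/silence/<id>` would cover the "remove silences" half of the use case.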