[09:44:14] good morning, as some SRE teams are starting to use alertmanager-based alerts I will need to add support for it to spicerack at some point, at least for the basic add/remove silences functionality.
[09:45:40] I'd like to know which API I should look at (I see they have both a v1 and a v2, with the v2 being at version 0.0.1), which endpoint I should hit (are silences replicated across DCs?), and whether I need anything wrt authn/z
[09:53:20] volans: the v2 API is the one to use; yes, silences are replicated. You can use alertmanager-<site>.wikimedia.org:9093, though auth is host-based ATM so we'll need to add the cumin hosts
[09:53:53] to give you an idea: https://www.robustperception.io/creating-alertmanager-silences-from-python
[09:54:46] yes, I had already found that blog post :)
[09:55:19] re: allowing new hosts, the value to change is profile::alertmanager::api::rw
[09:58:13] I misspoke, you can talk to apache directly on port 80, not port 9093
[09:58:23] ack, thanks. And what about selecting those? For the current use case, the reimage of a cp host, I see that the only related tag is instance: cp4021:9331
[09:58:43] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=cp4021
[09:59:25] yeah, silencing instance=~cp4021:.* will likely do the right thing
[10:00:14] I've been pondering though whether it makes more sense to strip the port from instance when sending out alerts, so that e.g. grouping is easier
[10:02:50] if the port might be useful maybe we could keep it as a separate tag, but yeah, it seems sane to make it easy to select all alerts pertaining to a certain host
[10:04:27] indeed, I'll file a task for the port stripping and document the API chat we just had
[10:04:45] thanks
[10:05:08] not sure if we should limit API access in RW mode to an allowlist
[10:08:13] I'm sure we should
[10:09:49] +1
[10:11:42] although if I curl http://alertmanager-eqiad.wikimedia.org/api/v2/ from cumin2002 I get a 403
[10:11:53] so maybe I misread what you said above about port 80
[10:14:07] ah yeah, the allowlist is still enforced by apache
[10:14:13] the ro/rw split, that is
[10:14:28] ack, good to know
[10:18:26] ok, minimal docs at https://wikitech.wikimedia.org/wiki/Alertmanager#How_do_I_access_the_API? feel free to change/integrate at will
[10:20:11] ack. And can I talk to either of the 2 endpoints with just one being ok, or is there a way to get a "master"?
[10:20:46] not really, no; you can talk to either, they are clustered underneath
[10:20:52] ack
[10:20:56] I'll add it to the doc
[10:21:10] thanks!
[10:22:40] {done}
[18:35:00] hi o11y friends, just thought I would leave you a nice bowl of code pasta you might enjoy: https://phabricator.wikimedia.org/P17330
[19:10:59] cdanis: interesting! what is the reasoning/benefit of using count_over_time and average latency in this case, as opposed to, say, bucketed latency metrics?
[19:11:30] herron: expediency :)
[19:11:34] something like (labels removed for readability) 100 * sum(mediawiki_http_requests_duration_bucket{le="10"}) / sum(mediawiki_http_requests_duration_count)
[19:11:40] yeah, I think that would be better
[19:12:02] I mostly wanted to figure out how to compute SLOs at all, independent of the 'underlying' metric
[19:12:22] (and without requiring recording rules up-front, so you can experiment with an SLO choice before encoding it that way)
[19:13:11] nice, makes sense
[19:13:13] it took some trial and error but I did figure out the right combination of subqueries and count_over_time and such
[19:16:49] ah yeah, I can totally see that. I like the in-query comments too
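[editor's note] The bucketed-latency variant suggested at 19:11:34 could look like the sketch below when computed over a reporting window. The metric names are from the chat; the 7d window and the `rate()` wrapping are my additions (needed to turn the raw counters into a ratio over a period), and label matchers are elided as in the original.

```promql
# Percentage of mediawiki requests completing within 10s over the last 7 days
100 * sum(rate(mediawiki_http_requests_duration_bucket{le="10"}[7d]))
    / sum(rate(mediawiki_http_requests_duration_count[7d]))
```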
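[editor's note] For reference, a minimal sketch of the silence workflow discussed above, in the spirit of the linked Robust Perception post: POST a v2 silence whose `instance` regex matcher catches every alert for a host (since `instance` currently includes a port, e.g. `cp4021:9331`). The payload fields are the standard Alertmanager v2 silence schema; the helper names, author string, and 4h duration are illustrative placeholders, not spicerack's actual API.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

def build_silence(host, duration_hours=4, author="spicerack", comment="host reimage"):
    """Build an Alertmanager v2 silence payload matching all alerts for a host.

    The instance label currently includes a port (e.g. cp4021:9331), so a
    regex matcher "<host>:.*" is used to catch every exporter on the host.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    now = datetime.now(timezone.utc)
    return {
        "matchers": [{"name": "instance", "value": f"{host}:.*", "isRegex": True}],
        "startsAt": now.strftime(fmt),
        "endsAt": (now + timedelta(hours=duration_hours)).strftime(fmt),
        "createdBy": author,
        "comment": comment,
    }

def create_silence(host, api="http://alertmanager-eqiad.wikimedia.org/api/v2"):
    # Either site's endpoint works: the alertmanagers are clustered and
    # silences replicate. Access is host-based (the apache rw allowlist,
    # profile::alertmanager::api::rw), so this must run from an allowed
    # host such as the cumin servers.
    body = json.dumps(build_silence(host)).encode()
    req = urllib.request.Request(
        f"{api}/silences",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["silenceID"]
```

Deleting the returned `silenceID` via `DELETE /api/v2/silence/<id>` would cover the "remove silences" half of the use case.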