[11:04:08] godog: quick one for you, how can I see all the labels/properties of the alertmanager version of the checks for the services in service::catalog?
[11:04:32] I checked the alerts repository and I guess they will have an instance label that is the service FQDN, but I'm not sure from there
[11:04:40] (e.g. apertium.svc.codfw.wmnet)
[11:05:57] volans: the easiest is probably to check one of the blackbox metrics such as probe_success for job probes/service, i.e. https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Bjob%3D%22probes%2Fservice%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[11:06:35] not the FQDN in instance, but the service's key in service::catalog (and port) in this case
[11:06:59] ok, so service-name:port
[11:07:16] exactly
[11:07:44] and if I wanted to downtime everything related to one of the services, would filtering by instance be good enough?
[11:07:50] or would it potentially downtime unrelated stuff too?
[11:08:19] by instance and job=probes/service will be more accurate for sure
[11:08:41] though even by instance alone I don't think it'll be a problem
[11:08:50] or put another way, if it is, then we're doing something wrong
[11:09:10] I was wondering if we might have name conflicts, hope not, but can't be 100% sure :D
[11:09:24] yeah, like if we're naming a host after a service
[11:09:43] but then one could argue that the downtime is still semantically correct
[11:09:44] but the port should be different
[11:10:08] yeah, it is hard for me atm to imagine a conflict
[11:10:20] ack, thanks
[11:10:27] if you want to be 100% sure though, filtering by both job and instance is accurate
[11:10:30] sure np
[11:11:03] something else to note is that there are the usual idiosyncrasies of service names and service::catalog
[11:11:14] e.g. text and text-https, since they are two different catalog entries
[11:11:43] yes I know, but because this is for https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/775904 I have the catalog already and hence the actual names :)
[11:11:55] basically I'd like to add downtime capabilities to that module
[11:12:13] ah got it! nice
[11:13:14] let's see how long it will take for a schema change in puppet to break the spicerack side of it :D
[11:13:32] I have a patch to add comments in the puppet types and templates that say to change spicerack too...
[11:13:43] hahah! I was thinking about that too
[11:13:44] but you know how much we read old comments when patching stuff ;)
[11:14:02] yeah, rarely, guilty as charged too
[11:14:04] the good part is that now the puppet side has all custom types
[11:14:16] so at least that part should be easier to spot
[11:14:31] some default values are in the templates though
[11:14:49] that's https://gerrit.wikimedia.org/r/c/operations/puppet/+/778332 if you're curious
[11:15:37] thanks, yeah, that should help a little
[11:15:47] godog: last one, in the alerts repo the filter is on job=~"^probes/.*"
[11:16:00] should I use that or probes/service specifically?
[11:16:43] mmhh yeah I don't know at this stage tbh, the probes/ namespace was my attempt to future-proof the alerts a bit, but realistically I don't know if/how much we'll be using that
[11:16:56] so job=probes/service I think will work for now
[11:17:09] ok, the results seem the same on thanos
[11:17:27] indeed they are now
[11:17:31] gotta go to lunch
[11:17:57] thanks, ttyl
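
A minimal sketch of the label inspection discussed above, assuming the standard Prometheus HTTP query API that Thanos exposes; the endpoint URL, timeout, and printed fields are assumptions, not something stated in the conversation.

import requests

# Query the same metric as the Thanos graph link above.
THANOS_QUERY_URL = "https://thanos.wikimedia.org/api/v1/query"

response = requests.get(
    THANOS_QUERY_URL,
    params={"query": 'probe_success{job="probes/service"}'},
    timeout=10,
)
response.raise_for_status()

for result in response.json()["data"]["result"]:
    labels = result["metric"]
    # "instance" here is the service's key in service::catalog plus the port
    # (service-name:port), not the service FQDN.
    print(labels.get("instance"), labels)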
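
A second sketch, this time of the downtime idea, matching on both job and instance as suggested above via the Alertmanager v2 silences API; the Alertmanager URL, instance value, duration, and creator are illustrative, and a real implementation would live in the spicerack module rather than call the API directly like this.

from datetime import datetime, timedelta, timezone

import requests

# Illustrative values: the Alertmanager host and the instance (service:port)
# are assumptions, not taken from the conversation.
ALERTMANAGER_URL = "https://alertmanager.example.org/api/v2/silences"
now = datetime.now(timezone.utc)

silence = {
    "matchers": [
        # Matching on both job and instance is the more accurate option
        # discussed above; instance alone should also be safe in practice.
        {"name": "job", "value": "probes/service", "isRegex": False},
        {"name": "instance", "value": "apertium:2737", "isRegex": False},
    ],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(hours=2)).isoformat(),
    "createdBy": "spicerack",
    "comment": "downtime for service maintenance",
}

response = requests.post(ALERTMANAGER_URL, json=silence, timeout=10)
response.raise_for_status()
print("created silence", response.json()["silenceID"])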