[08:01:11] mutante: personally I think the loud/visible alerts, the higher level/layer they are, the better; and for deep diving we also want internal http/tcp checks where applicable, though at lower severity
[13:11:30] godog: hey, I have VMs failing puppet with 'parameter 'timeout' expects a match for Pattern[/\d+\s/], got '3s'', seems to be caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/815685
[13:13:33] taavi: ugh, thank you, you are right
[13:13:36] looking
[13:13:52] serves me right for not testing the change from String[1]
[15:18:29] hey o11y -- I didn't actually get paged through VO for any of the ProbeDown/FrontendUnavailable alerts starting around 15:03, just got the #page hotword in IRC -- I've lost track, is that expected or no?
[15:18:45] (alertmanager alerts, that is)
[15:19:04] rzl: were you oncall in the business hours rotation at the time of the page?
[15:19:10] yes
[15:19:18] then you should have received them...
[15:19:56] lmata: ^^^ I think this is for you
[15:20:27] ack, in meeting atm but we'll look into it
[15:20:45] thanks for raising this.
[15:21:06] thanks <3 I did get the pages last night, but those were via icinga and also batphone hours -- not sure which of those was the significant difference
[15:21:40] I have an interview 19-20 UTC today, but happy to field test pages any *other* time :D
[15:54:37] rzl: I think I see what happened: those alertmanager alerts are routed using a VO routing key called 'sre-batphone', which was not configured to hit the business hours oncall (due to unfortunate naming). This routing key is now associated with the business hours oncall, and we'll also go back and probably rename it to something not related to the paging rotation name
[15:56:19] ah, cool -- stupid question, but after the change just now, will that routing key still work on the weekend too?
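The Puppet type error above can be reproduced outside Puppet. In the data type `Pattern[/\d+\s/]`, `\s` is the *whitespace* character class, so a duration like `'3s'` with a literal `s` suffix can never match. A minimal Python sketch; the "intended" pattern here is my assumption about the intent, not taken from the actual Gerrit change:

```python
import re

# Pattern[/\d+\s/] requires digits followed by a whitespace character,
# so '3s' (digits + literal 's') is rejected, producing the error above.
pattern_as_written = r"\d+\s"  # \s = whitespace class
pattern_intended = r"\d+s"     # literal 's' suffix (assumed intent)

assert re.search(pattern_as_written, "3s") is None      # '3s' rejected
assert re.search(pattern_intended, "3s") is not None    # '3s' would match
```

(Puppet's `Pattern` matching is unanchored, which `re.search` mimics; the outcome for `'3s'` is the same either way.)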
:)
[16:04:36] rzl: here's how it's configured, the change just now added 'sre-batphone' to the list of routes https://usercontent.irccloud-cdn.com/file/cHDnFawy/Screen%20Shot%202022-07-20%20at%2012.00.48%20PM.png
[16:05:15] thanks!
[18:30:08] just confirming this again.. if a blackbox alert is currently not alerting, you can't see it. only alerts that are active are visible, is that right?
[18:30:27] sometimes I feel the urge to link to an OK / not-alerting check.. just to show it's indeed green
[18:30:40] but then I can never find those.. but I think that's normal
[18:31:11] it's just a completely different concept in icinga vs alertmanager/prometheus
[18:34:15] mutante: that's correct. the rule and its state can be viewed on the prometheus UI, but it requires port forwarding to access it at the moment
[18:37:31] ACK, thanks cwhite
[18:37:46] kind of makes it a bit hard to confirm a new check actually works
[18:37:56] all I can do is wait for _not_ getting a message
[18:38:38] but yea.. I configured a "-test" IRC channel for alerting output
[18:38:44] and nothing is in there
[18:38:59] mutante: what service are you wanting to check state on?
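As noted above, a rule that isn't firing is invisible in Alertmanager but still visible in Prometheus itself. The Prometheus HTTP API exposes this at `/api/v1/rules`, which lists every alerting rule with its state, including `inactive` (the "green OK" Icinga would show). A minimal sketch, assuming a port-forwarded Prometheus at `localhost:9090` (hostname and port are placeholders):

```python
import json
import urllib.request

def alerting_rule_states(payload: dict) -> dict:
    """Map alerting-rule name -> state ('inactive', 'pending', or 'firing')."""
    return {
        rule["name"]: rule["state"]
        for group in payload["data"]["groups"]
        for rule in group["rules"]
        if rule.get("type") == "alerting"
    }

def fetch_rule_states(prom_url: str = "http://localhost:9090") -> dict:
    # Assumes a port-forward / ssh tunnel has made Prometheus reachable here.
    with urllib.request.urlopen(prom_url + "/api/v1/rules") as resp:
        return alerting_rule_states(json.load(resp))
```

A rule reported as `inactive` is exactly the "not alerting, and that's fine" state that can't be linked to from Alertmanager.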
[18:39:02] in Icinga though, I would have resolved the ticket with a link to the green OK and been sure
[18:39:14] for example vrts on otrs1001
[18:39:19] but also gitlab
[18:41:30] https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fcustom&var-module=All&orgId=1
[18:41:37] the "otrs1001:1443"
[18:42:26] and then unrelatedly I am interested in phab1001:443 which is at 50%, but there is the pending change to switch it to prometheus.. which I am concerned about because that will page the entire SRE
[18:42:46] the 50% there kind of confirms my concerns about false positives
[18:43:10] what I really wanted for now was to confirm my new otrs1001 checks are "green"
[18:43:29] because I had what felt like 10 follow-ups to make it work :)
[18:43:48] wrong protocol, not following redirects, my service does not listen on IPv6, wrong path.. :)
[18:48:58] mutante: this may help?
[18:49:01] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22thanos%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22probe_success%7Binstance%3D~%5C%22gitlab.%2B%7Cotrs.%2B%5C%22%7D%22,%22editorMode%22:%22code%22,%22range%22:false,%22instant%22:true,%22format%22:%22table%22,%22exemplar%22:false%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%
[18:49:03] 7D%7D
[18:49:29] bah, exceeded line length
[18:51:12] I combined the 2 lines and got to https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22thanos%22,%22queries%22:%5B%7B%22refId%22:%22A%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D
[18:51:46] but that is an empty screen..
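The Grafana Explore steps being worked through here (instant query, table format) can also be done directly against the Prometheus HTTP API with the same PromQL. A sketch, assuming a reachable query endpoint (the URL is a placeholder): an instant query at `/api/v1/query` returns only the latest sample per series, which is what Grafana's Table + Instant options render.

```python
import json
import urllib.parse
import urllib.request

def to_table(payload: dict) -> list:
    """Flatten an instant-query response into (instance, value) rows."""
    return [
        (r["metric"].get("instance", ""), r["value"][1])
        for r in payload["data"]["result"]
    ]

def instant_query(base_url: str, promql: str) -> list:
    # e.g. instant_query("http://localhost:9090",
    #                    'probe_success{instance=~"gitlab.+|otrs.+"}')
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        return to_table(json.load(resp))
```

Each row is one probe target and its current `probe_success` value, so a new check is "green" when its row shows "1".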
so I hit the "run query" button
[18:52:12] don't really see a result though
[18:52:19] `probe_success{instance=~"gitlab.+|otrs.+"}`
[18:53:22] ah, so I put this into the "metrics browser" field and hit "run query"
[18:53:26] in the options, I set Format: "Table" and Type: "Instant", which gave me a list of the probe metrics and their latest value
[18:53:29] then I get a straight blue line
[18:54:11] it seems there should be 3 lines, a yellow, a green and a blue one
[18:54:23] or they are just all "1"
[18:54:35] checking options now
[18:55:12] They're all "1". The table should highlight this
[18:55:24] I am in "preferences" but can't see that
[18:56:01] I think I see now what you mean.. trying
[18:56:38] ok, I got a table with a bunch of "probe_success" fields in it. thank you!
[18:56:54] \o/
[18:58:00] what's the opposite of "probe_success" in the query?
[18:58:04] probe_failure?
[18:59:07] probe_success being 0?
[18:59:35] ok, makes sense
[19:00:54] it's kind of the opposite of Linux return codes though
[19:01:13] success 0 = did not succeed?
[19:01:51] I would expect 0 is success and 1 is failure because of that
[19:03:55] you can think of 1 = true and 0 = false, if that's helpful
[19:04:47] yea, I mean.. once I know it's ok.. it's just messing with my head a bit because I am so used to 0 being success with everything
[19:05:13] and Icinga is therefore the straight opposite
[19:05:35] no worries though, if this 1 here means it works.. then it works and I'm happy
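To spell out the convention discussed at the end: `probe_success` is a boolean gauge (1 = true = probe succeeded, 0 = false = probe failed), the inverse of Unix exit codes where 0 means success. There is no separate `probe_failure` metric, so the "failed" query is `probe_success == 0`. A tiny sketch of both conventions side by side:

```python
# Unix convention: exit status 0 means success, non-zero means failure.
EXIT_SUCCESS = 0

def command_failed(exit_status: int) -> bool:
    return exit_status != EXIT_SUCCESS

# Prometheus blackbox convention: probe_success is a boolean gauge,
# 1 = probe succeeded, 0 = probe failed.
# PromQL equivalent for "failed probes":  probe_success == 0
def probe_failed(probe_success_value: float) -> bool:
    return probe_success_value == 0

assert not command_failed(0) and command_failed(1)  # 0 is success for commands
assert not probe_failed(1) and probe_failed(0)      # 1 is success for probes
```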