[10:44:18] btullis: i thought it might be better to move the conversation here as it's more in the o11y bag than mine, and i want to avoid saying something wrong
[10:44:49] in relation to the question: yes, with how things are configured post that change, anything marked as page will go to the core sre oncall team
[10:44:50] Yep, cool.
[10:45:06] whether this is desired or not is probably more for you to decide
[10:45:30] however the main reason for this is that during sprint week we started to tag alerts with teams automatically
[10:45:58] e.g. https://gerrit.wikimedia.org/r/c/operations/alerts/+/902457
[10:46:00] Yep, that's absolutely fine by me.
[10:46:44] so the end result is that we want everything that should page to always page sre core team
[10:47:14] however also allow teams to handle and route the alert their own way
[10:47:56] OK, great. So does that mean that in this case we can *also* get disk space alerts to page data-engineering-page, for servers that we own?
[10:48:06] i think in a perfect world each team would have their own oncall rotation and they would get the initial page, then escalate to whoever needs be, i.e. they wake sre core if needed
[10:48:46] but as things stand we just have the sre core team, so the previous data-engineering config looked like it may miss sending pages to the correct team
[10:49:12] well the disk space alert is only warning
[10:49:29] but if it was paging then yes it would send to sre core oncall and data-engineering-page
[10:50:04] of course all of this is a moving target and i'm sure godog and others would welcome feedback or alternate options
[10:50:22] and as said feel free to revert if this is not desired
[10:50:49] Ok, got it. Many thanks for the explanation. I'm all fine with it. I should get Steve signed up to victorops.
[10:51:35] np
[11:03:42] yes basically +1 to what jbond said, all correct
[11:04:30] btullis: indeed an oncall rotation for data-engineering is possible, i.e. a separate team and we can set that up
[11:06:17] godog: Thanks. We still use the analytics team at the moment https://portal.victorops.com/dash/wikimedia#/team/team-4bCl5lW31cloz1WT/users but there are only two of us on it and not much wired up to it.
[11:06:39] I just invited steve_munene as per: https://wikitech.wikimedia.org/wiki/Splunk_On-Call#Invite_a_new_user
[11:06:58] btullis: sounds great! thank you for that
[11:07:29] but yes as John mentioned we can revert the change for pages to go only to data-engineering and they can be escalated to sre as needed, that works too
[11:07:43] ...but we're happy to keep improving things and I'm sure we will want to get more pages over time, if you know what I mean :-)
[11:08:26] Nah, keep it as it is, thanks. Good progress. We can review as we start using it more.
[11:08:54] * godog nods
[11:11:00] We've still got one victorops page alert wired up from Icinga and actually that pings my phone if superset-next.wikimedia.org is down and it shouldn't. I've just lived with it for 18 months and not fixed it. https://icinga.wikimedia.org/cgi-bin/icinga/config.cgi?type=services&item_name=an-tool1005%5ECheck+that+superset+http+server+is+responding+ok
[11:12:30] ack, yeah we can move that to prometheus-based checks and alerts nowadays, happy to review too, we'll eventually get to it as well
[11:13:29] Q4 we'll be auditing icinga anyways to see what state of decom we can get it to by the end of next FY
[11:15:01] +1 thanks.
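
Editor's note: the routing policy described in this conversation (everything marked as page goes to the SRE core on-call, and a team-tagged alert additionally goes to that team's page receiver) can be sketched roughly as below. This is only an illustrative model, not the actual alerting configuration. The receiver name `sre-core-page`, the `Alert` type, and the `receivers_for` helper are hypothetical, and the `<team>-page` naming is assumed from the `data-engineering-page` receiver mentioned in the log.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical receiver name for the SRE core on-call; not taken from the real config.
SRE_CORE_RECEIVER = "sre-core-page"


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str               # e.g. "page" or "warning"
    team: Optional[str] = None  # team tag attached to the alert, if any


def receivers_for(alert: Alert) -> List[str]:
    """Return the paging receivers for an alert under the policy discussed in the log."""
    # Only alerts marked as "page" reach any paging receiver.
    if alert.severity != "page":
        return []
    # Every paging alert always goes to the SRE core on-call...
    receivers = [SRE_CORE_RECEIVER]
    # ...and, if tagged with a team, also to that team's page receiver
    # (naming pattern assumed from "data-engineering-page").
    if alert.team:
        receivers.append(f"{alert.team}-page")
    return receivers
```

Under this sketch, a disk space alert with severity `warning` pages nobody, while the same alert at severity `page` and tagged `data-engineering` would reach both the SRE core receiver and `data-engineering-page`, matching the answers given at 10:49.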