[07:29:15] I have a question about pages/victorops/alertmanager: we had acknowledged a page on victorops, and added a karma ack (!ACK message) on alertmanager, but after the victorops ack expired it paged again. What should the process have been to avoid that?
[07:29:43] Resolving on the victorops side and !ACK on alertmanager? Using an icinga downtime? Any other?
[07:30:19] Currently I have resolved on victorops, and still have the ack on alertmanager (see https://alerts.wikimedia.org/?q=instance%3Dcloudnet1004%3A9100), nothing on icinga
[08:35:12] dcaro: best to ask in #w-observability
[09:02:46] jynus: you happy for me to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/802165 now?
[09:03:11] yes, I am around to test it
[09:03:33] cool, merging
[09:03:38] ...
[09:03:47] merged
[09:03:59] will run puppet and restart prometheus
[09:04:06] ack, thanks
[09:09:23] jbond: what are the implications of sudo_user => bacula?
[09:09:37] what grants does nagios receive from that, exactly?
[09:10:42] I see "nagios ALL = (bacula) NOPASSWD: /usr/bin/check_bacula.py --icinga"
[09:12:50] it is an exact match, right?
[09:16:15] prometheus exporter looking good
[09:16:47] and if I try to do "sudo -u nagios sudo -u bacula /usr/bin/check_bacula.py --icinga --list-jobs" it doesn't succeed, so all good on my side
[09:25:28] jynus: great, thanks
[10:43:02] jbond: thanks
[10:45:40] np :)
[11:26:11] jbond: I'm curious though, what do you do in core sre for alerts?
[11:35:41] jynus: sorry, I missed your question. But yes, it does an exact match on whatever command is passed
[11:41:44] dcaro: curious, was this definitely the same incident, or did it clear and re-fire? tbh I'm not sure of the answer, I'm not sure I have seen this happen before (perhaps we have been lucky or someone is adding some override), so I was also curious to see what o11y says.
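[Editor's note: the sudoers rule quoted above lists the command with its arguments, and sudo then requires an exact match on the full command line, which is why appending --list-jobs fails. A sketch of that behavior (the rule itself is quoted from the chat; the match/no-match examples are illustrative):]

```
# Rule quoted above: nagios may run exactly this command as user bacula
nagios ALL = (bacula) NOPASSWD: /usr/bin/check_bacula.py --icinga

# Allowed (arguments match exactly):
#   sudo -u bacula /usr/bin/check_bacula.py --icinga
# Denied (extra argument, no longer an exact match):
#   sudo -u bacula /usr/bin/check_bacula.py --icinga --list-jobs
```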
[11:42:23] I don't see anything obvious in either the splunk oncall or alertmanager wikitech pages
[11:43:49] jbond: it was the same incident, it's just that there's some default 1-day expiry on the victorops side for an 'ack', and when that happens it pages again for the same issue (no clearing and re-firing)
[11:47:17] dcaro: if the expiry is a day then I expect we either fix it in that time or depool and disable paging
[11:48:17] things are not so easy sometimes xd, how do you disable paging?
[11:48:32] cookbook?
[11:55:56] dcaro: alerting/paging can be tweaked with `profile::monitoring::notifications_enabled`, `profile::monitoring::do_paging` and `profile::monitoring::is_critical` (no idea if these work with wmcs), however I think an alertmanager silence would also help (cc cwhite herron to keep me honest)
[11:57:57] that works for whole hosts, but some alerts are not host-related (e.g. some run on alert1001 but check other hosts), those can't be tweaked there, right?
[12:01:40] dcaro: correct. in which case there are silences and icinga downtime (and there is a cookbook for that)
[12:04:02] yep
[12:04:14] (but you can't do it from your phone xd)
[12:06:23] if a prod service was in a paging state for 24 hours then someone should probably have been on their laptop fixing it about 23 hours and 55 minutes ago ;)
[13:47:18] [Cross-posting from Slack channel]
[13:47:19] The Global Data and Insights team is getting organized to distribute the annual Community Insights survey, using the Emailuser API. We will be sending this to 25000+ users. I wanted to check if there is any suggested delay for a certain number of API calls to avoid overloading.
[13:49:25] https://www.mediawiki.org/wiki/API:Etiquette
[13:49:57] I don't have much experience with that, but I know every time someone is rate-limited they get pointed at that page ;D
[13:50:15] though that is likely different from the email api, dunno
[13:52:42] Thank you.
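[Editor's note: the alertmanager silence suggested above can be created from the CLI with amtool. A hedged sketch only: the matcher is the instance from the chat, but the alertmanager URL and author are placeholders, and flags shown are standard amtool ones:]

```
# Silence all alerts matching the instance from the chat for 24 hours.
# --alertmanager.url is a placeholder; point it at the real Alertmanager API.
amtool silence add instance=cloudnet1004:9100 \
    --duration=24h \
    --author=dcaro \
    --comment="acked in VictorOps, investigating" \
    --alertmanager.url=https://alertmanager.example.org
```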
The page doesn't seem to mention any hard limit, so just be reasonable I guess.
[13:53:22] KCVelaga: I suspect that may be perceived as spam and misuse of the peer-to-peer email user feature, partly because of how those emails will appear to users (as coming from an odd specific user account, instead of as from the foundation). afaik we usually distribute such communication by other means, so my suggestion would be not about how to use the API, but first to consider the overall approach.
[13:55:07] I suggest reaching out to AHT in Product and/or CommRel, both of which are doing these kinds of communications more regularly, e.g. around surveys and board votes. There's probably a maintenance script we can run server-side for this via the WikimediaMaintenance extension, or via the MassMessage extension.
[14:02:06] Krinkle: yes, we will be using a WMF-associated account, and are working with T&S for the required permissions to send this message out. In the past, these emails used to be sent through an external service (Qualtrics), and we would like to shift from that. The server-side execution doesn't trigger Echo notifications, which are crucial for response rate. The MassMessage option reveals the sampled users, which can compromise the anonymity of the survey.
[14:08:02] KCVelaga: ack, that makes sense, hence the email approach. Got it. I understand the desire for Echo, but that's just something we haven't built that way, and I suppose there's also a balancing argument to be had about how much signal is too much in the greater scheme of things, and potentially an argument for how professional it comes across to the community by making use of familiar and expected paths within the culture rather than something novel and not fit for purpose. My intuition would guide away from the Emailuser API for this purpose.
[14:08:20] Anyway, if CommRel have recommended this approach after you described the need, then API:Etiquette indeed basically describes it, and the [[m:User-Agent policy]] page linked from there is important as well.
[14:10:28] Thanks for the feedback, Krinkle!
[14:10:49] KCVelaga: our API policy is mainly focused on concurrency, not throughput, so: zero concurrency, one request at a time, as fast as the API responds. Don't worry about the rate per minute so long as your script won't start the next request before the previous one is finished.
[14:12:25] That is helpful. Thank you.
[16:43:49] https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts pointed to https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?orgId=1 which doesn't exist anymore, btw. I am not sure which dashboard replaced that one, but we should update the link
[16:49:29] akosiaris: thanks for pointing that out. I updated that link
[16:57:09] thanks! I tweaked it again
[16:58:20] so, an interesting thing here: there was a [minor] traffic drop at esams at the time of the NEL event. however, the traffic drop alert did not fire for it (and generally, the traffic drop alert fires when a scraper "goes away"/finishes after sending a lot of traffic)
[16:58:40] honestly I'm tempted to suggest retiring the traffic drop alert
[16:59:09] cdanis: we have talked about it, yep :)
[17:00:02] oh cool
[17:19:29] What's the etiquette for when a talk page item isn't addressed, probably because nobody's watching it? Be Bold and change it anyway?
[17:25:17] brett: what do you mean by "change it anyway"?
[17:25:51] cdanis: I had proposed a change but wanted some consensus before committing it
[17:26:35] being bold and just doing it is a sensible default, depending on the nature of the change
[17:28:10] Thanks for that. I figured that was the case but wanted to make sure.
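[Editor's note: the "zero concurrency" advice above can be sketched as a strictly serial loop. send_email here is a hypothetical placeholder for the real Emailuser API request, not an actual MediaWiki client call:]

```shell
#!/bin/sh
# Strictly serial sending: each request starts only after the previous one
# has finished, so the effective rate self-adjusts to how fast the API responds.
send_email() {
    # Placeholder for the real API request (e.g. action=emailuser via curl);
    # a real version would block here until the HTTP call completes.
    echo "emailed $1"
}

for user in Alice Bob Carol; do
    send_email "$user" || break   # stop on the first failure
done
```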
[17:30:23] For a simple case, I was hoping to retire the ancient https://wikitech.wikimedia.org/wiki/VPN - is it correct to move it to the "Obsolete" namespace?
[17:30:49] +1
[17:30:59] gracias
[18:40:45] https://wikitech.wikimedia.org/wiki/Puppet_Hiera#Organization mentions variable lookup in a somewhat confusing way. The article was written in 2014, so I'm guessing it's outdated, but it would be good to get accurate clarification on how Hiera is used (see the talk page)
[20:16:26] TIL: digrc! +nostats +nocomments +nocmd +noquestion
[20:18:12] yes! mine is just "+noall +answer", which is probably close to that
[20:26:13] oh, very nice, I stole mine from the first result on google for digrc, since the manpage doesn't give many details on usage
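[Editor's note: ~/.digrc is simply a file of dig options that dig reads on startup and applies before the command-line arguments. A minimal example using the options mentioned above:]

```
# ~/.digrc -- these options are applied to every dig invocation
+noall +answer
```

The "+nostats +nocomments +nocmd +noquestion" variant mentioned above achieves a similar terse output by suppressing individual sections instead of disabling everything and re-enabling the answer section.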