[12:03:04] hnowlan, hey, you merged https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/815311 some time ago. Could you get that change deployed?
[12:08:05] zabe: for sure, will do that later today
[12:12:26] thanks :)
[14:31:10] zabe: done
[16:10:54] if we have 3 people on vacation and they all enable their out-of-office agent and stay on root mails.. it quadruples the number of root mails for everyone, basically
[16:11:40] The joys of email >.>
[16:13:01] it's best when ticket systems keep responding to each other
[16:15:30] mutante: if we switched root to a google group, or a mailing list, then I don't think gmail would send out-of-office replies
[16:17:17] jhathaway: I think root mail should be treated like maint-announce@ mail: a real shared inbox, where you know whether somebody else has replied or not. let's discuss at the SRE summit, it's already a suggested topic there
[16:17:52] mutante: sounds good!
[16:17:57] imho there is little point in subscribing us all to root@ only to then have everybody set up a filter rule to _not_ read it. and if nobody reads it.. that is also bad
[16:18:10] mutante: for sure
[16:23:25] Could it be part of clinic duty?
[16:24:02] That way we *can* ignore it until it's our turn
[16:25:37] I think that's definitely one option, yea
[16:25:46] it actually is part of clinic duty today
[16:26:17] but I don't think folks always have the time to tackle all the email, and some of the issues are hard to solve in only a week
[16:28:10] maybe we might as well just forward root mail to maint-announce, the same inbox
[16:28:27] or a second shared one
[16:45:32] how do people know whether/when they are on on-call duty? The roster on office wiki has been deleted and the replacement link does not work for me.
[16:47:28] mutante: klaxon or calendar might be best
[16:48:31] RhinosF1: Klaxon works, but only if you are already on it
[16:48:46] mutante: there's a calendar that I can't find a link to
[16:49:43] mutante: here are the current methods https://wikitech.wikimedia.org/wiki/Splunk_On-Call#Viewing_the_business_hours_pager_schedule
[16:50:16] herron: isn't there a google calendar?
[16:50:26] Yes!
[16:50:28] https://calendar.google.com/calendar/u/0/embed?src=kpcomsk13n79pcni0bnibndold4n15dj@import.calendar.google.com
[16:50:31] RhinosF1: yes, it's the first link in that doc
[16:50:31] It is on there
[16:50:36] herron: I'm blind
[16:54:25] herron: thank you, it works. I just still have to manually transfer the weeks that apply
[17:01:22] mutante: sure, np. fwiw I just adjusted the article, hopefully it's clearer now
[17:01:48] we could probably use some links to it as well, maybe that empty roster should point there?
[17:02:20] herron: that's easier for my tired brain :)
[18:10:56] jbond: Hello John, is it okay to merge this change? O:redis::misc::slave: add password
[18:11:45] denisse|m: yes please do
[18:12:31] jbond: Merged, thank you. :)
[18:12:35] thanks
[22:37:36] odd thing, on search-loader1001.eqiad.wmnet there is a `puppet agent --onetime --no-daemonize --verbose --show_diff --no-splay` running at 100% cpu since May 22, having used 122,898 hours of cpu time. What could it be doing, and should it be killed?
[22:40:24] i guess it doesn't seem to have hurt anything, puppet runs are still proceeding as normal (last run a few minutes ago), but it seems odd
[22:51:17] (probably minutes, not hours)
[22:56:42] ebernhardson: want me to kill that?
[22:56:51] mutante: sure, thanks
[22:59:09] done. Killed the process, it was all gone, then ran it again as normal
[23:01:14] ebernhardson: do you also know about this? 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters
[23:02:33] mutante: it should be due to the bullseye upgrade when one of the master-capable instances gets upgraded, but it should resolve itself after 30 minutes or so (however long the reinstall takes)
[23:03:09] can check which hosts are master-capable for that cluster and verify, sec
[23:04:11] as long as it's not super bad that there are 2 competing masters and you know there is an upgrade going on.. I am not concerned
[23:04:59] it's fine, it's not that there are 2 competing masters, but rather that it requires 2 of the 3 master-capable instances to agree on forming a cluster. If one of the two remaining ones disappears, the cluster will refuse to do much of anything
[23:05:00] maybe it can be downtimed for the next upgrade
[23:05:11] alright
[23:11:20] i think our longer-term plan is to increase to 5 master-capable instances and alert when only 3 are available, to reduce spam. mostly that alert is supposed to make us aware that if any more instances go down, the cluster will stop working. Probably the alert should only go to us and not everyone though, it's not generally actionable
[23:44:51] ebernhardson: in another matter, I see that puppet has been disabled on the wdqs* hosts for 10 days. when those hosts drop out of PuppetDB that will cause more issues
[23:53:14] wcqs that is. it's going to be messy if they become unknown to the DB, so if there is a way to enable them, let's do that. we have the "insetup" role for this if they can't run the full role yet
[23:55:23] gotta go afk, laters
[23:55:59] doh, i probably forgot to re-enable them, checking
[23:57:34] fixed
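
For reference, a minimal sketch of what re-enabling the agent on one of those hosts might look like. This is not taken from the log: the exact lock-file location depends on packaging, so the first command just asks puppet where it is, and the last command simply reuses the agent options quoted earlier in the transcript.

    # show where the disable lock file lives (it holds the disable reason, if any)
    sudo puppet config print agent_disabled_lockfile
    # re-enable the agent, then trigger a manual run with the same options seen earlier in the log
    sudo puppet agent --enable
    sudo puppet agent --onetime --no-daemonize --verbose --show_diff --no-splay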