[07:52:29] how does one escalate to the batphone?
[07:53:45] victorops -> create incident -> select sre batphone from the teams/policies dropdown?
[07:55:57] that's one way yeah, also klaxon comes to mind
[07:56:58] also emailing the batphone VO address, which I won't repeat here but it is in icinga's config at least
[07:57:12] is klaxon equivalent? it says it can "wake up an SRE", which kinda implies it's maybe not using the batphone rotation?
[07:57:39] or is this just accounting for Amir TZ™
[07:59:32] haha! fair question, IIRC at the moment klaxon rings the batphone
[07:59:59] true that in the future that might not be true
[08:07:59] klaxon should probably go to the primary rotation (oncall person when there is one, batphone otherwise) IMHO
[08:09:01] agreed
[08:15:18] <_joe_> so yeah we'll need to write down a way for the oncall people to escalate the issue if they can't solve it themselves
[08:15:20] godog: I recall at least on the VO app a chat where you could send a message to a specific person, and that would trigger their notification (page/email/sms based on their settings). I'm sure I've tested this in the past, but I fail to find it now. Was it removed?
[08:16:16] volans: I remember we tested that during the 2020 all hands in SF
[08:16:28] yeah I recall the same
[08:16:30] But I couldn't find that a few days ago when I had to page toprank.s
[08:18:18] yeah I don't seem to find it either, at least from the phone app
[08:18:30] I tried on the browser too
[08:19:48] <_joe_> the site and the app have different features, TIL
[08:20:27] not the same thing but to engage individual people you can create a new incident and send it to one/multiple people, just tried with you volans
[08:23:19] godog: I got it, but then what would be the workflow? I need to ack it?
and if it was sent to multiple people on purpose (like we need all of them ideally) that might silence additional triggers to them
[08:23:59] * volans did ack it fwiw
[08:24:43] volans: I don't know what would be the workflow, I'm trying to see if there's a replacement for the chat functionality you mentioned
[08:24:52] thanks
[08:25:18] <_joe_> I would assume that people oncall might want to send out an everyone page if they don't know what to do
[08:26:10] yeah, the thing is it is better to be done via the batphone than klaxon, right?
[08:26:17] I mean via VO
[08:26:37] I think we need both, escalate to a specific team/people (like I need a mediawiki expert) or to the batphone
[08:27:01] Yeah, but VO if sent to the batphone will only page people who are not sleeping and klaxon will wake up everyone?
[08:27:08] marostegui: yes klaxon should go to the current rotation, so the oncall people only IMHO, it is meant to reach the current oncall not everyone
[08:27:31] ah right
[08:27:48] so we need a different way to escalate
[08:28:03] yeah
[08:28:46] <_joe_> yep, the most natural way is to escalate to batphone, which happens automatically after 5 minutes
[08:29:11] <_joe_> but I don't think 5 minutes are always enough to know if you need to
[08:30:06] and I expect people will ack the alert if they get it, because they are looking at it
[08:30:21] not sure if there is a way to "unack" it and let it escalate
[08:32:47] IIRC you can unack, not sure about things like immediate escalation though, just so as not to wait for the timeout
[08:34:59] apparently you can re-route things https://help.victorops.com/knowledge-base/reroute-an-incident/
[08:40:04] ah yeah that ought to do it!
[08:41:32] <_joe_> ok, do we have a wikitech page for oncall where we can start adding this stuff?
[08:41:51] <_joe_> maybe the ONFIRE groups should create it?
[08:41:57] <_joe_> lmata: ^^ :)
[10:56:45] <_joe_> is gerrit slow for everyone?
[10:57:21] _joe_: it certainly has a steep learning curve
[10:57:39] _joe_: oh, wasn't there an update being deployed?
[10:58:02] seems to be ok for me
[10:59:13] the new version is up, the UI too has changed a bit
[10:59:20] not sure if it might be related
[11:04:45] _joe_: ack, I will add some links to VO user docs to an on call FAQ on wikitech, and include some basic cases like overrides etc. I’ll have things in the wikitech page up this weekend
[11:14:43] <_joe_> volans: I guess so, every operation is quite slow atm
[11:41:39] godog: did you have a question for me at https://gerrit.wikimedia.org/r/c/operations/software/+/786941 ?
[11:44:33] Krinkle: yes, basically a "does it look sane/ok to you?" type of question
[11:45:06] godog: ok, but I have never used the script, nor do I have access to where it would be applied
[11:45:34] Tests look good though :)
[11:46:07] Krinkle: hehe thanks! yeah that's good enough for me
[13:03:37] hashar: FYI sometimes gerrit UI seems super slow to load file diffs since the latest update earlier today. Known issue?
[13:04:28] <_joe_> volans: same problem I have, but I suspect it's network related though
[13:04:49] client side or infra side?
[13:05:52] <_joe_> client side
[13:06:06] <_joe_> because when I went via a vpn it all seemed better
[13:07:32] ack, I'll check that angle too
[13:37:20] volans: I am guessing the diff cache got entirely invalidated as part of the upgrade this morning ;)
[14:25:02] mmph. trying to reimage db2071. last output on the console is `Loading debian-installer/amd64/initrd.gz...ok`
[14:25:36] I remember some of those happened for me for that range of servers
[14:25:45] I think there is a task about it
[14:26:12] oh, it finally booted, ~5 mins later
[14:26:15] that's wild
[14:26:29] task is: https://phabricator.wikimedia.org/T216240
[14:26:45] it was in the range of "misbehaving servers"- in theory fixed after a bios update
[14:27:09] jynus: heh.
db2071 is marked as 'fixed' in that one
[14:27:16] yeah :-/
[14:28:06] I remember gathering some statistics and seeing like a 33% higher chance of a freeze on boot *after* the bios update
[14:28:36] https://phabricator.wikimedia.org/T216240#4957077
[14:29:43] nah, it was less- but definitely not 100% fixed: https://phabricator.wikimedia.org/T216240#4965474
[14:50:47] kormat: sounds like maybe the serial redirection from d-i didn't work
[14:53:23] ah, betrayal
[15:40:52] volans: how hard would it be to convince sre.hosts.reimage to _not_ remove the downtime post-reimage?
[15:45:09] kormat: based on a flag or what?
[15:45:15] yeah
[15:45:44] something like --accept-broken-system-after-install ? :-P
[15:46:06] --accept-your-limitations,-program
[15:48:13] not waiting for icinga to be green would also be a plus
[15:49:19] those two things are close together, so it's easy to bundle them
[15:49:55] the only concern I have is that you expect them to not be green for X, but maybe you get a host that is failing X, Y and Z and that goes unnoticed
[15:51:26] i have an icinga page that i monitor: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=^\(db\|dbproxy\|es\|pc\)[12]&style=detail&servicestatustypes=29
[15:51:49] it shows all failing checks, even if the host is downtimed, or has notifications disabled
[15:51:53] it's the only way to know.. anything.
[15:55:17] kormat: the other concern is that the downtime will expire anyway a couple of hours later, when the operator running the reimage might not be around anymore, causing IRC spam
[15:55:30] volans: this is causing irc spam _right now_ :P
[15:55:48] the workaround, which i forgot, is making a puppet change to disable notifications for every host you're reimaging :(
[15:56:13] which won't expire. so you have to look at icinga anyway.
[16:05:23] sure, some other bits to think about
[16:06:26] right after the check_icinga call the reimage calls repool, which so far has always just told the user to repool and how to prevent pooling a host in a bad state
[16:06:34] but might be converted to real repool at some point
[16:18:57] kormat: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/787528
[16:26:12] amazing :D
[16:38:19] kormat: I'm merging it, lmk if it works as expected or you encounter any issue
[16:41:40] volans: i'll be sure to give it a loving try tomorrow 💜
[16:44:07] thanks!
[18:06:44] You can now use new cumin 'owner' aliases to select all hosts owned by a specific team: examples and list at https://phabricator.wikimedia.org/T306830#7888818 (sudo cumin 'A:owner-serviceops' 'uname -r') for example
[21:13:13] Are there any tools available to search wiki text, e.g. I would like to know how many pages in en.wikipedia.org reference info@wikimedia.org?
[21:15:37] jhathaway: https://global-search.toolforge.org/
[21:16:09] RhinosF1: oooh, thanks
[21:17:58] jhathaway: no problem
[21:18:04] https://en.wikipedia.org/wiki/HCard#Live_example references it
[21:19:56] I find it with the regular search box when using "info@wikimedia.org". global-search is cool but still waiting for a result when doing the same over there
[21:21:04] mutante: true, regular search seemed to be tokenizing it into info & wikimedia.org
[21:24:04] Normal search will work if you put " " for single wiki
[21:24:18] quotes seem to help 🤦
[21:24:19] Some reason sleepy brain thought you wanted to search multiple
[21:24:38] RhinosF1: thanks just discovered that as well
[21:25:02] :)
[21:25:31] I got a timeout on global search. regular en.wiki search or global search both need " " , ack
[21:26:00] are you trying to delete that email alias or something?
[21:26:16] I like everything that moves stuff out of "privateexim" module btw
[21:26:53] no, just trying to understand how and where it is used, so we can have discussions with the VRT community about different options
[21:27:09] aha! interesting, good
[22:01:17] bd808 or Reedy : If you have some spare time, could you please take a look at gerrit change 780636 and +1 if you think it's okay? Thanks
[22:04:24] hauskatze: {{done}}
[22:05:29] thanks bd808 :)
[22:05:39] I'll rebase it later, these merge conflicts...