[01:15:39] * bd808 off
[02:17:46] tools-db alerted but seems fine, I'm not sure what's up
[02:23:58] Well folks, that's three outages within 8 hours, only two more and we win a stuffed sonic plushie
[03:28:26] New version of the grid-killing scripts at https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/merge_requests/2 <- new pr because I can't get used to the gitlab patch model
[04:30:38] * dhinus paged: the cloudvirt node cloudvirt1063 is unreachable
[04:31:07] dhinus: I took it offline and tried to silence it but apparently failed.
[04:31:56] no prob
[04:33:04] I'll go back to sleep :)
[04:33:15] sorry for the rude awakening
[09:42:01] * dhinus paged: ToolsToolsDBWritableState
[09:43:49] restarted toolsdb and set to read-write
[09:45:25] did we get any logs
[09:45:26] ?
[09:45:59] nope, because the tools-db-1 VM was restarted 7 hours ago :( and I didn't get to add the "tee" yesterday
[09:46:07] I will do it today
[09:47:34] the cloudvirt1063 alert is still firing
[09:49:57] and phaultfinder created not just one but two phab tasks about it :)
[09:51:43] can I set it to "failed" in netbox? I don't remember if that's the correct procedure
[09:51:48] a.ndrew has already created a task for DCops
[09:52:11] any difference between them? (last time the issue was that the project was wrong, so it would get automatically changed, and then when trying to check if it already existed it did not find it again)
[09:53:20] re setting as failed, I think so yes https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Active_-%3E_Failed
[09:54:30] set to failed
[09:54:49] tasks are T353406 and T353409
[09:54:49] T353406: NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T353406
[09:54:50] T353409: NodeDown Node cloudvirt1063 has been down for long. - https://phabricator.wikimedia.org/T353409
[09:55:01] the second one is "unreachable for more than two hours." and is critical instead of page
[09:58:36] it seems we have a special NodeDown alert for cloudvirts only, but we are not filtering them out from the `down for too long` one, yep
[10:18:33] FYI, I'm running the sre.puppet.sync-netbox-hiera cookbook and it's showing me that cloudvirt1063 is failed in the diff, given backscroll I'll merge that along
[10:24:08] moritzm: thanks, should I run that cookbook when I set a node to "failed"?
[10:32:33] fixed all the puppet-alerts stuff... going to the library, be back in a bit
[11:46:45] dhinus: I think so, yes. I'm not 100% sure myself what it's used for
[12:45:07] yeah it's best to run it; we sync the netbox data for hosts that are status "active", and if they change to or from that status the sync will remove the associated data from hiera (host name, rack location etc.)
[12:45:36] I'm not sure how widely that synced data is used by puppet roles, but best to not leave an outstanding diff that might confuse the next person
[13:33:51] If a tool is stopped and there are no changes to the code or config files, are there any other files in the tool directory (i.e. /data/project/*) that could have their modification times changed/updated?
[13:34:12] I'm trying to sort the tools according to their most recently updated timestamp.
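A minimal sketch of one way to build that ordering, assuming GNU find is available on the host doing the scan and that walking every file under each tool directory is acceptable; the glob and the oldest-first ordering follow the plan described here, and as the replies just below note, the result is only approximately right:

```bash
# Hypothetical sketch: rank tool directories under /data/project by the
# newest file mtime found inside each one, oldest first.
# Assumes GNU find (for -printf); sanity-check the extremes afterwards
# (everything "updated today", or nothing touched since 1970).
for d in /data/project/*/; do
    newest=$(find "$d" -type f -printf '%T@\n' 2>/dev/null | sort -n | tail -1)
    printf '%s\t%s\n' "${newest:-0}" "$d"
done | sort -n
```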
[13:34:12] When the time comes to turn them off, I will then start from the 'oldest'
[13:41:35] we also manually fixed a bunch of kubernetes configurations at some point, and similar stuff
[13:41:43] that might change some timestamps around
[13:42:13] I'd say it's "approximately correct", that might be good enough for what you want to do :)
[13:43:01] though make sure you don't get weird values (like all the tools were updated today, or there are some tools that were not updated since 1970)
[13:50:22] is there a way to set cloudvirt1063 to "maintenance" to disable all the alerts? or should I just silence all the alerts in alertmanager+icinga?
[13:51:07] I guess the canary alert can be fixed by deleting the canary in horizon
[13:52:38] I think that you can move the host out of the ceph aggregate or similar, that might work for a few of the alerts, but I think not all
[13:53:51] I'll try that
[13:54:21] I'm also seeing 4 separate "NodeDown" alerts in alertmanager
[14:21:43] Was my main flaw setting the downtime on the bare hostname rather than on hostname*? It did a nice auto-complete for the former so I foolishly thought that something had changed and that it would work. (And I also didn't anticipate the sneaky 2-hour delay page)
[14:23:12] dcaro: thanks!
[14:24:00] Also dhinus, I did move it out of the ceph aggregate, didn't I?
[14:24:57] andrewbogott: haven't checked yet, I'm debugging some alerting stuff with dcaro
[14:25:30] ok
[14:26:04] did you downtime on alertmanager or icinga or both?
[14:26:52] I think the page came from alertmanager this time, and I can see the "page" alert is acked by andrewbogott
[14:26:56] alertmanager
[14:27:20] maybe you acked it after the alert triggered and sent the page?
[14:27:38] the ACK! is at 4:31 GMT
[14:28:31] the silence is on cloudvirt1063 so I suspect that like you said it was missing the * at the end
[14:29:02] that sounds right
[16:14:27] andrewbogott: "new pr because I can't get used to the gitlab patch model" -- I tend to force push to the dev branch. That makes it mostly like the `git review --no-rebase` command I mostly use with Gerrit.
[16:15:26] GitLab can also show you diffs between force pushes done like that, which makes the review experience a bit more like my typical Gerrit review workflow too.
[16:15:41] Oh yeah, I actually tried to force push but it wouldn't let me
[16:15:49] maybe because I was on a main branch rather than a topic branch?
[16:16:04] But anyway, it's good to know that it's technically possible to do things the way I like to do them!
[16:16:15] komla:
[16:16:16] ah, could be yeah. the primary repo's branch protection might copy to your clone
[16:16:24] https://www.irccloud.com/pastebin/ZkyhFso6/
[16:16:40] ^ poorly-formatted instructions for de- and re-gridding a tool
[16:16:58] I will also send an email to the list with all that once you're convinced it works
[16:17:46] I have been trying to stick with starting a `work/bd808/` branch for each set of changes whether I am working in the primary repo or a fork. It's just easier for my brain to stick with a pattern.
[16:18:44] Yep, that's better than just immediately mangling the main branch, which is what I did. I guess that only really makes sense if I were going to make a new fork for every patchset.
[16:23:41] dcaro: quick review? https://gerrit.wikimedia.org/r/c/operations/alerts/+/983156
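A hedged sketch of the topic-branch-and-force-push flow bd808 and andrewbogott discuss above; the branch name is illustrative, and it assumes (as speculated above) that protection on the main branch is what rejected the earlier force push:

```bash
# Illustrative only: iterate on one merge request by amending a topic branch,
# instead of opening a new MR per revision.
git switch -c work/bd808/disable-tool-tweaks   # topic branch, not main
# ...edit, git add, git commit, open the MR...
# after review feedback, fold the fixes into the same commit(s):
git commit --amend
# update the existing MR; --force-with-lease refuses to clobber unseen work
git push --force-with-lease origin work/bd808/disable-tool-tweaks
```

GitLab can then show reviewers the diff between successive pushed versions of the MR, which is the Gerrit-like behaviour mentioned above.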
[16:23:43] I usually use a branch, no matter if I'm on my clone or the original repo. And sometimes add more than one commit if it makes sense (e.g. a small refactor + a feature that uses it, or similar)
[16:30:01] ToolsDB crashed again. I'm restarting it.
[16:30:05] this time we should have logs
[16:32:47] \o/
[16:36:10] OIT says my new laptop should arrive just in time for me to take the end of year break. Looks like I might spend a good chunk of quiet week setting up a new work laptop for the first time in ~5 years. Moving to the M* series MacBooks feels like a good time to actually set up deliberately rather than just restoring my prior backup on new hardware.
[16:44:30] is anyone looking at the open toolforge quota request?
[16:47:47] I don't think so
[16:48:10] I'm not (and the weekly etherpad is empty?)
[16:51:01] * andrewbogott makes the etherpad
[17:46:37] andrewbogott: the disable-tool cron has started failing, on tools-sgecron-2 at least: https://phabricator.wikimedia.org/P54454
[17:53:01] argh
[17:53:05] thanks taavi, I'll look
[17:53:13] tox
[17:53:17] oops wrong window
[18:30:28] https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/merge_requests/4
[23:00:00] toolsdb will never let us rest
[23:00:57] how is it that I don't get paged until 20 minutes after it goes down?
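Circling back to the silence that was missing its trailing * (14:21–14:28 above): a hedged sketch of what a wildcard silence could look like from the CLI, assuming amtool is available and pointed at the right Alertmanager; the actual procedure here may go through a cookbook or the web UI instead, and the label name, URL, and duration are illustrative:

```bash
# Illustrative only: match every alert whose instance label starts with the
# host name (e.g. cloudvirt1063:9100), not just an exact "cloudvirt1063".
amtool silence add \
    --alertmanager.url=http://alertmanager.example.org:9093 \
    --author=dhinus \
    --comment="cloudvirt1063 down, DCops task open" \
    --duration=48h \
    'instance=~"cloudvirt1063.*"'
```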