[09:29:29] dhinus: are you doing anything on toolsdb? the primary instance just went down
[09:29:35] nope
[09:29:42] let me look
[09:30:18] I can ssh but mariadb is down
[09:30:33] oom-kill
[09:30:55] trying systemctl start mariadb
[09:31:07] seems to have worked, at least for now
[09:31:16] I can connect
[09:31:54] yeah, that's odd
[09:32:42] I think that it failed a few months ago with a similar issue, but maybe not exactly the same?
[09:32:55] I was on holiday that time, let me find the task
[09:33:14] that alert should probably be paging? the icinga check was before it was removed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/956071
[09:33:41] yeah I think it should
[09:34:07] * taavi makes it
[09:35:37] I'm always confused by the multiple levels we have (crit, critical, page)
[09:35:57] crit/critical and warn/warning is confusion I've been meaning to fix
[09:36:22] but is "page" higher than "critical"? I would expect "critical" to page, but then I remember it doesn't :)
[09:36:29] or maybe it does? :D
[09:38:09] ah I need to re-enable write in mariadb, I always forget
[09:39:31] done
[09:39:41] and the alert is gone
[09:54:21] the previous occurrence was T344298
[09:54:22] T344298: mysqld killed by oomkiller on tools-db-1.tools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T344298
[09:54:42] I will create a new phab to track the new crash today
[09:59:19] Quarry is having some issues, I intermittently get a 500
[09:59:30] (reported in the wikimedia hackathon Telegram channel)
[10:01:06] openstack-browser is showing an error next to "quarry-db-02" https://openstack-browser.toolforge.org/project/quarry
[10:03:35] ok it's a trove db and I see an error in Horizon/database
[10:05:04] neat to see my code to highlight non-normal states in openstack-browser to be useful :-)
[10:05:04] trying "restart instance" from Horizon/database
[10:05:24] taavi: I was wondering how that error message worked :P
[10:07:53] the link is broken though, because it assumes it's a Nova instance but it's a Trove instance
[10:08:06] (the link from "quarry-db-02" in openstack-browser)
[10:10:29] hm where do you see that? it correctly links to /project/quarry/database/quarry-db-02 for me
[10:11:40] oh, now it's working, I clicked on it before and it led me to a "not found" page
[10:12:27] that's the horrible openstack api error handler that I've been meaning to fix for ages :P
[10:12:56] hahaha
[10:13:20] in the meantime, I sshed to the Trove instance because "restart" did not work
[10:14:09] docker logs shows some warning, but nothing critical
[10:15:16] the container is up
[10:18:31] I have no idea where the "error" shown in Horizon is coming from
[10:18:53] "openstack database instance show" does not give more details
[10:20:43] I will try rebooting the Trove VM...
[10:25:39] ok after the reboot mariadb is failing to start
[10:26:19] "Can't start server : Bind on unix socket: Permission denied"
[10:27:19] hmm
[10:28:35] this seems a trove issue, but I'm not sure how to fix it
[10:29:08] I agree, seems to be trove
[10:30:22] Maybe it will respond to a reboot request from my user...
[10:31:04] a different error at least
[10:31:26] I think we're back
[10:31:33] yes the error I see is only a warning
[10:31:43] DB is back up, and restarted the quarry service, seems to do queries now
[10:31:49] nice one, thanks
[10:32:01] Oh, thank you for working on it
[11:30:47] dhinus: page severity is the one that will page us from metricsinfra itself
[11:30:52] (added last week)
[11:32:59] dhinus: I've seen that issue before (permission denied when starting the container in trove), iirc the issue is on the trove side, that it does not set the right permissions on the socket on cold start (when you turn off the VM and turn it on again)
[11:35:20] Rook: btw. I sent an update to https://gerrit.wikimedia.org/r/c/operations/puppet/+/965514 to allow you to run puppet again on quarry web hosts without messing up the git remote, there were already a few changes pending as it has been disabled for a bit now
[11:36:08] Oh thanks, it's fine. Sorry I meant to abandon that
[11:42:22] do you plan on keeping puppet disabled on the quarry VMs for long? (I highly recommend against it)
[11:47:08] Not really. I'm hoping to move it to k8s. At which point the associated VMs won't be needed
[11:48:43] if that's going to take more than a week, I suggest fixing puppet minimally (like the above patch), so they still get basic security and similar while you deprecate them
[11:54:32] If it is a problem puppet can be re-enabled. The git repository will need to be updated manually each time it is worked on, but that's alright
[11:55:29] what is the git remote it should have?
[11:55:45] (the above patch sets it to the github repo)
[11:55:53] https://github.com/toolforge/quarry.git
[11:56:01] that's what the above patch does
[12:00:50] I guess merge it? I'm not sure, I find the puppet repo frightening. People get upset with me when I touch it
[12:07:56] Well, a good way of getting over fear is facing it :), I don't think anyone will get upset with this patch, so feel free to +2, merge and don't forget to run 'sudo puppet-merge' on the puppetmaster1001
[12:09:01] for patches that modify more base shared resources, we have to be a bit more careful as other machines might reuse them (pcc helps a lot there), but the toolforge-specific ones are only used by us
[12:09:21] let me know if you want more help to merge or test or similar
[12:25:54] * dcaro off