[01:09:00] * bd808 off
[04:22:04] * dhinus paged: ToolsToolsDBWritableState
[04:22:34] restarting mariadb
[04:23:31] it's taking a long time to restart
[04:25:01] the log says "mysqld: Aria engine: starting recovery", which is expected but usually faster
[04:28:24] "mysqld: Aria engine: recovery done", it took 5 minutes
[04:29:09] setting mariadb to read-write
[08:49:06] morning
[09:21:55] it seems we are already gathering the smart statistics from the cloudceph hosts (something might have gotten fixed late november, there's no data from before)
[09:22:21] cool!
[09:22:37] and there's a bunch of small increases
[09:22:51] not cool :/
[09:22:56] I've put it here https://grafana-rw.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1
[09:23:02] it also affects cloudrabbit, and the cloudnets
[09:24:57] I'm trying to visualize other smart data to see if anything else pops out
[09:56:31] * dhinus paged ToolsToolsDBWritableState
[09:56:44] My laptop is rebooting for an upgrade
[09:56:59] dcaro can you restart mariadb?
[09:57:09] sure
[09:57:16] thanks
[09:57:19] just the service?
[09:57:22] Yes
[09:57:45] it's up already
[09:58:00] Sometimes systemctl restarts it automatically
[09:58:09] then you just need to set it to rw
[09:58:13] done
[09:58:27] it had to repair the heartbeat table it seems
[09:58:34] thanks. I'll look at the logs as soon as my laptop reboots
[09:58:50] there's always one or more repaired tables after an OOM
[09:59:19] ack
[09:59:49] I have a patch that might help, if you can review it
[10:00:37] https://gerrit.wikimedia.org/r/c/operations/puppet/+/983221/
[10:16:28] maybe one hour is not too long, but looks ok
[10:18:11] +1d, we can tweak the timeout later
[10:20:00] thanks, merging now
[10:54:56] the toolsdb patch is now applied, I spotted a silly mistake: the timeout is correct (3600) but the slow query log is logging queries longer than 30 seconds rather than 30 minutes :P
[10:55:11] xd
[10:55:39] and I forgot to make the db read-write after restarting, fixing now
[10:55:45] that's why it's so useful to add the unit to the config var name xd
[10:56:22] yes!
[10:59:03] https://gerrit.wikimedia.org/r/c/operations/puppet/+/983368/
[11:54:52] If I wanted to run something in a container, which cannot touch production, is there a convenient way to do that in OpenStack? There's something regarding building a cluster in Horizon, but that's honestly more than I wanted. Basically I just want to be able to spin up three or four containers in the same environment as a VM
[12:10:58] We don't have a "container as a service", I think the closest would be spinning up a VM, installing docker/podman/whatever cr and then spinning up the containers there
[12:11:27] otherwise if you have just code, you can try toolforge (python/java/ruby/php/dotnet/...)
[12:17:37] this is awesome! https://pypi.org/project/prometheus-pandas/ Yuvi is a contributor too xd
[12:18:22] I tried to apply the s/30/1800/ toolsdb fix dynamically, but clients that are already connected are retaining the old value, and so too many queries are logged (including those 30 ack
[12:20:22] restarted and set to rw
[12:40:03] dcaro: Thanks, I didn't think so, but it's easy to miss a service :-)
[12:41:37] yw :)
[12:41:39] * dcaro lunch
[14:07:51] komla: Sorry, was sleeping, did you try the scripts?
[14:18:12] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/40 (fixing the bump_version script for versions >0.0.9... they can have two digits xd)
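
The behaviour dhinus describes at 10:54 and 12:18 follows from how MariaDB scopes long_query_time: each session copies the global value when it connects, so SET GLOBAL only affects new connections, and everything already connected keeps logging at the old 30-second threshold until it reconnects. A minimal Python sketch of that behaviour, assuming pymysql and placeholder connection details (the real ToolsDB host, credentials, and the exact Puppet setting name are not taken from this log):

    import pymysql

    # Placeholder connection details; the real ToolsDB host and credentials are
    # not part of this log, and SET GLOBAL needs a privileged account.
    conn = pymysql.connect(host="toolsdb.example", user="admin", password="...")

    with conn.cursor() as cur:
        # Raise the slow-query threshold from 30 seconds to 30 minutes globally.
        cur.execute("SET GLOBAL long_query_time = 1800")

        # New connections will pick up the new global value...
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'long_query_time'")
        print(cur.fetchone())   # ('long_query_time', '1800.000000')

        # ...but this (and every other already-open) session keeps the value it
        # copied at connect time, so it still logs queries slower than 30s.
        cur.execute("SHOW SESSION VARIABLES LIKE 'long_query_time'")
        print(cur.fetchone())   # ('long_query_time', '30.000000')

    conn.close()
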
[14:21:00] added a comment
[14:21:09] andrewbogott: yeah, I sent the message late.
[14:21:40] dcaro: I had the same idea as taavi, but I'm testing it and I'm not sure it works
[14:21:46] andrewbogott: where's the script?
[14:23:46] https://www.irccloud.com/pastebin/aaJdqoqC/
[14:24:56] komla: can you try that on your sacrificial tools and confirm that things actually stop (and that restarts are blocked)?
[14:28:19] taavi: dhinus just changed it
[14:28:47] thanks
[14:28:52] * dhinus always takes 5 seconds to find where the Approve button is :D
[14:31:08] andrewbogott: I'm on it
[15:34:57] quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/42
[15:36:38] lgtm, approved
[15:37:42] thanks :)
[15:38:32] We're trying to get komla logged in to tools-sgegrid-master. It fails but, weirdly, he can access tools-sgecron-2 just fine. Is there any reason there would be a difference between the two?
[15:39:03] * taavi looks
[15:40:13] he's a member of the 'admin' tool which I thought was all we needed for this
[15:41:58] the grid master does not accept host-based authentication, so sshing to a toolforge bastion and logging in from there won't work. komla: try running the ssh command from your local machine
[15:47:44] I think I'm going to call it a day
[15:48:10] So long dcaro, have a good break
[15:48:24] happy new year everyone, see you on the other side :)
[15:48:31] Oh yeah, komla, to be clear that proxy command I pm'd you should go on your local laptop .ssh/config
[15:48:39] enjoy your break, see you next year!
[15:49:12] komla: I didn't even think to ask that -- everything should be happening directly from your laptop, it's never a good idea to ssh between VMs. (Maybe you're already doing it from your laptop, i'm not sure)
[15:49:48] andrewbogott: yeah, my configs are in ~/.ssh/config on my laptop
[15:50:20] ok, so ProxyCommand is probably what you need then
[15:55:06] i'm able to get into bastion.wmcloud.org. from there, i get a hostname resolution error for tools-sgegrid-master
[16:01:59] which exact command, which exact error?
[16:08:05] komla: now that you can get into the tools-sgegrid-master can you also reach the cron host via the same command?
[16:10:18] that was always working, i could reach the cron host via the toolforge bastion. now let me test it with this too
[16:11:08] Right, I'm trying to train you to use ProxyCommand instead :)
[16:13:03] yes :)
[16:13:11] It is successful
[16:13:15] thanks!
[16:14:28] great!
[16:19:50] the toolsdb memory situation is not looking good: https://grafana.wmcloud.org/goto/Y0yPY64Sk?orgId=1
[16:20:15] since I enabled the query timeout only 1 query was killed (again on the "persondata" tool), but that didn't seem to free any memory
[16:21:26] free memory is slowly going towards zero, and there is only one long-running query at the moment
[16:22:00] I'm reading more stuff online, one thing I can try is enabling performance_schema which will give us some more debugging info
[16:23:29] free memory on the system, or just within mariadb?
[16:23:38] oh that's the instance
[16:24:03] yes
[16:24:12] dhinus: I think you should bring this up in data-persistence, it seems like something must be happening that they haven't seen before.
[16:24:31] I discussed it with manuel in October... I might ping him again
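
On the performance_schema idea from 16:22 — if the server is new enough to have the memory summary tables (MariaDB 10.5 and later) and the memory/% instruments are switched on, they break current allocations down by internal event, which is one way to see where the leaked memory sits. A rough sketch of how that could be queried, with placeholder connection details and no guarantee that the ToolsDB build actually exposes these tables:

    import pymysql

    # Placeholder connection details; whether ToolsDB exposes these tables at all
    # depends on the MariaDB version and on the memory/% instruments being enabled.
    conn = pymysql.connect(host="toolsdb.example", user="admin", password="...")

    with conn.cursor() as cur:
        # Top ten internal allocators by bytes currently held.
        cur.execute(
            "SELECT event_name, current_number_of_bytes_used "
            "FROM performance_schema.memory_summary_global_by_event_name "
            "ORDER BY current_number_of_bytes_used DESC "
            "LIMIT 10"
        )
        for event_name, bytes_used in cur.fetchall():
            print(f"{bytes_used / 2**20:10.1f} MiB  {event_name}")

    conn.close()
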
[16:24:35] I guess automatically restarting every day won't actually help will it? Since sometimes it ooms in just a few hours
[16:24:47] yes it seems it ooms in about 5 hours
[16:24:52] at the current rate
[16:24:53] Yeah, I think that he should be looped in basically every time this happens.
[16:25:39] We can build a host with more RAM but it seems like that will just delay the issue by a few hours
[16:27:57] 64G seems plenty
[16:28:12] and yes, it seems there is no upper bound to memory usage :D
[16:28:29] I'm adding some details to T353093 and looping manuel in
[16:28:29] T353093: [toolsdb] MariaDB process is killed by OOM killer (December 2023) - https://phabricator.wikimedia.org/T353093
[17:03:35] dhinus: is tools-db-1 intentionally read-only?
[17:03:42] ah sorry
[17:03:57] fixed
[17:04:18] I pre-emptively rebooted it because it was about to OOM
[17:04:26] but forgot to make it rw :/
[17:06:53] I enabled performance_schema as I was rebooting, and I'm trying to find if there's any useful info in there
[17:26:59] dhinus: you might want to reply to wikitech-l
[17:27:09] RhinosF1: thanks
[17:33:04] repliede
[17:33:07] *replied
[17:33:43] I'm not finding any useful information in performance_schema unfortunately :/
[17:34:53] :)
[17:35:05] Have a good festive break whenever you finish
[17:35:10] thanks :)
[17:56:18] I'm taking a break but I expect ToolsDB to crash again in a few hours if it keeps on following the same pattern :/
[17:57:21] I don't have other ideas at the moment other than restarting it pre-emptively when the free memory goes below 5G
[17:57:44] https://grafana.wmcloud.org/goto/Y0yPY64Sk?orgId=1
[17:57:50] I'll keep an eye on it too
[17:57:57] thanks
[17:57:59] Ideally it would page us /before/ it goes down I guess
[17:58:24] we could easily add an alert when it's at e.g. 5G left, but that would leave us only 30 mins to restart it
[17:59:26] I'm too tired to do it now... I might have a go tomorrow
[19:35:39] bd808: thank you, as always, for providing so much user support!
[20:18:35] what does it mean when qdel says 'is already in deletion' but the job shows no sign of actually exiting?
[20:18:53] it means that the grid engine has lost track of the job
[20:19:16] is there any way for me to actually stop it? -f doesn't seem to help
[20:19:18] try `qdel -f ` as the user, and if that does not help as root on the grid master
[20:19:28] oh, haven't tried that last
[20:23:03] Hm, I'm setting quotas for tools and then after a few minutes the quota I set just goes away and the job starts up again
[20:23:19] Is there a reconciliation loop running someplace outside of the grid that maintains grid quotas?
[20:25:01] which tool? languageproofing-ui?
[20:25:18] yes
[20:25:20] Dec 15 20:22:11 tools-sgegrid-master disable_tool.py[15573]: root@tools-sgegrid-master.tools.eqiad1.wikimedia.cloud removed "languageproofing-ui_disable" from resource quota set list
[20:25:40] I wonder if it's my disable-tool cron that's restoring it
[20:25:44] yes, it is
[20:25:49] Well, that's poetic
[20:26:01] I will fix that :)
[20:26:12] But first, gotta restart toolsdb again!
[20:26:53] ouch
[20:27:43] it's going down several times per day now
[20:39:41] taavi: https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/merge_requests/5
[20:40:24] looking
[20:41:59] added a comment
[20:47:44] I filed tasks under https://phabricator.wikimedia.org/T353551 for the worst-offending tools on toolsdb idle connections
[20:49:49] thanks! I responded to your comment
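
Connecting the prometheus-pandas link from 12:17 with the "alert when 5G is left" idea from 17:58: if the VM's node-exporter metrics land in a Prometheus that is reachable from a notebook or script, the free-memory series can be pulled into pandas and checked against that threshold. The URL, instance label, metric choice, and the query_range interface (taken from the project's README) are all assumptions here, not details confirmed in this channel:

    from prometheus_pandas import query

    # Illustrative endpoint and selector; the real cloud-vps Prometheus URL and
    # the instance label for the ToolsDB VM are guesses.
    prom = query.Prometheus("http://prometheus.example.wmcloud.org")

    # Free memory on the database VM over the day, sampled every 5 minutes.
    mem = prom.query_range(
        'node_memory_MemAvailable_bytes{instance=~"tools-db-1.*"}',
        "2023-12-15T00:00:00Z",
        "2023-12-15T21:00:00Z",
        "5m",
    )

    threshold = 5 * 2**30  # the 5 GiB "restart pre-emptively" line from the chat
    below = mem[(mem < threshold).any(axis=1)]
    print(below.head())    # timestamps where any matching series dropped under 5 GiB
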
[20:57:44] komla: ok, the stop/disable scripts are mostly working now.
[21:00:30] You can go ahead and start shutting down the first batch of tools, at your convenience. Two points: 1) do please spot-check to make sure things are actually getting killed, and 2) don't count on 'webservice status' or 'qstat' to be reliable about that, since in some cases the tool will /try/ to start things but they won't ever get scheduled (so e.g. qstat will show a job but it won't be in a queue.)
[21:14:28] andrewbogott: <3 My brain is still pretty sure that I should do the job I signed up for in 2016 of supporting Toolforge users. It is like a compulsion.
[21:15:06] You definitely don't /have/ to do it but it's much appreciated.
[22:52:49] * taavi files T353566