[08:30:21] morning :)
[08:30:30] and happy new year!
[08:57:12] hello, happy new year! :)
[09:12:56] morning! happy new year :)
[09:38:35] can I please get a +1 on T354060?
[09:38:36] T354060: Add DB quota for Wikiapiary - https://phabricator.wikimedia.org/T354060
[09:40:59] dhinus: done
[09:42:15] thanks :)
[10:11:59] tools-harbor-1 seems to be a bit flaky, I count 9 "instance down" alerts in the past week
[10:14:02] I tried SSHing and it did not work. I've just hard-rebooted it from horizon
[10:14:22] up again
[10:15:05] the UI was working for me
[10:15:29] hmm, maybe I should've tried that first :D
[10:15:44] (if ssh did not work, something was wrong anyhow)
[10:15:47] but there's definitely something going on, maybe resources?
[10:17:21] the OOM triggered, yep
[10:17:22] Jan 02 10:13:26 tools-harbor-1 kernel: Out of memory: Killed process 262197 (redis-server) total-vm:16464076kB, anon-rss:15347316kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:30156kB oom_score_adj:0
[10:18:20] :/, I thought that had improved, but it's back using almost 16G
[10:23:19] it's already using all the memory again, and ssh is failing, but virsh console works
[10:24:04] I think that it's restoring the redis DB from disk and filling up the memory
[10:24:15] (docker inside the VM, through docker-compose)
[10:24:19] load average 47
[10:25:13] it's swapping apparently, top shows "kswapd" at the top
[10:25:47] but "free" doesn't show any swap being used
[10:26:46] redis-server is using 93% of memory
[10:28:13] does it have swap?
[10:28:21] oh, yes that xd
[10:30:37] now even virsh console is completely stuck
[10:31:10] hmm, I think we have to stop harbor from going up right after boot, then bring redis up, and try to clean up there
[10:32:49] I'll stop poking at it and let you try that
[10:33:09] I can file a task in the meantime
[10:33:45] yes thanks
[10:34:17] (for the task, you can still poke around if you want :) )
[10:34:55] I will keep an eye on it but I won't make any changes so we don't overlap :)
[10:41:11] T354176
[10:41:12] T354176: [harbor] Redis using all available memory - https://phabricator.wikimedia.org/T354176
[10:42:22] 👍
[10:44:21] We might want to reconsider T344433
[10:44:21] T344433: [harbor] See if we can replace the per-project cleanup policy with a harbor-wide one - https://phabricator.wikimedia.org/T344433
[10:47:51] how do you connect with redis-cli? I get "connection refused"
[10:49:27] root@tools-harbor-1:/srv/ops/harbor# docker exec -ti redis bash
[10:49:36] and then `redis-cli`
[10:49:47] maybe directly running `redis-cli` might work too
[10:49:56] yep, should do
[10:50:18] redis-cli without docker gives me "Could not connect to Redis at 127.0.0.1:6379: Connection refused"
[10:50:39] but with docker it does work
[10:51:23] I wonder if this might help: "If maxmemory is not set Redis will keep allocating memory as it sees fit and thus it can (gradually) eat up all your free memory. Therefore it is generally advisable to configure some limits."
[10:51:25] https://redis.io/docs/management/optimization/memory-optimization/
[10:51:50] "maxmemory" seems to be 0 at the moment
[10:52:49] what's the behavior when it reaches the limit? Does it fail or remove older entries?
[10:53:06] would be nice if it did not just fail xd
[10:53:50] this is the issue on the harbor side: https://github.com/goharbor/harbor/issues/8537#issuecomment-523736046
[10:54:19] "It makes Redis return an out-of-memory error for write commands if and when it reaches the limit - which in turn may result in errors in the application but will not render the whole machine dead because of memory starvation."
[10:55:04] well, it's an improvement :)
[10:55:18] would we notice if that happens though, without extra alerts?
[10:56:05] yeah, I'm not sure if that makes things better or worse
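(A minimal sketch of what setting a limit could look like, run through the container as in the docker exec snippet above; the 12gb cap and the allkeys-lru policy are illustrative assumptions, not what was deployed:)

    # cap Redis memory at runtime; the value is an example only
    docker exec -ti redis redis-cli CONFIG SET maxmemory 12gb
    # the default policy (noeviction) returns OOM errors on writes once the cap is hit;
    # allkeys-lru evicts old keys instead - whether that is safe for Harbor's data is an open question
    docker exec -ti redis redis-cli CONFIG SET maxmemory-policy allkeys-lru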
[10:57:13] it's failing to connect to the db now it seems :/
[10:57:37] (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))"}]
[10:58:28] maybe there are some leftover connections timing out from the outage :/
[11:05:51] I can connect with redis-cli, INFO shows "connected_clients:22"
[11:06:15] and "maxclients:10000"
[11:06:40] ah, but the error is SQL, not Redis
[11:06:59] oh yes, sorry, sql :)
[11:07:32] Trove?
[11:08:14] you can try restarting the Trove database from horizon
[11:09:20] just restarted the db to flush the connections and the core service was able to start
[11:09:28] ssh to it, then docker restart
[11:09:33] why do we have two Trove instances, harbordb and tools-harbordb?
[11:11:06] tools-harbordb is for tools, the other is for toolsbeta (we created it without the prefix at the beginning, noticed later)
[11:11:26] we should recreate the toolsbeta one eventually
[11:11:52] ok! there doesn't seem to be a "rename" option unfortunately :/
[11:17:57] I added a cleanup log rotation of 1h (it's now possible)
[11:17:59] https://usercontent.irccloud-cdn.com/file/Zjdc30UB/image.png
[11:46:11] nice, do you think logs make up a big part of the Redis data?
[11:54:51] yep, every cleanup run for every project + every global cleanup creates a redis entry with the logs in it
[12:21:45] * dcaro lunch
[13:37:27] dcaro: are we (or should we be) planning on making it possible for tools to request a higher harbor quota, similar to how that is possible for other resources?
[13:40:17] I think we talked about that before the holidays; it doesn't seem it was mentioned in the last meeting
[13:41:23] iirc there were some patches to set the quotas on the projects already
[13:41:54] https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/22
[13:42:14] yeah, I remember we talked about it informally but decided not to do it for the time being
[13:42:33] asking because I'm revisiting that patch :)
[13:43:20] feel free to raise it in a task, but I think that until we have a use case we can delay deciding one way or the other
[13:43:22] should I at least leave that possibility open, as in making it possible to exclude a list of tools from a global quota change?
[13:44:17] I'd say no need, just avoid making the code really hard to modify later
[13:44:35] ok
[13:45:12] feel free to add a comment for later like "keeping this simple to allow easily adding per-project quotas if/when needed" or such
[15:29:17] is there a way to change the MR description in gitlab to be whatever the latest commit message is?
(aka again wanting gitlab to be gerrit 😭)
[15:31:45] you could hack some script I guess, but as MRs are meant to have more than one commit, having the MR<->commit mapping is not in the design
[15:35:52] to complicate things, MR descriptions are markdown-parsed :D
[15:36:15] T351253
[15:36:16] T351253: Add support for GitLab markdown linebreak requirement - https://phabricator.wikimedia.org/T351253
[15:39:06] writing this type of script is what silent week is for, right? ...right? xd
[15:39:29] xd
[15:39:55] dhinus: I think they were already markdown-parsed, that's why the `Bug:` lines get all scrambled up
[15:40:11] yes, they always were
[15:40:31] but it's confusing because by default they are copied from the commit message, which is NOT parsed
[15:41:02] so you write a commit message, you create an MR, and the same content is shown twice in gitlab, with two different parsings
[15:41:30] (at least that's my understanding, I lost the motivation to dig deeper :D)
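(For the "hack some script" route, a rough sketch against GitLab's documented REST endpoint PUT /projects/:id/merge_requests/:merge_request_iid; PROJECT_ID, MR_IID and GITLAB_TOKEN are placeholders for your project, MR, and credentials, and the script is untested:)

    #!/bin/bash
    # overwrite an MR description with the latest commit message on the current branch
    DESCRIPTION="$(git log -1 --pretty=%B)"
    curl --request PUT \
         --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
         --data-urlencode "description=${DESCRIPTION}" \
         "https://gitlab.wikimedia.org/api/v4/projects/${PROJECT_ID}/merge_requests/${MR_IID}"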
[15:42:00] dhinus, I just saw the upgrade on T353408, want me to put it back in service or are you already doing so?
[15:42:00] T353408: Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408
[15:43:00] andrewbogott: just seen the reply from jclark, yes let's put it back in service. I checked this morning and it must be added back to the hypervisor list, I'm not sure how to do it
[15:43:52] Oh, nova should take care of that automatically once all the services are up. I'll look.
[15:44:18] "openstack hypervisor list" doesn't show it
[15:44:29] but "openstack compute service list" does
[15:44:34] yeah, I just noticed the same.
[15:44:36] weird
[15:45:02] probably once something is scheduled it'll show up? I'll make a canary
[15:45:38] I tried with the cookbook but it fails because it first checks if the hypervisor exists :P
[15:45:51] but I haven't tried skipping the check
[15:46:53] I am seeing all the same weird things as you, just 30 seconds later
[15:46:59] haha
[15:47:14] Going to see what nova services think
[15:47:18] you'll probably stumble on this link next: https://stackoverflow.com/questions/57070101/openstack-compute-node-not-getting-listed-in-hypervisor-list
[15:49:24] hm, there is a command to add nodes to the nova cell. That should've been done ages ago but maybe it got removed...
[15:51:04] I'm stopping the service for a minute to see if nova-api even notices
[15:52:05] it does
[15:58:08] dhinus: ok, now my theory is that there's a record of this host someplace in the database with 'deleted=1' preventing discovery. Going to make some tea and then start digging in the database.
[16:12:22] hm, nope, it's not marked deleted and it is marked as mapped
[16:12:58] is it maybe related to how you removed that host in December? (I remember you did something to stop icinga from complaining about it)
[16:13:51] (in retrospect, it was easier to keep those alerts running, but I wanted to clear the alert board before the holidays :P)
[16:15:07] dhinus, yeah, that's why I thought it must have a negative record someplace
[16:15:15] anyway... fixed via 'nova-manage cell_v2 map_cell_and_hosts', a command I'm sure I have never run before.
[16:15:35] hahah, TIL
[16:16:09] do you want me to run the canary cookbook?
[16:16:10] I'm sure that's not needed for new hosts (that's nova-manage cell_v2 discover_hosts)
[16:16:22] So it must be a corner case I created in my attempt to silence icinga
[16:16:27] I'm running the canary cookbook
[16:16:32] ok!
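(For reference, the two cell_v2 commands discussed above, presumably run on a cloudcontrol host; discover_hosts is the normal path for new hypervisors, map_cell_and_hosts is what fixed this corner case:)

    # normal path when adding a brand-new hypervisor: map any unmapped compute hosts
    nova-manage cell_v2 discover_hosts --verbose
    # what ended up fixing cloudvirt1063: (re)create the cell<->host mapping
    nova-manage cell_v2 map_cell_and_hosts
    # confirm the host shows up again
    openstack hypervisor list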
[16:20:24] now we just wait and see if it overheats and dies again
[16:25:18] is it pooled?
[16:26:04] yeah, I pooled it because I can't think of a better way to detect if it's fixed
[16:26:58] SGTM
[16:27:27] the virtlogd systemd unit is still marked as failed
[16:27:41] actually, wmf_auto_restart_virtlogd.service
[16:30:02] it will probably clear on the next run in 21 hours
[16:31:34] I'm not sure I know how to search for and delete downtimes in alertmanager.
[16:31:38] poking around
[16:33:14] bell icon at the top -> browse
[16:34:35] Ah! Thanks, I was missing the browse tab
[16:35:07] I've triggered a run of the failed systemd unit, and that should clear the remaining alerts
[16:35:41] ok, deleted downtimes in icinga and alertmanager
[16:35:46] thanks!
[16:37:05] I think there are moaar downtimes in icinga :/
[16:37:09] for the individual services
[16:37:35] (I clicked "downtime" in the sidebar, then searched for cloudvirt1063)
[16:41:12] oh, hm
[16:42:07] I feel like I'm seeing the same downtime there that I just removed
[16:42:30] using the icinga UI makes me reconsider the karma UI :P
[16:43:09] I think those ones are still active because I can see them if I go to a single service, e.g. https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cloudvirt1063&service=SSH
[16:44:23] yep, my attempts at removing them are mostly not working, still clicking
[16:44:40] well, worst case they expire in 2 days :P
[16:45:44] one might think that 'remove downtime for all services' would do something...
[16:46:00] oh, it finally did! delayed reaction I guess
[16:46:05] does it look right to you now?
[16:46:25] * dhinus refreshes
[16:46:37] yep, all good!
[17:30:48] no alerts! \o/
[17:32:29] yay!
[17:35:03] * dcaro off
[17:35:04] cya tomorrow
[19:24:25] * bd808 lunch
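(A footnote on the failed wmf_auto_restart_virtlogd.service from 16:27: instead of waiting for the next scheduled run, the unit can usually be re-run and its failed state cleared by hand; a sketch, assuming root on cloudvirt1063:)

    # re-run the failed auto-restart unit and clear its 'failed' state
    systemctl start wmf_auto_restart_virtlogd.service
    systemctl reset-failed wmf_auto_restart_virtlogd.service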