[00:02:21] looks like the time-to-oom for toolsdb is getting even worse, so I expect a couple more crashes during the night.
[00:03:18] I'm on call from 3 UTC and I'll handle the restarts after that time
[00:04:17] I'm tempted to try a much stricter max_statement_time, like 5 minutes instead of 60 minutes, but I'm not sure that will help
[00:04:51] I saw the tasks created by t.aavi and that makes me think we can also lower the max connections per user
[00:07:24] I do agree with a.ndrew that we should upgrade asap but I'd also like to find a temporary fix for 10.4 first
[00:20:47] I'm pre-emptively restarting toolsdb now, the time-to-oom after a restart is around 4 hours on the last two restarts
[00:22:00] restarted and set to rw
[01:26:03] * bd808 was never here but is now also off
[01:40:16] I am here to see if it needs a restart but you beat me to it
[05:34:31] I restarted it, yet again, hoping that Francesco is awake for the next one
[08:17:06] I've restarted toolsdb once more, and I'm testing using jemalloc as suggested at https://stackoverflow.com/a/60488432/1765344
[08:40:08] I think jemalloc did the trick! https://grafana.wmcloud.org/goto/Y0yPY64Sk?orgId=1
[18:25:37] andrewbogott: webservicemonitor has filled the lighttpd grid queue with jobs for languageproofing-ui, which is making the grid dashboards useless :/ I think that needs to be fixed before blocking any more tools from the grid
[18:25:46] see for example https://grafana.wmcloud.org/d/zyM2etJ4k/toolforge-grid-deprecation?orgId=1&viewPanel=8
[20:48:00] bd808: did you already rebuild the containers after merging https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/+/983520/?
[20:50:02] taavi: no, I haven't yet. I was waiting for zuul and then I got distracted.
[20:52:39] * bd808 will build them now
[20:58:32] grrrr... "W: GPG error: http://apt.wikimedia.org/wikimedia bookworm-wikimedia InRelease: At least one invalid signature was encountered."
[20:58:58] * bd808 retries in case this is somehow transient
[21:00:32] sounds like out-of-date base image?
[21:05:45] taavi: that seems important, I'll try to look at that soon
[21:05:57] bd808: is that happening on a fresh VM? Or in a container?
[21:06:44] andrewbogott: it is an error rebuilding our docker containers. and it usually is a stale image like taavi noted.
[21:07:20] ok! I rotated our bookworm glance image last week, wanted to make sure it wasn't that.
[21:07:53] dhinus: that's awesome! The free memory graph is now very different
[21:07:59] I think `docker pull docker-registry.wikimedia.org/bookworm:latest` will fix it for me...
[21:08:44] I should probably add a pull like that to the rebuild script for each base image
[21:13:46] ugh. still failing in the same way. I suppose that means that docker-registry.wikimedia.org/bookworm:latest actually needs to be updated?
[21:17:05] * bd808 tries with :20231210
[21:29:25] what command is it trying to run?
[21:33:27] taavi: basically `apt-get update` is failing because of signatures. I think I may have figured it out (I think :latest tags are not being updated), but now I'm also hitting disk space issues on the builder.
[21:33:53] * bd808 is running `docker image prune -a` to see if that is sufficient
[21:34:54] "Total reclaimed space: 38.07GB" -- that should help for a bit
[21:43:06] things seem to actually be building now. I haven't tried to pull the metadata to prove this, but I think that docker-registry.wikimedia.org/bookworm:latest is actually older than docker-registry.wikimedia.org/bookworm:20231210. I have some vague memory of this happening before.
[21:44:39] my vague memory is also that some SRE told me that :latest is the devil and that we should always use specific tags.
[21:45:19] * bd808 mumbles about things that "just work" for years and then rot into garbage without warning
[21:55:21] taavi, here's another one. This one is just doing what Bryan suggested yesterday: https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/merge_requests/7
[22:03:19] commented
[22:04:07] taavi, what do you think about
[22:04:09] https://www.irccloud.com/pastebin/XlCxaXOG/
[22:04:48] why not just parse it in a try-except block?
[22:05:12] that file is yaml?
[22:05:26] yes
[22:43:33] taavi: images are rebuilt and pushed (finally)
[22:45:26] I ended up using the :20231210 tag for all 3 base images (buster,bullseye,bookworm)
[23:10:59] * bd808 wanders off
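
The idea raised at [21:08:44] and [21:44:39], pulling each base image by a dated tag before a rebuild instead of relying on :latest, might look roughly like the sketch below. This is not the actual rebuild script. The function names are hypothetical, and the `buster` and `bullseye` image paths are assumed to follow the same pattern as the `docker-registry.wikimedia.org/bookworm` path quoted in the log.

```python
import subprocess

REGISTRY = "docker-registry.wikimedia.org"  # registry host quoted in the log
# Assumption: each base image is named after its distro, as bookworm is.
BASE_IMAGES = ("buster", "bullseye", "bookworm")


def pull_commands(tag, images=BASE_IMAGES):
    """Return one `docker pull` argv per base image, pinned to `tag`."""
    if not tag or tag == "latest":
        raise ValueError("use a dated tag such as 20231210, not :latest")
    return [["docker", "pull", f"{REGISTRY}/{name}:{tag}"] for name in images]


def pull_base_images(tag, run=subprocess.run):
    """Pull every pinned base image, stopping at the first failure."""
    for argv in pull_commands(tag):
        # check=True makes a failed pull raise instead of being ignored.
        run(argv, check=True)


if __name__ == "__main__":
    # Dry run: print the commands rather than executing them.
    for argv in pull_commands("20231210"):
        print(" ".join(argv))
```

Rejecting `latest` outright is a design choice of this sketch, reflecting the suspicion in the log that the :latest tag was older than :20231210. The `run` parameter exists only so the pulls can be exercised without a Docker daemon.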