[08:25:11] dcaro: good morning. I have been thinking
[08:25:45] what if the newer ceph nodes were directly put into service with a single NIC setup. Did you consider that? How big/bad would the impact of that be?
[08:56:28] FIRING: InstanceDown: Project tools instance tools-puppetserver-01 is down ??
[08:57:12] seems down for real, no ssh for me
[08:58:49] is back
[08:58:51] [Thu Jul 11 08:57:42 2024] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/puppetserver.service,task=java,pid=569,uid=104
[08:58:52] [Thu Jul 11 08:57:42 2024] Out of memory: Killed process 569 (java) total-vm:37571324kB, anon-rss:32352848kB, file-rss:0kB, shmem-rss:0kB, UID:104 pgtables:65848kB oom_score_adj:0
[08:59:02] the puppetserver service got oomkilled
[08:59:11] * dcaro in a meeting
[08:59:17] oh, that has happened before iirc
[08:59:26] since the upgrade
[08:59:42] did it restart by itself?
[09:00:12] yeah, got restarted
[09:00:58] 👍 well, maybe open a task and ping andrew there, iirc he did the last round of improvements in that regard, he might have some ideas on what to test next
[09:01:07] ok
[09:01:54] about the ceph nodes, the impact is the same as if they are changed now, the concern is not the setting up (hosts can be reimaged without issues), but the traffic load and such, especially with new hosts that have more data each
[09:02:18] (and we are already easily hitting the network saturation point when rebalancing)
[09:02:53] it would be better for example if we upgrade to 100G NICs, then one NIC would be better than two NICs now
[09:03:53] but if ceph tries to rebalance as fast as possible, we could also saturate the 100G interface
[09:05:57] T369797
[09:05:57] T369797: toolforge: puppetserver got OOMkilled - https://phabricator.wikimedia.org/T369797
[09:07:14] yep we could, but it would take 2.5 times the throughput xd
[09:08:49] my point is, a more powerful NIC may not prevent the saturation problem, therefore I believe 2 NICs also may not prevent it
[09:09:33] so maybe the conclusion here is we need to test the QoS bits that cathal mentioned yesterday :-P
[09:21:27] dcaro: I'm merging this change as previously agreed https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/407
[09:22:07] ack, just make sure to test it
[09:22:14] I did
[09:23:06] about the network, having 2.5 times the throughput I think is a big advantage, but yes, the QoS bits are also a good improvement
[09:23:25] none of them would prevent the cluster from crashing if the network gets saturated
[09:25:00] TIL: git now detects a commit -> revert -> revert as 'reapply' :-P
[09:25:05] https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/165
[11:30:46] blancadesal: hello!
[11:30:56] we are now in the toolsbeta k8s upgrade windonw
[11:31:01] window*
[11:31:01] arturo: we are indeed
[11:31:20] we should get this deployed first https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-admission/-/merge_requests/6
[11:31:34] arturo: dcaro and I are in the meet
[11:31:58] ok, joining
[14:00:44] I will be a bit late to the decision request meeting
[15:29:07] andrewbogott: are you playing with metricsinfra-db-1 ?
[15:29:26] (playing as in working with it to upgrade)
[15:29:29] I did yesterday, I'm not working on it today (so far).
[15:29:41] okok, so the db seems to be down, I'm looking
[15:29:43] I thought I left it in good shape, is it misbehaving?
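(A minimal triage sketch for a Trove guest like metricsinfra-db-1, assuming SSH access to the guest and that the database runs in a Docker container on it; the FQDN and container name below are assumptions by analogy with the controller hostname quoted later, not taken from the log:)

    # hostname is an assumption based on the project's naming scheme
    ssh metricsinfra-db-1.metricsinfra.eqiad1.wikimedia.cloud
    # find the database container, then tail only recent log lines;
    # a bare `docker logs -f` replays the whole history
    sudo docker ps --all
    sudo docker logs -f --tail 10 <database-container>
    # the guest OS itself was also reported as unhappy (see below), so check
    # the failing unit and the actual interface name while there
    sudo systemctl status network.service
    ip -brief addr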
[15:30:03] Hm, I'd start by clicking 'restart' on the database tab, want me to do that?
[15:30:47] (done)
[15:30:57] yes please
[15:31:21] the os seems to be complaining that network.service is failing, and it's trying to bring up eth0 (but the interface name is ens3)
[15:31:31] huh, that's interesting...
[15:31:54] 2021-10-06 8:55:43 187260 [Warning] Aborted connection 187260 to db: 'prometheusconfig' user: 'configuser' host: 'metricsinfra-controller-1.metricsinfra.eqiad1.wikimedia.clou' (Got timeout reading communication packets)
[15:31:56] from the db logs
[15:31:59] there's many of those
[15:32:19] it's now reporting as healthy, can you connect to it now?
[15:32:39] hmm curl: (6) Could not resolve host: metricsinfra-controller-1.metricsinfra.eqiad1.wikimedia.cloud
[15:32:44] ack looking
[15:33:10] looks good yep :)
[15:34:11] ok. My upgrade attempts yesterday were frustrating, the APIs lied to me and told me it was running Jammy when it is clearly still running the same Bionic image as before.
[15:34:18] Weird because when I make test cases they all upgrade properly
[15:34:18] I think that the logs I was seeing are just old
[15:34:35] So I need to be more creative in making test cases I guess
[15:35:04] logs look ok (if you don't pass `--tail 10` to `docker logs -f` it shows all of them, which seem to start a few years ago)
[15:35:43] andrewbogott: maybe it's hardcoded in the agent? that would not be nice
[15:35:50] yeah, I got caught with that yesterday, I walked away and came back later and it was still scrolling
[15:36:12] I don't think it's doing anything on purpose, I think it's simply not reimaging and saying that it is. (if I leave test files behind they aren't replaced either)
[15:36:24] probably it's crashing somewhere and the error isn't getting handled properly.
[15:36:56] (that giant logfile is further evidence that it didn't actually reimage)
[15:37:19] Oh, and that's not a Trove thing, I produced the same error (fake reimage) with nova laone
[15:37:22] *alone
[15:37:46] ooooh, interesting
[15:38:12] so for once it's not Trove being halfassed
[15:40:36] xd
[15:41:06] it seems that the db was down since yesterday:
[15:41:08] https://www.irccloud.com/pastebin/9ylrmtF3/
[15:41:14] dang
[15:41:24] trying to restart but failing to bind with that error :/
[15:41:29] weird
[15:41:30] it reported as Healthy but it must've changed its mind
[15:41:54] wait, it got healthy at some point, and then started failing again
[15:42:27] I tried to reimage several times yesterday which would have made it cycle through states each time
[15:42:42] so anything that's from ~18-20 hours ago is that
[15:43:55] the last time it was healthy was
[15:43:56] 2024-07-10 21:02:52 0 [Note] mysqld: ready for connections.
[15:44:20] then a minute later restarted and started failing
[15:44:20] 2024-07-10 21:06:57 0 [ERROR] Can't start server : Bind on unix socket: Permission denied
[15:44:29] (4 minutes actually)
[15:45:11] not sure if that helps, though we might want to add an alert for the grafana service
[15:45:19] (like the ui), even if it's just a warning
[15:45:22] this is "fun" T369840
[15:45:23] T369840: `toolforge build run ...` can fail due to docker.io image pull rate limits - https://phabricator.wikimedia.org/T369840
[15:46:20] what does it have to do with docker.io? :?
[15:46:25] the build this just failed on takes ~25 minutes and failed about 22 minutes in
[15:47:10] https://www.irccloud.com/pastebin/nz8o9iBx/
[15:47:16] :/
[15:47:32] can you paste a bit more of the logs?
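(Context for the rate-limit errors above and below: Docker Hub reports the anonymous pull quota in response headers, which is what the "ratelimit-remaining: 1;w=21600" value quoted later and https://docker-ratelimit.toolforge.org/ are surfacing. A minimal sketch of checking it by hand, using Docker's documented ratelimitpreview/test image and requiring curl plus jq:)

    # request an anonymous pull token for the rate-limit test repository
    TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
    # a HEAD on the manifest returns the current quota headers without,
    # per Docker's docs, counting against the limit
    curl -s --head -H "Authorization: Bearer $TOKEN" \
        "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
        | grep -i '^ratelimit'
    # typical output:
    #   ratelimit-limit: 100;w=21600
    #   ratelimit-remaining: 73;w=21600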
[15:48:01] dcaro: yeah, I will do that. I'm not sure it will help, but maybe
[15:48:39] at least it might give a better idea of which part of the process is getting stuck, I see it's the export, but will need to trace it a bit more
[15:50:40] bd808: thanks for poking around the internals of the build service :) (and all the toolforge bits)
[15:51:33] https://phabricator.wikimedia.org/T369840#9974150
[15:52:11] yw dcaro. I'm actually relieved to hear this work I've been doing is not annoying you :)
[15:52:38] ooohhh, it seems the analyzer process is pulling the runner image from docker for some reason
[15:52:39] [step-analyze] 2024-07-11T15:45:56.773197832Z ERROR: failed to initialize analyzer: getting run image: connect to repo store "heroku/heroku:22-cnb": GET https://index.docker.io/v2/heroku/heroku/manifests/22-cnb: TOOMANYREQUESTS: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
[15:57:19] now getting: [step-analyze] 2024-07-11T15:52:45.925041870Z ERROR: failed to initialize analyzer: getting run image: connect to repo store "heroku/heroku:22-cnb": GET https://index.docker.io/v2/heroku/heroku/manifests/22-cnb: TOOMANYREQUESTS
[15:57:30] ah. that's the same
[15:57:46] yep, the logs get a bit scrambled because those pods are started in parallel
[15:57:51] so yeah, seems like there should be a local cache somewhere that is getting skipped
[16:01:40] * arturo offline
[16:02:17] * bd808 will watch https://docker-ratelimit.toolforge.org/ to see when he can try to build again
[16:03:28] ack, I'm trying to figure out where/why we are pulling anything from docker.io, but might take a while
[16:03:48] #hugops dcaro :)
[16:03:52] and why the registry admission allows that in the first place
[16:04:10] wait, that may be some case of docker-in-the-pod
[16:08:03] it's the analyzer (lifecycle binary) pulling info from docker by itself
[16:23:05] tcpdump on lima-kilo also shows that it reaches out to index.docker.io in the analyze step
[16:23:21] and also later :/
[16:23:28] That was the first failure I noticed
[16:23:40] probably yes, the export step
[16:35:17] hmm... I suspect that the runner override that we do is not being picked up, and it's pulling the one from heroku
[16:36:20] even though it's there
[16:36:21] https://www.irccloud.com/pastebin/LdSPQHDb/
[17:07:41] I think I might have a fix
[17:40:39] okok, merged the fix, will deploy and test soon
[17:43:27] I started a build before your merge and am now crossing my fingers that "ratelimit-remaining: 1;w=21600" will not be eaten up before it finishes :)
[17:44:12] xd
[17:49:00] build success!
[17:49:52] \o/
[17:49:56] just deployed the fix too
[17:51:02] (testing it)
[17:51:08] build passed
[17:59:09] * dcaro off
[17:59:26] andrewbogott: feel free to page me if anything blows up again and you need help xd
[18:01:48] ok!
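(The fix that was merged and deployed above is not shown in the log. As a purely illustrative sketch of the general idea, pinning the buildpacks run image to a local registry copy so the analyze/export steps never resolve heroku/heroku:22-cnb through index.docker.io, the standalone pack CLI exposes this directly; the registry host and image names below are made-up examples, not the build service's actual configuration:)

    # purely illustrative: build with an explicit run image from a local
    # registry mirror instead of letting the lifecycle default to docker.io
    pack build my-tool-image \
        --builder registry.example.local/heroku-builder:22 \
        --run-image registry.example.local/heroku:22-cnb \
        --path .

This matches the suspicion at [16:35:17] that the runner override was not being picked up, leaving the lifecycle to fall back to the upstream heroku image on docker.io.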