[07:24:53] greetings
[07:51:50] I was taking another look at ceph, assuming sum(ceph_pool_objects) is indeed the sum of all objects, the spike the other day meant we asked ceph to shuffle around ~50% of the objects https://grafana.wikimedia.org/goto/24FpR8XNg?orgId=1
[07:54:18] in other words my explanation for what happened is "we asked ceph to do too much work all at the same time"
[08:18:06] Interesting, feels weird that adding 5% storage (cloudcephosd1042 is 28TB of 765TB currently) shuffles 50% of the objects around
[08:18:18] I guess that the next question is what's the bottleneck?
[08:18:53] as in, it should be able to just go shifting things around without breaking
[08:22:58] heh good question, not sure what the upstream recommendations are re: putting osds in service and how gradually
[08:28:43] questions currently in my mind: "how hard can we push OSDs into service without impact?" and then for us to decide "is that acceptable?"
[08:29:06] afaik it's kinda: add all the osds at once without letting the cluster rebalance (so it does the placement calculations), then turn on the data shifting and it will figure it out
[08:30:10] interesting, is that what undrain_node should do, or is supposed to?
[08:30:44] we have that more or less handled with the cookbooks, the issue comes in other scenarios, for example if a switch needs to be turned off for upgrading, it would take around a week to drain and around another week to undrain the rack (probably more nowadays)
[08:31:03] and if that switch goes down, that is a lot of trouble (happened already a few times)
[08:31:44] kinda yes
[08:32:39] these are some numbers we were looking into last year
[08:32:39] https://docs.google.com/document/d/1UtMK8ZLLfn1CFbcgBccBvzTlIuab244XAs98_gTuBlg/edit?tab=t.iylxhjq1hl14
[08:32:55] thank you, looking
[08:35:35] we decided to go for double switches on each rack + 25G network, but that will take some time to get there completely, that should alleviate considerably the issues with switches going down/rebooting
[08:36:24] so maybe that's a good tradeoff as we are (adding nodes slowly, agreeing to have downtime if a rack goes down, having HA at the switch level)
[08:36:41] did I get it right that a switch/rack going down means a ceph outage?
[08:36:56] yep, at least all the times that happened we had an outage
[08:37:03] (we are only in 4 racks)
[08:38:31] ok thank you, I'm currently flabbergasted at this fact, mostly wondering if that's expected by upstream
[08:39:06] morning!
[08:39:46] moring :)
[08:39:50] *morning
[08:40:28] greetings
[08:40:35] that == a rack going down means an outage
[08:40:45] godog: the setups I've seen before have redundancy at the network level, and higher network (using fiber and such), but they also had way more money xd
[08:41:27] we also discussed having _more_ racks, so a rack going down is, in percentage, a smaller impact on the cluster
[08:42:15] yep, some details in the doc
[08:42:22] heh, was the outage affecting all pools? i.e. both with r=2 and r=3?
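For reference, a minimal sketch of the "add everything first, then let the data move" approach described at 08:29 above, using the standard ceph CLI cluster flags. This is only an illustration, not necessarily what the undrain_node cookbook actually does, and osd.NNN plus the target weight are placeholders:

```sh
# Pause data movement while the new OSDs are brought in, so placement is
# recalculated once instead of rebalancing after every single OSD addition.
ceph osd set norebalance
ceph osd set nobackfill

# ... add / start the new OSDs here (cookbook, ceph-volume, etc.) ...

# Re-enable data movement and let the cluster converge.
ceph osd unset nobackfill
ceph osd unset norebalance

# Alternatively, ramp a single new OSD in gradually by raising its CRUSH
# weight in steps (osd.NNN and the weight value are placeholders):
ceph osd crush reweight osd.NNN 1.0
```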
[08:42:23] godog: my understanding (dcaro probably knows more) is that the upstream (ceph) expectation is that the cluster will be fine if a small percentage of hosts go down
[08:43:09] in theory even if a big percentage goes down, as long as there are no other issues, the cluster should be able to get going
[08:44:24] heh yeah exactly, conceptually with r=3 and data spread across racks then a rack going down should be zero / little impact
[08:44:26] godog: the outages became complete, for example, at some point the heartbeats between osds started failing, and osds started flagging each other as down, so the cluster started flagging all osds down, eventually taking the whole cluster down (that did not happen this time, allegedly because QoS now prioritizes the heartbeat traffic)
[08:45:19] another outage was when the switch started dropping jumbo frames, so heartbeats would work but traffic would not
[08:45:43] (we added a bunch of checks now for jumbo frames)
[08:46:02] so you have some hope that right now we could take out a rack without an outage? (having fixed the previous issues)
[08:46:17] there was another issue when the switch started misbehaving and dropping traffic (a bug) and needed upgrading
[08:46:27] not suddenly
[08:46:35] (from what we saw the other day)
[08:47:10] there might be some more hope though, might depend on exactly what it was that made things fail
[08:50:09] ack, thank you
[08:50:26] we have also some inbalance currently in the racks
[08:50:49] should get better soon though
[08:50:51] https://www.irccloud.com/pastebin/OEk2UGuC/
[08:51:15] *imbalance
[08:52:35] (the numbers in the second column are the TB on that rack)
[08:52:43] TB assigned, not used
[08:52:59] as in capacity
[08:56:13] * godog nods
[09:01:03] bbiab
[10:21:06] can I get a quick review for https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/64?
[10:21:36] sure, LGTM
[10:21:51] thanks
[10:22:17] it's creating an unrelated volume in toolsbeta?
[10:22:27] ["toolsbeta-harbor-2"] will be created
[10:23:07] uhh
[10:23:24] has someone touched that volume manually?
[10:23:33] I think it was deleted manually probably
[10:23:34] that plan is a week old, /me runs that again to be sure
[10:23:36] it was a test instance
[10:23:38] I think
[10:24:29] yeah, that volume does not exist
[10:25:11] I will merge this but not apply it, and then do a separate MR to drop that from the code
[10:25:22] I think you can also let tofu create it, and create a separate MR later
[10:25:34] whatever's easier
[10:26:11] https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/65
[10:27:45] +1d. pipeline failed though, maybe a network error
[10:28:57] * taavi retries
[10:38:14] * dcaro lunch
[10:47:50] that repeated itself enough times that I filed T403028
[10:47:51] T403028: toolforge tofu-provisioning: Cache terraform-provider-openstack binary somewhere - https://phabricator.wikimedia.org/T403028
[11:23:09] when someone has time, I have MRs to provision a new trixie toolsbeta bastion via tofu: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests
[11:54:25] I would, but I don't have enough context, sorry
[12:01:27] taavi: got some questions in a couple of the patches, feel free to merge if the answers are "it's ok/expected" :)
[12:02:10] FYI I changed https://grafana.wikimedia.org/d/7TjJENEWz/wmcs-ceph-eqiad-cluster-overview to use UTC not the browser timezone
[12:02:24] nice 👍
[12:02:36] dcaro: thanks! replied, will merge those a bit later
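As an aside to the per-rack capacity pastebin and the jumbo-frame checks mentioned in the morning discussion above, two quick manual checks, assuming stock ceph and Linux ping; `<peer-host>` is a placeholder:

```sh
# Shows SIZE / RAW USE / %USE aggregated per CRUSH bucket (root -> rack ->
# host -> osd), which is where the "TB assigned per rack" numbers come from.
ceph osd df tree

# A quick manual jumbo-frame check: 8972 = 9000-byte MTU minus 28 bytes of
# IP + ICMP headers, with fragmentation prohibited.
ping -M do -s 8972 -c 3 <peer-host>
```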
[12:25:36] dcaro: FWIW I'm checking the ceph-mgr logs and it looks like objects misplaced was ~16% max the other day, clearly sum(ceph_pool_objects) is another type of object or sth similar
[12:25:59] I was wondering, yep, if those objects are different things
[12:25:59] 16% seems indeed more realistic
[12:26:52] yeah I couldn't find a total of "objects" as intended by mgr
[12:29:40] feel free to start putting info in T403043
[12:29:41] T403043: [ceph] 2025-08-27 ceph outage when bringing in a big osd host all at once - https://phabricator.wikimedia.org/T403043
[12:30:19] ack, will do
[13:01:31] who was handling dumps lately? it's failing to sync kiwix (connection refused from the other end), so probably there's some firewall or similar blocking it
[13:01:55] dcaro: often it's just the upstream that goes down (at least it happened a few times)
[13:02:04] I think the alert is resolved now?
[13:02:14] I manually started it, but it failed again
[13:02:35] can you check if you can reach that host from your computer?
[13:02:53] ping works
[13:02:57] from both
[13:03:18] http works too, master.download.kiwix.org
[13:04:11] prometheus stopped firing 4 minutes ago, apparently
[13:04:33] I see a "resolved" message in #-cloud-feed
[13:05:04] I think it decided that as I manually started it, it was not in error anymore
[13:05:11] should trigger again in 15min or so I guess
[13:05:38] wait, it has not failed yet, it's just hanging, might be doing stuff
[13:06:01] I'll let it do its thing, see if it finishes ok
[13:10:24] okok, I think I'm done triaging tasks xd (my lima-kilo got rebuilt), time to test patches
[13:18:51] quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/269
[13:19:16] lima-kilo startup order (set up ldap before the toolforge components essentially, so you don't have to wait for a maintain-kubeusers run)
[13:26:36] dcaro: +1d
[13:27:08] thanks!
[13:49:38] puppet is failing in metricsinfra, looking
[13:50:08] `GPG error: http://mirrors.wikimedia.org/osbpo bookworm-dalmatian-backports-nochange InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 56056AB2FEE4EECB`
[13:50:10] hmm
[13:50:14] upstream issue?
[13:56:30] our side, though that's not what makes puppet fail, looking
[13:56:43] oh, the alert went away
[13:56:49] and puppet is passing when I run it manually :/
[13:57:26] the journal logs don't show any errors besides that warning message for the last 3 days at least :/
[13:57:35] dcaro: you can just remove those sources if you want, we don't install them on new VMs anymore. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1147166
[13:58:23] the ones that failed are not that one... xd, but metricsinfra-alertmanager-2
[13:58:29] andrewbogott: okok
[13:59:28] the failure was the git pull for prometheus-configurator, works now, probably a network blip (maybe on gitlab's side)
[14:02:34] topranks: want to come to network sync?
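Re the misplaced-objects question at 12:25 above: a rough way to cross-check the mgr's view against the Grafana panel, assuming the cluster is mid-rebalance (the misplaced line is only printed while recovery is ongoing):

```sh
# Misplaced ratio as the mgr reports it, only shown during recovery/rebalance,
# e.g. "NNN/MMM objects misplaced (16.0%)".
ceph -s | grep -i misplaced
ceph pg stat      # one-line PG / object summary

# The Grafana panel linked above sums logical objects per pool, which the chat
# suspects is a different kind of "object" than the misplaced counter uses:
#   sum(ceph_pool_objects)
```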