[03:55:21] In case there are more after-effects later, here's what just happened:
[03:55:21] - I added a new ceph node, cloudcephosd1042, with the cookbook. This went haywire because (due to a race condition in puppet) 1042 was running an old version of ceph, v14 (most of the cluster is running v16)
[03:55:21] - Somehow when the v14 client tried to talk to the one v17 client (on cloudcephosd1004) it caused the OSDs on cloudcephosd1004 to crash
[03:55:21] - Again, 'somehow' that crash didn't just cause a rebalance, but caused a bunch of pgs to go read-only. Very weird behavior for only one node going down, but it happened.
[03:55:21] - That meant that for a few minutes ceph misbehaved badly enough that some VMs froze, and a lot of toolforge jobs flapped
[03:55:22] - As soon as I switched off 1042 and 1004, everything got better
[03:55:22] - I restarted some unhappy nfs worker nodes just in case (although I suspect they would've recovered on their own anyway)
[03:55:57] I'll have another go tomorrow now that I know to double-check the ceph version on new osds.
[03:56:42] Oh, also, when I reimaged ceph OSDs a lot of them raised kernel errors, which I'm confident is unrelated and just a side-effect of reimaging and rebooting.
[06:44:24] ack, good luck
[06:44:28] and greetings
[06:47:57] as a heads up, I'll be off this afternoon and tomorrow all day
[06:56:54] I see a bunch of alerts for nfs workers, I take it the "fix" is to restart them via cookbook?
[07:13:40] yep, or to run the cookbook to reboot all the nfs workers, which is usually the way to go after major ceph/nfs blips
[07:18:41] ok, that'd be wmcs.toolforge.k8s.reboot + options
[07:18:58] yeah, I don't remember the name off-hand but there's some flag to do all the nfs workers
[07:19:10] cheers
[07:19:39] and it'll take a while to reboot all the workers, but that's fine
[07:20:56] ok I'm taking a few minutes to poke things around, then will start the reboot
[07:21:01] the reboots even
[07:28:44] root@cloudcumin1001:~# cookbook wmcs.toolforge.k8s.reboot --cluster-name tools --all-nfs-workers
[07:28:47] FTR
[07:55:33] morning
[07:57:30] we can add a version check to the ceph cookbook to avoid bootstrapping a node if it has the wrong ceph version running
[07:57:56] (means that we will have to keep track of that version somewhere in the cookbooks)
[08:03:37] hmm.... the bootstrap_and_add cookbook already runs puppet, reboots and runs puppet again, that should have pulled in the newer packages before starting the osds
[08:04:28] puppet ordering issue maybe?
[08:05:59] ohhh, I think that we don't upgrade the packages from puppet, so it keeps the ones it installed before having the thirdparty repo configured, let me check
[08:06:52] in which case you could pull the expected ceph version from hiera instead of having to hardcode that in the cookbook
[08:08:31] yep, we don't enforce any version
[08:10:08] that breaks the upgrade process though, as it's done with cookbooks, not puppet
[08:18:15] on 1004's side, it killed itself as it detected a very old version for the peer: `Aug 21 03:33:52 cloudcephosd1004 ceph-osd[7869]: ceph-osd: ./src/osd/PeeringState.cc:1255: bool PeeringState::check_prior_readable_down_osds(const OSDMapRef&): Assertion `HAVE_FEATURE(upacting_features, SERVER_OCTOPUS)' failed.`
[08:19:26] I'll try bringing it up again (that's the one with v17)
[08:22:05] morning, reading the backscroll...
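A minimal sketch of the kind of version guard discussed above at 07:57-08:10, assuming the expected major version comes from configuration (e.g. hiera) and using a hypothetical `run_on_node` ssh helper for illustration; this is not the actual wmcs-cookbooks code or the spicerack API.

```python
# Sketch only: refuse to bootstrap an OSD node whose installed ceph version
# doesn't match the major version the rest of the cluster is running.
# `run_on_node` is a made-up helper; the real cookbook would use its own
# remote-execution machinery and read the expected version from hiera/config.
import re
import subprocess


def run_on_node(host: str, command: str) -> str:
    """Hypothetical remote-execution helper (plain ssh here for illustration)."""
    return subprocess.run(
        ["ssh", host, command], check=True, capture_output=True, text=True
    ).stdout


def get_installed_ceph_version(host: str) -> str:
    # `ceph --version` prints e.g. "ceph version 16.2.15 (...) pacific (stable)"
    output = run_on_node(host, "ceph --version")
    match = re.search(r"ceph version (\d+)\.(\d+)\.(\d+)", output)
    if not match:
        raise RuntimeError(f"Could not parse ceph version from: {output!r}")
    return ".".join(match.groups())


def check_ceph_version(host: str, expected_major: int) -> None:
    installed = get_installed_ceph_version(host)
    if int(installed.split(".")[0]) != expected_major:
        raise RuntimeError(
            f"{host} runs ceph {installed}, expected major version {expected_major}; "
            "refusing to bootstrap it into the cluster."
        )


# Example (hypothetical FQDN):
# check_ceph_version("cloudcephosd1042.eqiad.wmnet", expected_major=16)
```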
[08:22:42] I will check the kernel errors just in case, and comment in T402475
[08:22:43] T402475: KernelErrors - https://phabricator.wikimedia.org/T402475
[08:23:17] it failed :/
[08:45:31] I was looking at the wmcs ceph dashboards, what are the dashboards and timeframes that show the problem?
[08:46:48] also FYI I'll be OOO this afternoon and tomorrow all day
[08:49:37] our ceph dashboards are all linked, here's the health one https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&from=now-6h&to=now&timezone=utc
[08:50:23] thank you dcaro!
[08:50:56] hmm... I'm seeing a suspiciously persistent loss of jumbo frames/pings
[08:52:43] just split the lost jumbo frames graph by dst_host/src_host, all of them are to cloudcephosd1004... xd
[08:52:58] *s/all/a lot/
[08:54:03] hmm.... they have been failing for a while it seems, for a bunch of the osds
[08:54:06] looking
[08:55:16] this is also weird (from a `ceph status -w`): `2025-08-21T08:47:44.400957+0000 osd.71 [ERR] osd.71 found snap mapper error on pg 3.4d9 oid 3:9b28c323:::rbd_data.c5122a153c6eb1.0000000000000941:fd5a1 snaps in mapper: {}, oi: {fd5a1} ...repaired`
[08:55:33] I'll open a task to dump things there
[09:03:03] Created T402499
[09:03:04] T402499: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499
[09:11:50] there's some jumbo packet loss towards 1004
[09:11:53] https://usercontent.irccloud-cdn.com/file/LI1OeLoD/image.png
[09:12:39] not from it though https://usercontent.irccloud-cdn.com/file/twPZumqV/image.png
[09:13:04] and there was packet loss for the new 104* nodes
[09:15:29] the loss towards 1004 is mostly from cloudcephosd1043/44/47, the rest had some loss at ~3am UTC (reboot I guess)
[09:17:45] sal is down?
[09:18:53] hmm, sal is up now for me, there might be some instability
[09:20:00] could it be the ongoing rolling reboot of nfs workers?
[09:20:26] maybe, it might cause some small interruption (moving the pod to a different worker node, especially if it only has 1 replica)
[09:20:34] there's nothing on sal from this morning https://sal.toolforge.org/admin
[09:21:04] oh yeah totally, looks like it is indeed busted
[09:21:33] at least for admin, I ran a couple of vm_console earlier today
[09:22:13] toolsbeta seems to work, let me try admin
[09:22:41] it showed up
[09:22:52] maybe it was busted while the issue was happening
[09:23:00] could be yeah
[10:09:35] it's weird, osd 65 ended up starting correctly and joining the cluster, 66 is still failing, it has an old osdmap it seems, looking
[10:15:20] quick review: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tf-infra-test/-/merge_requests/6
[10:27:56] LGTM
[10:28:31] thanks!
[10:31:41] the ceph warning status is me, somehow the osd service I'm trying to get up and running was able to log a crash report (but did not do it before :/)
[10:32:19] ack
[10:52:04] * dcaro lunch
[11:59:45] (from engineering-all): We are having a massive spam problem in paste.toolforge.org https://paste.toolforge.org/lists
[12:05:21] that is unfortunately not a new thing, T189255
[12:05:22] T189255: paste.toolforge.org is continuously spammed - https://phabricator.wikimedia.org/T189255
[12:09:05] doesn't it have a captcha?
[12:14:37] a pretty crap one if you look at it :)
[12:14:47] xd
[12:29:42] ok I'm off, see you next week!
[12:32:19] cya!
[13:16:20] I did my best to not leave things in a mess last night, did I fail?
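For the jumbo frame/ping loss towards cloudcephosd1004 discussed above (08:50-09:15), here is a minimal sketch of how one could manually verify the jumbo-frame path to a peer, assuming Linux iputils `ping` (`-M do` forbids fragmentation, `-s 8972` = 9000-byte MTU minus 20-byte IP and 8-byte ICMP headers). Hostnames are shortened for illustration; the real checks run from the dashboards' probes.

```python
# Sketch: returns True if at least one un-fragmented 8972-byte ping got a reply,
# i.e. the MTU-9000 path to the peer works end to end.
import subprocess


def jumbo_ping_ok(host: str, count: int = 3) -> bool:
    result = subprocess.run(
        ["ping", "-M", "do", "-s", "8972", "-c", str(count), "-W", "2", host],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


for peer in ["cloudcephosd1004", "cloudcephosd1043", "cloudcephosd1044"]:
    print(peer, "jumbo ok" if jumbo_ping_ok(peer) else "jumbo frames lost/fragmented")
```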
[13:42:31] that captcha is cute
[13:47:08] andrewbogott: nono, everything was stable, I was just looking at what the current status is
[13:47:17] ok!
[13:47:33] I'm working on a cookbook patch to check the version before setup
[14:45:30] please remember to add your updates to the meeting etherpad
[14:57:21] taavi: are you free to run the meeting or would you like me to?
[14:57:29] I will be there
[15:00:25] so I see!
[15:53:52] dcaro, dhinus, here's my cookbook patch to guard against version mismatch https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1180878
[15:57:15] andrewbogott: LGTM, but I'd wait for a cross-check from dcaro before merging.
[15:57:46] thx
[15:59:37] +1d
[16:00:19] andrewbogott: btw. are you setting the cluster to noin/norebalance?
[16:00:38] (might be a leftover from me trying to add the last osd of 1004 xd)
[16:01:09] that's me. I'm testing my patch and as it is now the cookbook leaves things in noin even if the initial checks fail
[16:02:16] okok
[16:02:26] hmm, maybe that's something to improve xd
[16:02:52] yeah
[16:03:05] although when the checks fail it's almost always followed up by a re-run which finishes
[16:07:01] andrewbogott: I tried to look at the jumbo frames issues as a part of T401693 but I don't see the second network connections for 1043/4/7 documented in netbox at all
[16:07:02] T401693: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693
[16:07:52] the OS says it's up...
[16:08:04] I guess that means it's plugged into a switch but not entered or configured?
[16:08:41] * andrewbogott trying to pool 1042 again, this time with the right ceph packages
[16:08:51] * taavi looks at the running switch config
[16:10:29] taavi: want me to ping in dcops?
[16:12:22] andrewbogott: yeah, as far as I can tell at least 1043 does not have that second connection at all
[16:17:07] taavi: can you respond to val about what you do or don't see for 1042 vs 1043?
[16:22:52] andrewbogott: yes
[16:23:06] ty!
[16:32:10] there's a couple of alerts, one for replicas and one for cloudnet, anyone looking into those? (I'll quickly check the replicas one)
[16:32:31] oh, there's a silence `Maintenance - fceratto@cumin1002` that expired though
[16:33:13] yes, and replication lag alerts are not for us to worry about
[16:34:58] the cloudnet one does look interesting, I can't quickly figure out what's going on there
[16:35:16] Aug 21 16:21:06 cloudnet1005 puppet-agent[702186]: (/Stage[main]/Systemd::Timesyncd/Package[systemd-timesyncd]/ensure) systemd-timesyncd : Depends: systemd but it is not going to be installed
[16:35:20] uhh what is happening there
[16:36:43] something upgraded openssh, which broke things somehow
[16:37:02] ack
[16:37:34] did you do something?
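On the 16:01-16:03 point that the cookbook leaves the cluster in `noin`/`norebalance` when the initial checks fail: a hedged sketch of the cleanup pattern that would avoid that. The `ceph osd set`/`unset` commands are real, but `run_ceph` and the `bootstrap_osd` callable are placeholders, not the actual cookbook API.

```python
# Sketch: wrap the bootstrap in try/finally so the maintenance flags are
# always unset, even when a pre-check (e.g. the version guard) raises.
import subprocess

MAINTENANCE_FLAGS = ["noin", "norebalance"]


def run_ceph(*args: str) -> None:
    subprocess.run(["ceph", *args], check=True)


def bootstrap_with_flags(bootstrap_osd, host: str) -> None:
    for flag in MAINTENANCE_FLAGS:
        run_ceph("osd", "set", flag)
    try:
        bootstrap_osd(host)  # version checks, OSD creation, etc. may raise here
    finally:
        # Runs whether the bootstrap succeeded or a check failed, so the
        # cluster is not left sitting in noin/norebalance after an abort.
        for flag in MAINTENANCE_FLAGS:
            run_ceph("osd", "unset", flag)
```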
[16:39:25] nope, I was assuming it was an unattended upgrade or similar
[16:39:47] "something upgraded openssh" sounds like the kind of thing that would happen fleet-wide, or not at all
[16:41:29] wikiprod-realm hosts do not run unattended-upgrades, m.oritzm / IF take care of those manually with some specialised tooling
[16:41:31] see -sre
[16:42:28] 'or similar' == moritz probably did it :)
[16:42:44] oops, sorry for the ping morit.z, disregard
[16:43:51] in this case I did confirm that from the auth and sudo logs, but yes
[17:25:57] dhinus: I don't know how you feel about reading partman recipes, but the one for those new boss-card systems is hwraid-1dev-nvme.cfg -- it's pretty simple
[17:27:35] andrewbogott: I'm happy to take a look tomorrow, leave me some pointers in T402475
[17:27:35] T402475: KernelErrors - https://phabricator.wikimedia.org/T402475
[17:31:09] andrewbogott: Can you refresh my memory -- does instance resize work these days? I have a g4.cores2.ram4.disk20 instance that needs more ram apparently and would like to avoid a full rebuild at the moment.
[17:31:34] yes
[17:33:29] it should work fine. You need to confirm that things are still working post-resize and then hit the 'confirm' button in horizon after
[17:38:53] it seems to have worked. I got an error message in Horizon that it failed, but I think that was from me double-submitting the confirmation step.
[18:00:52] andrewbogott: I'm sure you are working on other things, but https://gerrit.wikimedia.org/r/c/operations/puppet/+/1163883 has been hanging out for a while. Should I look for someone else to review and merge?
[18:04:06] I can look at it
[18:16:23] thank you
[18:30:35] * dcaro off
[18:30:39] cya! o\
[18:30:42] \o xd
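For reference, the instance resize flow discussed at 17:31-17:38 (resize, verify the VM, then confirm) can also be driven from the OpenStack CLI. A hedged sketch follows; the server and flavor names are made up, credentials are assumed to be set in the environment (OS_* variables), and on older openstackclient versions the confirm step is `openstack server resize --confirm <server>` instead.

```python
# Sketch of the CLI equivalent of the Horizon resize + confirm flow.
import subprocess


def openstack(*args: str) -> None:
    subprocess.run(["openstack", *args], check=True)


server = "my-instance"                 # hypothetical instance name
new_flavor = "g4.cores2.ram8.disk20"   # hypothetical larger flavor

openstack("server", "resize", "--flavor", new_flavor, server)
# The instance is now in VERIFY_RESIZE; log in and check things still work,
# then either confirm (equivalent to the Horizon 'confirm' button) ...
openstack("server", "resize", "confirm", server)
# ... or roll back with: openstack server resize revert <server>
```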