[03:55:21] <andrewbogott>	 In case there are more after-effects later, here's what just happened:
[03:55:21] <andrewbogott>	 - I added a new ceph node, cloudcephosd1042, with the cookbook. This went haywire because (due to a race condition in puppet) 1042 was running an old version of ceph, v14 (most of the cluster is running v16)
[03:55:21] <andrewbogott>	 - Somehow when the v14 client tried to talk to the one v17 client (on cloudcephosd1004) it caused the nodes on cloudcephosd1004 to crash
[03:55:21] <andrewbogott>	 - Again, 'somehow' that crash didn't just cause a rebalance, but caused a bunch of pgs to go read-only. Very weird behavior for only one node going down, but it happened.
[03:55:21] <andrewbogott>	 - That meant that for a few minutes ceph misbehaved badly enough that some VMs froze, and a lot of toolforge jobs flapped
[03:55:22] <andrewbogott>	 - As soon as I switched off 1042 and 1004, everything got better
[03:55:22] <andrewbogott>	 - I restarted some unhappy nfs worker nodes just in case (although I suspect they would've recovered on their own anyway)
[03:55:57] <andrewbogott>	 I'll have another go tomorrow now that i know to double-check the ceph version on new ods.
[03:56:42] <andrewbogott>	 Oh, also, when I reimaged ceph OSDs a lot of them raised kernel errors, which I'm confident is unrelated and just a side-effect of reimaging and rebooting.
[06:44:24] <godog>	 ack, good luck
[06:44:28] <godog>	 and greetings
[06:47:57] <godog>	 as heads up, I'll be off this afternoon and tomorrow all day
[06:56:54] <godog>	 I see a bunch of alerts for nfs workers, I take it the "fix" is to restart them via cookbook ?
[07:13:40] <taavi>	 yep, or to run the cookbook to reboot all the nfs workers which is usually the way to go after major ceph/nfs blips
[07:18:41] <godog>	 ok that'd be wmcs.toolforge.k8s.reboot + options
[07:18:58] <taavi>	 yeah, I don't remember the name off-hand but there's some flag to do all the nfs workers
[07:19:10] <godog>	 cheers
[07:19:39] <taavi>	 and it'll take a while to reboot all the workers, but that's fine
[07:20:56] <godog>	 ok I'm taking a few minutes to poke things around then will start the reboot
[07:21:01] <godog>	 the reboots even
[07:28:44] <godog>	 root@cloudcumin1001:~# cookbook wmcs.toolforge.k8s.reboot --cluster-name tools --all-nfs-workers 
[07:28:47] <godog>	 FTR
[07:55:33] <dcaro>	 morning
[07:57:30] <dcaro>	 we can add a version check for the ceph cookbook to avoid bootstrapping a node if it has the wrong ceph version running
[07:57:56] <dcaro>	 (means that we will have to keep track of that version somewhere in the cookbooks)
[08:03:37] <dcaro>	 hmm.... the bootsrap_and_add cookbook already runs puppet, reboots and runs puppet again, that should have pulled in the newer packages before starting the osds
[08:04:28] <taavi>	 puppet ordering issue maybe?
[08:05:59] <dcaro>	 ohhh, I think that we don't upgrade the packages from puppet, so it keeps the ones it installed before having the thirdparty repo configured, let me check
[08:06:52] <taavi>	 in which case you could pull the expected ceph version from hiera instead of having to hardcode that in the cookbook
[08:08:31] <dcaro>	 yep, we don't enforce any version
[08:10:08] <dcaro>	 that breaks the upgrade process though, as it's done with cookbooks, not puppet
[08:18:15] <dcaro>	 on 1004 side, it killed itself as it detected a very old version for the peer `Aug 21 03:33:52 cloudcephosd1004 ceph-osd[7869]: ceph-osd: ./src/osd/PeeringState.cc:1255: bool PeeringState::check_prior_readable_down_osds(const OSDMapRef&): Assertion `HAVE_FEATURE(upacting_features, SERVER_OCTOPUS)' failed.
[08:18:15] <dcaro>	 `
[08:19:26] <dcaro>	 I'll try bringing it up again (that's the one with v17)
[08:22:05] <dhinus>	 morning, reading the backscroll...
[08:22:42] <dhinus>	 I will check the kernel errors just in case, and comment in T402475
[08:22:43] <stashbot>	 T402475: KernelErrors  - https://phabricator.wikimedia.org/T402475
[08:23:17] <dcaro>	 it failed :/
[08:45:31] <godog>	 I was looking at the wmcs ceph dashboards, what are the dashboards and timeframes that show the problem ?
[08:46:48] <godog>	 also FYI I'll be OOO this afternoon and tomorrow all day
[08:49:37] <dcaro>	 our ceph dashboards are all linked, here's the health one https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&from=now-6h&to=now&timezone=utc
[08:50:23] <godog>	 thank you dcaro !
[08:50:56] <dcaro>	 hmm... I'm seeing a suspiciously persistent loss jumbo frames/pings
[08:52:43] <dcaro>	 just split the lost jumbo frames graph by dst_host/src_host, all of them are to cloudcephosd1004... xd
[08:52:58] <dcaro>	 *s/all/a lot/
[08:54:03] <dcaro>	 hmm.... they have been failing for a while it seems, for a bunch of the osds
[08:54:06] <dcaro>	 looking
[08:55:16] <dcaro>	 this is also weird (from a `ceph status -w`) `2025-08-21T08:47:44.400957+0000 osd.71 [ERR] osd.71 found snap mapper error on pg 3.4d9 oid 3:9b28c323:::rbd_data.c5122a153c6eb1.0000000000000941:fd5a1 snaps in mapper: {}, oi: {fd5a1} ...repaired`
[08:55:33] <dcaro>	 I'll open a task to dump things there
[09:03:03] <dcaro>	 Created T402499
[09:03:04] <stashbot>	 T402499: [ceph] 2025-08-21 ceph issues bringing new osds up - https://phabricator.wikimedia.org/T402499
[09:11:50] <dcaro>	 there's some jumbo packet loss towards 1004
[09:11:53] <dcaro>	 https://usercontent.irccloud-cdn.com/file/LI1OeLoD/image.png
[09:12:39] <dcaro>	 not from though https://usercontent.irccloud-cdn.com/file/twPZumqV/image.png
[09:13:04] <dcaro>	 and there was packet loss for the new 104* nodes
[09:15:29] <dcaro>	 the loss towards 1004 is mostly the cloudcephosd1043/44/47, the rest had some loss at ~3am UTC (reboot I guess)
[09:17:45] <dcaro>	 sal is down?
[09:18:53] <dcaro>	 hmm, sal is up now for me, there might be some instability
[09:20:00] <godog>	 could it be the ongoing rolling reboot of nfs workers ?
[09:20:26] <dcaro>	 maybe, it might cause some small interruption (moving the pod to a different worker node, specially if it only has 1 replica)
[09:20:34] <dcaro>	 there's nothing on sal from this morning https://sal.toolforge.org/admin
[09:21:04] <godog>	 oh yeah totally, looks like it is indeed busted
[09:21:33] <godog>	 at least for admin, I ran a couple of vm_console earlier today
[09:22:13] <dcaro>	 toolsbeta seem to work, let me try admin
[09:22:41] <dcaro>	 it showed up
[09:22:52] <dcaro>	 maybe it was busted while the issue was hapenning
[09:23:00] <godog>	 could be yeah
[10:09:35] <dcaro>	 it's weird, osd 65 ended up starting correctly and joining the cluster, 66 is still failing, it has an old osdmap it seems, looking
[10:15:20] <dhinus>	 quick review: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tf-infra-test/-/merge_requests/6
[10:27:56] <godog>	 LGTM
[10:28:31] <dhinus>	 thanks!
[10:31:41] <dcaro>	 the ceph warning status is me, somehow the osd service I'm trying to get up and running was able to log a crash report (but did not do it before :/)
[10:32:19] <godog>	 ack
[10:52:04] * dcaro lunch
[11:59:45] <Amir1>	 (from engineering-all): We are having a massive spam problem in paste.toolforge.org https://paste.toolforge.org/lists
[12:05:21] <taavi>	 that is unfortunately not a new thing, T189255
[12:05:22] <stashbot>	 T189255: paste.toolforge.org is continuously spammed - https://phabricator.wikimedia.org/T189255
[12:09:05] <dcaro>	 doesn't it have a captcha?
[12:14:37] <Reedy>	 a pretty crap one if you look at it :)
[12:14:47] <dcaro>	 xd
[12:29:42] <godog>	 ok I'm off, see you next week!
[12:32:19] <dcaro>	 cya!
[13:16:20] <andrewbogott>	 I did my best to not leave things in a mess last night, did I fail?
[13:42:31] <Amir1>	 that captcha is cute
[13:47:08] <dcaro>	 andrewbogott: nono, everything was stable, just was looking to what's the current status
[13:47:17] <andrewbogott>	 ok!
[13:47:33] <andrewbogott>	 I'm working on a cookbook patch to check the version before setup
[14:45:30] <taavi>	 please remember to add your updates to the meeting etherpad
[14:57:21] <andrewbogott>	 taavi: are you free to run the meeting or would you like me to?
[14:57:29] <taavi>	 I will be there
[15:00:25] <andrewbogott>	 so I see!
[15:53:52] <andrewbogott>	 dcaro, dhinus, here's my cookbook patch to guard against version mismatch  https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1180878
[15:57:15] <dhinus>	 andrewbogott: LGTM, but I'd wait for a cross-check from dcaro before merging.
[15:57:46] <andrewbogott>	 thx
[15:59:37] <dcaro>	 +1d
[16:00:19] <dcaro>	 andrewbogott: btw. are you setting the cluster to noin/norebalance?
[16:00:38] <dcaro>	 (might be a leftover from me trying to add the last osd of 1004 xd)
[16:01:09] <andrewbogott>	 that's me. I'm testing my patch and as it is now the cookbook leaves things in noin even if the initial checks fail
[16:02:16] <dcaro>	 okok
[16:02:26] <dcaro>	 hmm, maybe that's something to improve xd
[16:02:52] <andrewbogott>	 yeah
[16:03:05] <andrewbogott>	 although when the checks fail it's almost always followed up by a re-run which finishes
[16:07:01] <taavi>	 andrewbogott: I tried to look at the jumbo frames issues as a part of T401693 but I don't see the second network connections for 1043/4/7 documented in netbox at all
[16:07:02] <stashbot>	 T401693: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693
[16:07:52] <andrewbogott>	 the OS says it's up...
[16:08:04] <andrewbogott>	 I guess that means it's plugged into a switch but not entered or configured?
[16:08:41] * andrewbogott trying to pool 1042 again, this time with the right ceph packages
[16:08:51] * taavi looks at the running switch config
[16:10:29] <andrewbogott>	 taavi: want me to ping in dcops?
[16:12:22] <taavi>	 andrewbogott: yeah, as far as I can tell at least 1043 does not have that second connection at all
[16:17:07] <andrewbogott>	 taavi: can you respond to val about what you do or don't see for 1042 vs 1043?
[16:22:52] <taavi>	 andrewbogott: yes
[16:23:06] <andrewbogott>	 ty!
[16:32:10] <dcaro>	 there's a couple alerts, one for replicas and one for cloudnet, anyone looking into those? (I'll quickly check the replicas one)
[16:32:31] <dcaro>	 oh, there's a silence ` Maintenance - fceratto@cumin1002` expired though
[16:33:13] <taavi>	 yes, and replication lag alerts are not for us to worry about
[16:34:58] <taavi>	 the cloudnet one does look interesting, I can't quickly figure out what's going on there
[16:35:16] <taavi>	 Aug 21 16:21:06 cloudnet1005 puppet-agent[702186]: (/Stage[main]/Systemd::Timesyncd/Package[systemd-timesyncd]/ensure)  systemd-timesyncd : Depends: systemd but it is not going to be installed
[16:35:20] <taavi>	 uhh what is happening there
[16:36:43] <taavi>	 something upgraded opensshd, which broke things somehow
[16:36:58] <dcaro>	 ack
[16:37:02] <andrewbogott>	 great, so that will probably start happening everywhere in 10 minutes or so :/
[16:37:34] <taavi>	 did you do something?
[16:39:25] <andrewbogott>	 nope, I was assuming it was an unattended upgrade or similar
[16:39:47] <andrewbogott>	 "something upgraded opensshd" sounds like the kind of thing that would happen fleet-wide, or not at all
[16:41:29] <taavi>	 wikiprod-realm hosts do not run unattended-upgrades, m.oritzm / IF take care of those manually with some specialised tooling
[16:41:31] <taavi>	 see -sre
[16:42:28] <andrewbogott>	 'or similar' == moritz probably did it :)
[16:42:44] <andrewbogott>	 oops, sorry for the ping morit.z, disregard
[16:43:51] <taavi>	 in this case I did confirm that from the auth and sudo logs, but yes
[17:25:57] <andrewbogott>	 dhinus: i don't know how you feel about reading partman recipes, but the one for those new boss-card systems is hwraid-1dev-nvme.cfg -- it's pretty simple
[17:27:35] <dhinus>	 andrewbogott: I'm happy to take a look tomorrow, leave me some pointers in T402475
[17:27:35] <stashbot>	 T402475: KernelErrors  - https://phabricator.wikimedia.org/T402475
[17:31:09] <bd808>	 andrewbogott: Can you refresh my memory--Does instance resize work these days? I have a g4.cores2.ram4.disk20 instance that needs more ram apparently and would like to avoid a full rebuild at the moment.
[17:31:34] <taavi>	 yes
[17:33:29] <andrewbogott>	 it should work fine. You need to confirm that things are still working post-resize and then hit the 'confirm' button in horizon after
[17:38:53] <bd808>	 it seems to have worked. I got an error message in Horizon that it failed, but I think that was from me double submitting the confirmation step.
[18:00:52] <bd808>	 andrewbogott: I'm  sure you are working on other things, but https://gerrit.wikimedia.org/r/c/operations/puppet/+/1163883 has been hanging out for a while. Should I look for someone else to review and merge?
[18:04:06] <andrewbogott>	 I can look at it
[18:16:23] <bd808>	 thank you
[18:30:35] * dcaro off
[18:30:39] <dcaro>	 cya! o\
[18:30:42] <dcaro>	 \o xd