[08:05:36] morning!
[08:10:22] o/
[08:22:57] (•‿•)ノ
[08:44:29] o/
[08:48:43] \o
[10:00:43] btullis: you around?
[10:01:10] Oh yes, thanks for the ping. Was getting distracted :-)
[10:01:21] dcaro: we need to talk about patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1144583 proposed by btullis, which would need a restart of the ceph daemons
[10:04:28] There is a question about whether we should try to turn off duplicate logging (files + syslog) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1144583/comment/1b1a4d6b_de8ae0b1/
[10:05:16] My original intention was just to fix the logging to `/none`, even if we still end up with duplicate logs from some daemons.
[10:06:53] just asking, as the original intention of using none was to prevent ceph from logging to a file and use only syslog
[10:07:09] though the `none` file has very few logs
[10:07:17] (in cloud osd nodes at least)
[10:08:41] this logfile looks weird `/var/log/ceph/ceph-volume-systemd.log`
[10:09:33] `ceph-volume` is the only one logging under /var/log/ceph
[10:09:36] OK, on cephosd100[1-5] we get 4 lines added for each HTTP request served by radosgw. So it gets pretty big and there is no housekeeping on `/none`
[10:10:57] interesting, our radosgws don't log anything there
[10:11:05] (or under /var/log/ceph)
[10:11:44] +1 from me though, if we see logs getting out of control we can always add that option later
[10:12:31] hmm, we get the radosgw logs into syslog, but nowhere else it seems
[10:12:38] Cool. We also get the rados logs in `/var/log/ceph/radosgw/radosgw.log` so they are duplicates.
[10:13:09] :/, weird
[10:13:19] no custom config in `ceph config dump`
[10:13:27] (for logging)
[10:13:46] do you have syslog enabled?
[10:14:32] Agree. It is weird, but that's why I'm being cautious and only taking small steps. I also think that we can come back and improve this again, to reduce logging confusion. Maybe it will be easier if we can get the version numbers in sync, too.
[10:15:13] https://www.irccloud.com/pastebin/VYk7Lx5P/
[10:15:58] Broadly the same `[global]` config file as your clusters, I believe.
[10:16:18] yep
[10:17:00] So I'm going to merge then. Do you want to pause puppet on any of your clusters while I do a rolling restart on cephosd100[1-5]? Or are you happy to let puppet roll out the change?
[10:17:41] it will not restart the daemons even if the config changes, right?
[10:18:57] * btullis double checks
[10:20:32] I think that the radosgw service might restart, because it subscribes. https://github.com/wikimedia/operations-puppet/blob/production/modules/ceph/manifests/radosgw.pp#L11
[10:21:06] Same with mgr services: https://github.com/wikimedia/operations-puppet/blob/production/modules/ceph/manifests/mgr.pp#L36
[10:21:23] osds do not
[10:21:39] https://www.irccloud.com/pastebin/MGo3dQvr/
[10:21:53] I'm ok with that :)
[10:22:21] Cool. I think that mon services won't either. https://github.com/wikimedia/operations-puppet/blob/production/modules/ceph/manifests/mon.pp#L74 They require, but do not subscribe.
[10:23:40] Proceeding.
[10:24:16] 👍
[10:31:13] Confirmed. `ceph-rados@radosgw`, `ceph-mgr@cephosd1001` and `ceph-mds@cephosd1001` services all restarted on the test host. `ceph-osd` and `ceph-mon` services did not.
[10:32:38] I am still getting some messages coming through to the `/none` file, but they are all from OSD processes. I hadn't noticed them before.
[10:36:47] hmm, did you restart the ceph-osd manually?
[10:38:56] Not yet. Running puppet on the other 4 hosts first.
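
The `ceph config dump` check above only shows explicit overrides, so a quick way to see the effective logging targets per daemon type is to query the options directly. Below is a minimal sketch, assuming the standard `ceph config get` CLI and the `log_file`/`log_to_file`/`log_to_syslog` options; the daemon sections listed are illustrative and not taken from the pastebins above.

```python
#!/usr/bin/env python3
"""Rough sketch: report where each Ceph daemon type is configured to log.

Purely illustrative -- the section list and the pairing of options below are
assumptions, not taken from the clusters discussed above.
"""
import subprocess

# Daemon sections to inspect; adjust to match the cluster's layout.
SECTIONS = ["global", "mon", "mgr", "osd"]
OPTIONS = ["log_file", "log_to_file", "log_to_syslog"]


def config_get(section: str, option: str) -> str:
    """Return the effective value of one config option for one section."""
    out = subprocess.run(
        ["ceph", "config", "get", section, option],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def main() -> None:
    for section in SECTIONS:
        values = {opt: config_get(section, opt) for opt in OPTIONS}
        # Flag the duplicate-logging case discussed above: file and syslog at once.
        duplicated = values["log_to_file"] == "true" and values["log_to_syslog"] == "true"
        flag = "  <-- logs to both file and syslog" if duplicated else ""
        print(f"{section}: {values}{flag}")


if __name__ == "__main__":
    main()
```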
[10:41:55] ack, okok, so those logs are from the "old config osd"
[10:45:45] Yeah, I think so. The cookbook doesn't seem to like my arguments to restart only the osd and mon services.
[10:45:55] https://www.irccloud.com/pastebin/ihdTta6g/
[10:46:20] I'll just tell it to restart everything.
[10:46:26] xd
[10:46:41] maybe it expects several like `--daemons mon --daemons osd`
[10:48:01] Yeah, anyway, rolling restart under way with `sudo cookbook sre.ceph.roll-restart-reboot-server --alias cephosd --reason "T384322 stop logging to /none" restart_daemons`
[10:48:02] T384322: Ceph radosgw processes are logging to a file named `/none` on cephosd* servers - https://phabricator.wikimedia.org/T384322
[10:49:44] Sleeping for 300 seconds between hosts. I could have lowered that.
[10:50:29] Still, health is OK for now.
[11:01:42] I'll go for lunch (have another meeting after), I'll keep an eye too
[11:01:48] restarted a couple manually and it went ok
[11:03:42] Cool, thanks.
[11:14:48] All looks good here. Cookbook completed. Cluster health green. No new entries in `/none` from any daemons. Thanks all. Hope it goes as well for your restarts.
[11:59:42] arturo: I'm still half asleep but https://gerrit.wikimedia.org/r/c/operations/puppet/+/1145870 is probably worth applying
[12:01:47] andrewbogott: LGTM.
[12:02:06] Want me to leave your hotfixes in place and apply that manually?
[12:08:26] hotfixed and also merged. Although I fear that this problem is just intermittent and I'm fooling myself
[12:24:42] I'm in a meeting, but my hotfixes were reverted earlier
[12:24:54] so feel free to just merge yours
[12:32:27] I'm restarting some osds, the cluster will be in warning for a second
[12:32:34] (ceph stuff)
[12:33:50] btullis: that change made our osds start logging under /var/log/ceph
[12:35:10] maybe they were trying to log to /root/none but had no permissions xd
[12:35:22] Oh, ok. Yes, I think ours too, probably.
[12:36:12] Lots of puppet issues (non-ceph related xd), looking
[12:36:48] It looks like the new files are picked up by `/etc/logrotate.d/ceph-common`, so maybe the thing to do is to stop logging to syslog/journal to avoid duplication? Not urgent, just a thought.
[12:37:24] we log to syslog as it then gets sent to logstash
[12:37:43] the puppet stuff might be a false positive :/, looking
[12:40:59] hmm... manual run on tools-k8s-etcd-22 (that was flagged as failing) worked ok
[12:41:06] same for metricsinfra-control-2
[12:41:56] oh, it was an sssd change
[12:42:01] https://www.irccloud.com/pastebin/phUAUHbl/
[12:42:54] andrewbogott: ^ was that you?
[12:43:51] I think so xd `May 14 12:09:43 tools-k8s-etcd-22 puppet-agent[2721288]: Applying configuration version '(08197a0510) Andrew Bogott - sssd.conf: add more timeout settings'`, okok, so it will pass, I'll ack everything
[12:49:32] expect a tools-db alert as I'm working on T393766
[12:49:33] T393766: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-05-09 - https://phabricator.wikimedia.org/T393766
[12:51:38] anyone know what's up with cloudnet2006-dev? (I can't ssh)
[12:53:37] huh, didn't we have codfw1dev servers going unresponsive yesterday as well?
[12:53:48] also anyone looking at the WidespreadPuppetAgentFailure alerts?
[12:53:55] I'll look at the cloudnet
[12:59:47] we may need to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1145870
[13:00:02] andrewbogott: did you check that the new syntax is correct?
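
On the cookbook argument problem from earlier ("maybe it expects several like `--daemons mon --daemons osd`"): that pattern matches argparse's append behaviour. A toy sketch of how such a flag typically behaves; the option names and choices here are hypothetical, and the real sre.ceph cookbook's parser is not shown in this log.

```python
import argparse

# Hypothetical parser illustrating the repeated-flag style guessed at above;
# this is not the actual cookbook's argument handling.
parser = argparse.ArgumentParser(description="toy rolling-restart argument parsing")
parser.add_argument(
    "--daemons",
    action="append",          # each --daemons adds one entry to the list
    choices=["mon", "osd", "mgr", "radosgw"],
    help="daemon type to restart; repeat the flag for several types",
)
parser.add_argument("--alias", required=True)

# Works: one flag per daemon type -> Namespace(daemons=['mon', 'osd'], alias='cephosd')
print(parser.parse_args(["--alias", "cephosd", "--daemons", "mon", "--daemons", "osd"]))

# Fails: a single space-separated value is rejected by `choices`.
# parser.parse_args(["--alias", "cephosd", "--daemons", "mon osd"])
```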
[13:02:11] at some point the sssd puppetization was always failing to restart it because of weird systemd unit dependencies, not sure if still the case
[13:02:35] the issue only happens when trying to restart it, then it starts up by itself
[13:02:40] (and puppet starts passing)
[13:02:43] the codfw1dev host lockups are https://phabricator.wikimedia.org/T393366, something must've installed the new kernel
[13:04:07] that might affect eqiad too :/
[13:05:29] looking at https://debmonitor.wikimedia.org/kernels/5_1-smp-preempt_dynamic-debian-61135-1-2025-04-25, cloudvirt1075 and cloudvirt1076 are the only non-dev cloud hosts with that kernel
[13:05:52] but I wonder how the kernel got there in the first place, https://phabricator.wikimedia.org/T393366 says it was manually uninstalled
[13:06:04] I guess the epoxy upgrade cookbooks could have accidentally installed that?
[13:06:15] probably
[13:06:24] (sounds likely to me that they might)
[13:06:36] I'm going to reboot cloudnet2006-dev back to the working kernel while I'm already logged in via mgmt
[13:08:04] this is the second time that cloudnet2006-dev has locked up like that, so it may or may not be the kernel
[13:08:10] but I guess we'll see if it keeps happening
[13:08:21] meanwhile... I'll restart sssd cloud-wide
[13:09:25] andrewbogott: I think sssd restarts by itself
[13:09:44] dcaro: last time around puppet couldn't restart it. Seems ok this time for some reason
[13:09:54] Guess I'll leave it alone until we have a problem.
[13:09:58] could someone make a task about the puppet restart failures? we should eventually fix the manifests to make it restart properly
[13:10:44] andrewbogott: at least on the few VMs I checked it had restarted by itself
[13:10:49] taavi: since it didn't happen just now when puppet changed sssd.conf... it may not be a recurring issue
[13:10:55] dcaro: yeah, same with the one I'm looking at
[13:11:09] then what were those puppet failure alerts about?
[13:11:44] https://www.irccloud.com/pastebin/YLyfOo3f/
[13:11:49] Sorry, I'm playing catch-up, can you give me an example?
[13:12:06] I think it restarts itself often, there's some interaction with the parent sssd service I think
[13:13:47] https://phabricator.wikimedia.org/P76147
[13:14:36] yep, the `sssd-nss` fails to restart from puppet, but I think it's because `sssd` manages it
[13:15:10] anyhow, needs investigation as you say
[13:15:33] i'll file a task
[13:15:56] taavi: yesterday when puppet touched sssd.conf puppet runs failed until I manually restarted sssd. I don't see that happening today.
[13:16:06] I don't mean that there's no issue, only that it seems weirdly different
[13:16:18] yes, but Puppet runs still shouldn't fail when touching the config, even if it'll recover by itself
[13:16:49] ah, you're saying this time around it failed on the first run and then worked subsequently?
[13:18:23] T394304
[13:18:24] T394304: Make Puppet able to reliabily restart sssd - https://phabricator.wikimedia.org/T394304
[14:19:42] hmm, I can't move tasks in bulk anymore :/
[14:19:58] can't say I love that OSPF alert, although it also says everything is up...
[14:20:11] which alert?
[14:20:34] it says ` OSPFv3: 0/1 UP` at the end
[14:21:01] https://usercontent.irccloud-cdn.com/file/xxvxKVU2/image.png
[14:22:30] you're right
[14:23:01] topranks: ^ seems bad
[14:23:23] The same thing alerted in codfw1dev yesterday but cleared after I rebooted a cloudnet (possibly coincidentally)
[14:23:34] that would be a coincidence yes
[14:23:38] I presume a link has failed
[14:24:02] it seems ok though
[14:24:04] right now
[14:24:17] https://www.irccloud.com/pastebin/F11LqM0S/
[14:24:40] FWIW ospf on cloudsw2-d5 is only used to export the loopback of that switch for monitoring, so OSPF isn't used here for forwarding cloud traffic as such
[14:24:55] however it failing probably suggests an issue with a link, which will certainly affect server traffic
[14:25:36] is 'link' in this context a hw thing?
[14:25:56] yeah, like a connection between the switches
[14:26:40] so the switch has not logged any ospf issue today
[14:26:46] or yesterday fwiw
[14:27:17] dhinus: can you add me to https://phabricator.wikimedia.org/project/profile/13/ ?
[14:27:35] dcaro: done
[14:27:40] thanks! :)
[14:27:58] yeah it's been up for the past 4 weeks
[14:28:03] cmooney@cloudsw2-d5-eqiad> show ospf neighbor detail
[14:28:03] Address Interface State ID Pri Dead
[14:28:03] 10.64.147.10 irb.1126 Full 10.64.146.253 128 35
[14:28:03] Area 0.0.0.0, opt 0x52, DR 0.0.0.0, BDR 0.0.0.0
[14:28:03] Up 4w2d 03:38:26, adjacent 4w2d 03:38:26
[14:28:29] so perhaps something is going wrong with the Icinga check
[14:28:38] Does it mean something that it thinks v2 is up but v3 down?
[14:29:00] I pasted the wrong one, v3 is also up that length of time
[14:29:04] https://www.irccloud.com/pastebin/5WGLi2ut/
[14:29:17] OSPFv2 is used for IPv4, OSPFv3 is used for IPv6
[14:29:37] ^ good to know
[14:29:42] typically they will share fate, like I say they ought to "just work" unless the link between switches goes down
[14:32:15] https://www.irccloud.com/pastebin/zXyJ4pyL/
[14:33:29] topranks: so, ignore for now and see if the alert continues to flap?
[14:36:58] lol yeah ignore, but I'm trying to get to the bottom of it
[14:37:07] Or send dcops folks to go wiggle the cables?
[14:37:10] ok, thank you for digging
[14:37:15] the lol is cos I'm looking at the OSPFv3 MIB, it defines these status levels:
[14:37:21] https://www.irccloud.com/pastebin/209f92S6/
[14:37:37] cloudsw2-d5 is returning this:
[14:37:38] OSPFV3-MIB::ospfv3NbrState.2.0.172004093 = INTEGER: 9
[14:38:00] as in, the defined states are 1-8, and for some reason it's saying the state is 9
[14:38:02] you gotta laugh
[14:38:04] secret 9th thing
[14:38:12] exactly. what are you guys up to?
[14:38:52] as far as I know nothing new/interesting is happening in eqiad1 this week, although dcops might have racked some new cloudvirts.
[14:38:53] I'll try bouncing the adjacency to see if it clears, might be some rare and esoteric juniper bug
[14:48:05] I'm gonna delete OSPF3 completely off the switch and re-add it, the alert might fire for cloudsw1-d5 but we can ignore it
[14:49:24] * andrewbogott braces for impact
[14:58:45] nah it's fine, like I say the OSPF there is only so we can ping the device loopback
[14:59:08] I'm seriously stumped tbh, I've reset it, deleted it, restarted snmpd, you name it
[14:59:14] it stubbornly wants to be '9'
[14:59:35] Any chance you're holding your laptop upside-down?
[15:00:20] wait a second...
[15:00:42] no, false alarm. I'm standing the right way up.
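
For context on the `INTEGER: 9` above: the OSPFv3 MIB only defines neighbor states 1 through 8, so a poller mapping the value by name has nothing sensible to report and will likely count the neighbor as not-up. A minimal sketch of that mapping, assuming the standard `ospfv3NbrState` values; the actual Icinga check's code is not shown in this log.

```python
# Neighbor states defined by OSPFV3-MIB (ospfv3NbrState); anything outside
# 1-8, like the 9 returned by cloudsw2-d5, is undefined in the MIB.
OSPFV3_NBR_STATES = {
    1: "down",
    2: "attempt",
    3: "init",
    4: "twoWay",
    5: "exchangeStart",
    6: "exchange",
    7: "loading",
    8: "full",
}


def describe_nbr_state(value: int) -> str:
    """Map a polled ospfv3NbrState integer to its MIB name."""
    name = OSPFV3_NBR_STATES.get(value)
    if name is None:
        # A check treating anything other than full(8) as "not up" would
        # explain the "OSPFv3: 0/1 UP" alert text while the session is fine.
        return f"undefined({value})"
    return name


print(describe_nbr_state(8))  # full
print(describe_nbr_state(9))  # undefined(9)
```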
[15:02:29] Cathal's workspace: https://miro.medium.com/v2/resize:fit:1400/format:webp/1*_HHTR1yzMyYdZPHh0SeNpw.png
[15:03:06] hahaha
[15:05:21] happy wikiversary topranks
[15:05:34] and jobo!
[15:06:08] thanks <3
[15:06:16] congrats!
[15:06:18] Thanks :)
[15:08:07] andrewbogott: I've silenced the alert long-term in Icinga, like I say OSPF on that box is not important at all. Probably we need to upgrade JunOS or reboot to resolve the bug, but that might be a bit drastic just for this, we can ignore it for now.
[15:08:20] works for me! thanks for investigating
[15:53:07] dcaro: did you form a theory about why cloudnet2006-dev locked up? Since it has happened twice I'm wondering if it should block upgrading eqiad1 to epoxy (which codfw1dev is running)
[15:54:01] andrewbogott: I have not checked, but the guess was that the upgrade process for epoxy pulled in the new kernel (that was pulled before also, that's why it broke the first time)
[15:54:24] congrats to the wikiversaryees!
[15:54:32] is that kernel known to be broken?
[15:56:55] yep
[15:57:28] https://phabricator.wikimedia.org/T393366 (from taavi's link earlier)
[16:01:16] ok. So in theory if I wait for 6.1.137-1 to be released before upgrading anything else we'll skip over the cursed kernel without extra effort.
[16:04:52] * andrewbogott schedules the upgrade for Tuesday
[16:20:54] 🤞 yep
[16:51:07] fyi toolsdb replica is catching up, should be in sync in a few hours. more details at T393766
[16:51:08] T393766: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-05-09 - https://phabricator.wikimedia.org/T393766
[16:52:18] \o/
[16:54:04] oh wow... double gtids...
[17:33:26] arturo: where would be a good place to write down that idea we had at the Hackathon about encoding the Toolforge tool group number in IPv6 for Toolforge k8s?
[17:43:17] * dcaro off
[17:44:08] bd808: potentially T380060?
[17:44:09] T380060: Support IPv6 in Toolforge Kubernetes - https://phabricator.wikimedia.org/T380060
[18:09:51] Added at https://phabricator.wikimedia.org/T380060#10823558
[21:13:45] bd808: I have been thinking about the actual implementation, and hopefully I will have time in the next few days to write my idea down in the ticket. How to convince k8s to do that will be "fun"
[21:15:33] arturo: awesome :)
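
A minimal sketch of what encoding a tool's group number into IPv6 could look like, purely speculative: the prefix, the bit layout and the helper names below are made up for illustration, and the actual proposal lives at https://phabricator.wikimedia.org/T380060#10823558. The idea shown is that a per-tool subnet carries the tool's numeric GID in its high host bits, so the GID can be read straight back from any pod address.

```python
import ipaddress

# Hypothetical /64 allocated to Toolforge Kubernetes; not the real prefix.
BASE_PREFIX = ipaddress.IPv6Network("2001:db8:100:200::/64")


def tool_network(tool_gid: int, bits: int = 16) -> ipaddress.IPv6Network:
    """Carve a per-tool subnet out of BASE_PREFIX by encoding the tool's GID.

    The GID lands in the first `bits` host bits, so it can be recovered from
    any address belonging to that tool.
    """
    if tool_gid >= 2 ** bits:
        raise ValueError(f"GID {tool_gid} does not fit in {bits} bits")
    shift = 128 - BASE_PREFIX.prefixlen - bits
    network_int = int(BASE_PREFIX.network_address) | (tool_gid << shift)
    return ipaddress.IPv6Network((network_int, BASE_PREFIX.prefixlen + bits))


def gid_from_address(addr: ipaddress.IPv6Address, bits: int = 16) -> int:
    """Recover the tool GID from a pod address inside BASE_PREFIX."""
    host_bits = int(addr) & ((1 << (128 - BASE_PREFIX.prefixlen)) - 1)
    return host_bits >> (128 - BASE_PREFIX.prefixlen - bits)


net = tool_network(52503)        # e.g. a tool with GID 52503
print(net)                       # 2001:db8:100:200:cd17::/80
print(gid_from_address(net[1]))  # 52503
```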