[07:04:33] greetings
[07:57:34] morning!
[08:16:21] re: nova-compute pages from yesterday, I got https://gerrit.wikimedia.org/r/c/operations/alerts/+/1182034 as a proposal for higher level alerting, to be moved to paging, please let me know what you think!
[08:17:20] re: ceph slow ops from yesterday, if I'm reading https://grafana.wikimedia.org/goto/PXVoTLXHR?orgId=1 correctly it looks like we've asked ceph to move ~14x the data we normally do when commissioning an osd?
[08:17:46] more like ~13x but you get the idea
[08:42:08] the amount of data that ceph shuffles when adding a node might not be the same every time (depends on the rebalancing and the new shape of the cluster if things have to be shuffled around), but such a big difference is weird yep
[08:42:41] we had one such spike in the past
[08:42:42] https://grafana.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance?from=now-7d&orgId=1&timezone=utc&to=now&viewPanel=panel-124
[08:42:45] (in the past week)
[08:44:28] I'm a bit undecided about whether to make the nova alert page or not xd
[08:44:29] nice find, and indeed I see CephSlowOps around the same time
[08:45:10] as in, it's ok if it's down for a bit, but should probably not be down for a whole weekend
[08:45:59] yeah, I'll change it to paging and extend the 'for' at least for now
[08:47:34] +1
[08:47:45] re: ceph rebalance I'm checking undrain_node and it asked ceph to pool all cloudcephosd1048 OSDs at the same time with full weight
[08:47:48] we can discuss in the team meeting and decide something there
[08:47:50] so I guess expected
[08:47:50] (for the page)
[08:48:03] 2025-08-25 17:31:08,226 andrew 4037863 [INFO] [0/9 osds] Undraining osd batch 1: [OSDIdNode(osd_id=185, node_fqdn='cloudcephosd1048.eqiad.wmnet'), OSDIdNode(osd_id=184, node_fqdn='cloudcephosd1048.eqiad.wmnet'), OSDIdNode(osd_id=183, node_fqdn='cloudcephosd1048.eqiad.wmnet'), OSDIdNode(osd_id=175, node_fqdn='cloudcephosd1048.eqiad.wmnet'), OSDIdNode(osd_id=174,
[08:48:05] I was going to check that :)
[08:48:08] node_fqdn='cloudcephosd1048.eqiad.wmnet'), OSDIdNode(osd_id=173, node_fqdn='cloudcephosd1048.eqiad.wmnet'), OSDIdNode(osd_id=172, node_fqdn='cloudcephosd1048.eqiad.wmnet'), OSDIdNode(osd_id=171, node_fqdn='cloudcephosd1048.eqiad.wmnet'), OSDIdNode(osd_id=170, node_fqdn='cloudcephosd1048.eqiad.wmnet')]
[08:48:13] sigh ok
[08:48:15] that might be a lot yep
[08:48:15] you get the idea
[08:48:38] by default iirc we were pooling 4 at a time? and these are bigger drives too
[08:48:44] so probably that's why the reshuffle spike
[08:49:03] yeah that lines up pretty well, definitely too much reshuffling at the same time
[08:49:26] can you try to see if the previous did the same? as in all at the same time?
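A minimal sketch of how the same undrain could be done more gradually by hand, assuming the stock ceph CLI on a mon/cloudcontrol host; the osd id and weight values below are illustrative only, and the final weight should match what `ceph osd df tree` reports for the device:

  # limit concurrent backfills so each step shuffles less data at once
  # (on releases using the mClock scheduler this knob may be overridden by the scheduler profile)
  sudo ceph config set osd osd_max_backfills 1
  # bring one new OSD in at partial crush weight first, instead of full weight for all nine at once
  sudo ceph osd crush reweight osd.185 1.0
  # wait for recovery/backfill to settle before the next step
  sudo ceph -s
  # then step it up to its full weight, and repeat for the next OSD
  sudo ceph osd crush reweight osd.185 3.6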
(the previous did not have that spike)
[08:50:02] this one: 2025-08-25 14:28:52,559 andrew 4019712 [INFO] [0/1 osds] Undraining osd batch 1: [OSDIdNode(osd_id=340, node_fqdn='cloudcephosd1047.eqiad.wmnet')]
[08:50:23] yeah one osd at a time no problem, all at the same time not so much
[08:50:31] after setting up QoS in the network, we were not sure if saturating the switches would cause issues, so I guess this tells us that it's still an issue
[08:50:51] (and probably that heartbeat/mon traffic at high priority is not enough for ceph to avoid slow ops)
[08:51:16] we did not see the heartbeat issues we saw the previous time (nodes were flagging other nodes as dead all around), so that's an improvement
[08:52:10] yeah an improvement for sure, I'd imagine some osds getting overloaded by the big shuffle
[08:54:47] we were looking at throttling options also yesterday, that might help avoid saturating the network, though it will slow down recovery/refill (better slow than breaking though xd)
[08:56:51] hehe for sure
[08:57:15] do you know if there's a breakdown of slow ops somewhere in metrics ?
[08:57:57] https://www.irccloud.com/pastebin/n9RcWQfA/
[08:58:24] the timeline of the undrains, two big ones matching the spikes yep
[08:59:06] for the slow ops no, ceph did not seem to export that info directly, but I wanted to extract it anyhow, I think there's a task somewhere (looking)
[09:02:47] different task, maybe I did not create it https://phabricator.wikimedia.org/T348716
[09:02:53] T348716
[09:02:54] T348716: [ceph] export number of bad sectors per-disk - https://phabricator.wikimedia.org/T348716
[09:03:09] ack, thank you
[09:33:30] Guys just having a look at the graphs for network usage last night.
[09:33:51] First observation is the only core links that maxed out were the connections from cloudsw1-c8 to the switches in racks E4 and F4
[09:33:52] https://grafana-rw.wikimedia.org/d/5p97dAASz/network-device-interface-queues-and-error-stats?orgId=1&from=now-24h&to=now&timezone=utc&var-site=000000006&var-device=cloudsw1-c8-eqiad:9804&var-interface=et-0%2F0%2F52&var-interface=et-0%2F0%2F53&refresh=30s
[09:34:37] ^^ what's also significant here is that the bursts were fairly short, and we have no "tail drops" (meaning total saturation) just "red drops" (meaning queues approaching saturation and drops are done to signal to TCP to backoff, which seems to have happened)
[09:35:05] 2) a few individual cloudcephosd* nodes maxed their 10G interfaces briefly at various stages
[09:35:23] godog: found ceph_healthcheck_slow_ops, looks promising
[09:36:00] 3) I didn't see tail drops anywhere, which suggests to me the network was running at its max, but it wasn't completely overloaded for any length of time
[09:36:12] topranks: do you know which cloudcephosd* nodes?
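On the slow ops breakdown question (08:57:15), a rough sketch of how the per-OSD detail can be pulled by hand, assuming admin access to the cluster and to the OSD hosts; osd.185 is only a placeholder here:

  # which OSDs are currently flagged for slow ops
  sudo ceph health detail | grep -i 'slow ops'
  # on the host running that OSD, dump the offending ops via the admin socket
  sudo ceph daemon osd.185 dump_ops_in_flight
  sudo ceph daemon osd.185 dump_historic_slow_ops   # on recent releases; otherwise dump_historic_ops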
because the ones that reported slow ops were not the new node, but two other ones
[09:37:02] 4) none of the 25G hosts got anywhere close to that, largest was 4Gb/sec transmit from cloudcephosd1049
[09:37:08] https://grafana-rw.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/network-device-interface-throughput?orgId=1&from=now-24h&to=now&timezone=utc&var-site=000000006&var-device=cloudsw1-e4-eqiad:9804&var-interface=et-0%2F0%2F16&var-interface=et-0%2F0%2F17&var-interface=et-0%2F0%2F18&var-interface=et-0%2F0%2F19&refresh=30s
[09:37:13] https://grafana-rw.wikimedia.org/d/d968a627-b6f6-47fc-9316-e058854a4945/network-device-interface-throughput?orgId=1&from=now-24h&to=now&timezone=utc&var-site=000000006&var-device=cloudsw1-f4-eqiad:9804&var-interface=et-0%2F0%2F10&var-interface=et-0%2F0%2F11&var-interface=et-0%2F0%2F8&var-interface=et-0%2F0%2F9&refresh=30s
[09:39:44] topranks: I see both queues (0 and 4) doing red drops, shouldn't QoS prioritize queue 0? (so I'd expect to start the red drops later?)
[09:39:52] (/me trying to understand QoS xd)
[09:40:36] on which host? 0 (normal) and 3 (low) should have the drops....
[09:40:57] it's not impossible in 4 (the heartbeats) but I didn't spot those myself
[09:41:07] the switch you linked to
[09:41:11] the QoS is based on a percentage scheduler, no queue is completely starved
[09:41:22] i.e. even if everything is maxed we do not drop all Ceph traffic
[09:41:30] Does this mean that for short bursts it dropped all the packets? https://usercontent.irccloud-cdn.com/file/Wrxym3y3/image.png
[09:41:51] (the purple there matches the red)
[09:42:03] oh, there's two axes
[09:42:11] so it's not the same scale, ack
[09:42:28] can you send me a link to it? the purple spikes are some drops, the legend only shows the transmitted packets so can't see what queue they correspond with
[09:42:37] drop levels are high though - that access is in kilopps
[09:42:40] *axis
[09:43:00] but it's a high bw link I guess and was totally maxed
[09:43:38] from that same dashboard https://grafana-rw.wikimedia.org/d/5p97dAASz/network-device-interface-queues-and-error-stats?orgId=1&from=now-24h&to=now&timezone=utc&var-site=000000006&var-device=cloudsw1-c8-eqiad:9804&var-interface=et-0%2F0%2F52&var-interface=et-0%2F0%2F53&refresh=30s
[09:43:47] a bit down
[09:44:26] yeah so no drops in queue 4 on that, just 0 and 3 which is what we would expect
[09:45:17] yep, it's a couple orders of magnitude different between tx packets and dropped ones 👍
[09:45:49] yes and they are quick bursts and then it drops off (which is the idea of red drops - signal to the sender to slow - so looks like that happened)
[09:48:03] these look to have been the busiest hosts during the time (note these stats are from the hosts themselves)
[09:48:07] https://usercontent.irccloud-cdn.com/file/k2FlNHhA/image.png
[09:49:48] 1040 was one of the two that reported slow ops
[09:49:49] anyway I can't say for sure the dropped packets did not cause some problems, but I'm kind of leaning towards thinking the core did not cause significant problems here
[09:50:54] and thus I'm not so sure if we had another 1Tb of bandwidth if things would have been different. the saturation to individual hosts - or even just high transmit rate to hosts that weren't completely maxed - may have been the cause
[09:51:12] dcaro: what was the other one with slow ops?
[09:51:58] cloudcephosd1015
[09:53:04] some small amount of drops network-side going to 1040, but very marginal (10-15 packets/sec dropped out of 100,000pps total)
[09:53:05] https://grafana-rw.wikimedia.org/d/5p97dAASz/network-device-interface-queues-and-error-stats?orgId=1&from=now-24h&to=now&timezone=utc&var-site=000000006&var-device=cloudsw1-e4-eqiad:9804&var-interface=xe-0%2F0%2F4&var-interface=xe-0%2F0%2F5&refresh=30s
[09:54:38] 1015 had more significant drops, but interestingly they were all due to microbursts
[09:54:39] https://grafana-rw.wikimedia.org/d/5p97dAASz/network-device-interface-queues-and-error-stats?orgId=1&from=now-24h&to=now&timezone=utc&var-site=000000006&var-device=cloudsw1-d5-eqiad:9804&var-interface=xe-0%2F0%2F15&var-interface=xe-0%2F0%2F16&refresh=30s
[09:55:02] by which I mean the average rate out to 1015 never got above about 5Gb/sec, yet we still see drops
[09:55:37] which means that we had instantaneous moments where the buffers were getting full, but the traffic was very bursty, it wasn't sustained for even 30 seconds
[09:56:42] would that be related to jumbo frames in any way? (could it be improved by not using jumbo?)
[09:57:34] no I don't think so
[09:57:54] ack, thanks for explaining the micro-bursts, I was going to ask :)
[09:59:07] the 25G hosts can contribute to it - obviously they can send a packet 2.5 times faster than the 10G hosts can receive it - so buffering is required. If one of those sends a steady burst at 25G a second (even if it's only for 3 seconds or whatever) they'd be buffered as the receiving switch serialises them at 10G out to the destination
[10:00:19] very quick bursts of traffic like that can often happen before the flow-control mechanism in TCP can react
[10:00:49] with a longer flow the sender will react to the red drops by reducing its tx rate
[10:02:10] hmm... from the panel we have for ceph performance, I see that cloudsw1-e4-eqiad has a burst of up to almost 40Gb/s https://usercontent.irccloud-cdn.com/file/BQIB2xSL/image.png
[10:02:32] from https://grafana-rw.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance?orgId=1&from=2025-08-25T17:20:37.030Z&to=2025-08-25T17:44:16.332Z&timezone=utc
[10:02:51] uses `rate(gnmi_interfaces_interface_state_counters_in_octets{instance=~"cloudsw1.*-e4-.*",interface_description=~"(Core|Transport): cloudsw.*", interface_name="et-0/0/55"}[5m]) * 8`
[10:02:57] yeah that's one of the two links we've seen max out
[10:03:29] I don't see it here though https://grafana-rw.wikimedia.org/d/5p97dAASz/network-device-interface-queues-and-error-stats?orgId=1&from=now-24h&to=now&timezone=utc&var-site=000000006&var-device=cloudsw1-e4-eqiad:9804&var-interface=et-0%2F0%2F55&refresh=30s
[10:03:56] https://grafana.wikimedia.org/goto/LRd-YYuNg?orgId=1
[10:04:48] ^^ the queue stats only show outbound traffic - so you see it on the stats for cloudsw1-c8-eqiad
[10:05:01] achj xd
[10:05:52] the stats from switch in E4 are here:
[10:05:52] https://grafana.wikimedia.org/goto/b0aPLLXHR?orgId=1
[10:06:27] the drops are shared between in/out?
[10:07:55] there are no drops inbound. the interface is 10G, 25G, 40G etc., everything it receives it will try to forward, nothing is dropped coming in.
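For the microburst discussion above, a small sketch of what could also be checked on the receiving host itself; the interface name eno1 is a placeholder and sar requires the sysstat package:

  # cumulative RX/TX drop counters as seen by the kernel
  ip -s link show eno1
  # NIC/driver level counters (names vary per driver)
  sudo ethtool -S eno1 | grep -iE 'drop|discard|miss'
  # 1-second samples, to catch short bursts that 5m Prometheus rates average away
  sar -n DEV 1 30 | grep eno1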
[10:08:16] If more traffic needs to go OUT a given port than it has the bandwidth for we get drops
[10:09:13] so - for example - for that link you see the drop-rate outbound on cloudsw1-c8 to work out what's happening
[10:09:31] we see the link maxed at 40G on the other side on cloudsw1-e4 of course, but we don't see drops on that one
[10:09:34] huh, that's different than linux then, there you can saturate the input buffer of the node right?, interesting
[10:10:05] gtg, but this is quite interesting, we should probably put all in a task at least
[10:10:08] yes it's an ASIC doing the forwarding with only one potential job - switch the frame out to another interface - which it can do at line rate
[10:10:55] thanks topranks for the explanations!
[10:10:57] on Linux the kernel has a whole heap of software buffers the CPU needs to deal with + whatever the actual application has to do to read frames off the buffer. Plus your bitcoin miner using up all the resources :P
[10:17:55] huh, why is tools-harbor-1 still sending emails? I thought it was shut off yesterday
[10:24:51] taavi: uptime 200 days, so I guess it was not :)
[10:25:00] I will shut it off now
[10:29:12] Hmmm... What did I shut off yesterday?
[11:06:36] hmm now the alert is about tools-harbor-1 being down :)
[11:07:16] which should clear as soon as Prometheus notices the instance is SHUTDOWN :-)
[11:07:50] ack
[11:09:12] hmm it's still Active, I did shutdown from the OS, but maybe I need to "shut off" in openstack
[11:11:05] done
[11:15:56] I shat it off from open stack, maybe the os ignored it
[11:16:37] Hahahah, shat XD, *shut
[11:31:43] lolz
[11:43:49] no alerts \o/!
[12:54:56] We never enabled virtualization on our hypervisors right?
[12:55:26] dcaro: do you mean the bios flag? It's definitely turned on on all cloudvirts
[12:55:40] without it kvm doesn't really work at all
[12:55:45] and the module/settings?
[12:55:58] (I want to try building the lima-kilo VM inside a VM)
[12:56:34] ah, you're talking about virt-on-virt -- I think that's not enabled. I can dig for the history of that if you want.
[13:03:55] * andrewbogott digging but not finding
[13:06:33] here it is: T267433 and T276208 (enabling virt on virt breaks live migration)
[13:06:34] T267433: Enable support for nested VMs - https://phabricator.wikimedia.org/T267433
[13:06:34] T276208: cloud: libvirt doesn't support live migration when using nested KVM - https://phabricator.wikimedia.org/T276208
[13:14:11] trying to do a pull + push using the robot account ends up in 500 error
[13:14:13] Error: writing manifest: uploading manifest latest to tools-harbor.wmcloud.org/tool-wm-lol/tool-wm-lol: received unexpected HTTP status: 500 Internal Server Error
[13:16:29] this does not say much :/
[13:16:32] 172.21.0.7 - - [26/Aug/2025:13:13:53 +0000] "POST /v2/tool-wm-lol/tool-wm-lol/blobs/uploads/ HTTP/1.1" 500 123 "" "go-containerregistry/v0.16.1"
[13:30:02] andrewbogott: do you know how to get the logs from s3/radosgw?
[13:31:26] journalctl -u ceph-radosgw@radosgw.service on a cloudcontrol should get you some of them at least
[13:31:32] ack
[13:39:13] how do I check openstack container quotas?
[13:40:02] dcaro: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Projects_lifecycle#swift_/_S3_/_radosgw_/_object_storage
[13:40:25] oh, quotas are directly on rados, not openstack
[13:41:08] yeah, all openstack does is discovery and auth
[13:42:18] yep that was it
[13:42:57] it returns 500 if you're over quota?
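On the 500-over-quota behaviour, a sketch of how the quota and current usage can be inspected with radosgw-admin on a cloudcontrol; the uid below is a placeholder guessed from the repo name in the error, the real mapping may differ:

  # configured quotas show up in the user_quota / bucket_quota sections of the output
  sudo radosgw-admin user info --uid=tool-wm-lol
  # current usage for that user
  sudo radosgw-admin user stats --uid=tool-wm-lol --sync-stats
  # per-bucket usage
  sudo radosgw-admin bucket stats --bucket=tool-wm-lol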
rude
[13:43:59] yep, I think there might be a bug around somewhere handling the out of quota
[13:45:05] we should surface that quota somewhere probably, as users will not be able to know what's the issue
[13:48:17] do we have a task for that? (exposing their rados-gw usage to users)
[13:48:55] I don't think so
[13:54:12] are we even gathering that data?
[13:54:20] (/me looks in grafana)
[13:55:28] I don't know that we're tracking rados other than the bulk usage of the ceph pool
[13:58:58] yep, we might not be getting rados stats
[14:52:51] @dcaro @andrewbogott seed server has been imaged for T348643 is there anything else with this ticket once it's looked over? or will we be able to close this ticket?
[15:06:59] T348643
[15:07:02] stashbot?
[15:07:02] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[15:07:53] dcaro: that task is restricted
[15:08:03] aaahhh, xd
[15:16:53] * andrewbogott wonders why the nic names are so different on 1052 vs 1051
[15:24:12] ah, for some reason that one has an intel network card instead of broadcom
[15:25:17] hmm... is it using the motherboard card or something?
[15:27:11] I looked at the delivery manifest, it has 'Intel E810-XXVDA4 Quad Port 10/25GbE SFP28
[15:27:11] Adapter, OCP NIC 3.0 '
[15:28:10] so I think it's all fine, just different hardware = different interface names in the OS
[15:28:38] ack
[15:29:00] it'll be a while before we can put that host into service but it looks good for the moment jclark-ctr
[15:29:02] yep, that looks like a good card
[15:29:29] I'm still unclear if the jumbo frame enabling needs to happen on the switch or on the host and how to tell
[15:31:32] it needs both
[15:31:55] the host does it with puppet, setting it in the /etc/system/network/* files (iirc), the switch side is in netbox iirc
[15:33:28] ok, so until it's puppetized as an osd node I shouldn't expect it to pass the ping tests.
[15:34:01] there are a bunch of alert tasks related to the ceph issues, I think they can be all resolved? https://phabricator.wikimedia.org/project/board/2773/?filter=UH4VfZCsL3xa
[15:35:04] this one is interesting https://phabricator.wikimedia.org/T402480
[15:35:48] is conntrack in the path of osd/ceph operation? if so that might have been the trigger of the slow operations
[15:37:29] Doesn't any tcp connection get added to conntrack? So I'd say yes it's in the path
[15:38:30] we could just turn off conntrack on osd nodes entirely...
[15:39:01] when conntrack is full, it stops allowing new connections right?
[15:39:50] yes, I think so
[15:39:54] so that will definitely break things
[15:40:43] it's not like the subsystem that does the connection tracking though, it's just the cli right? as in, we can't disable nftables keeping track of the connections right?
[15:41:21] (I might be misunderstanding how all that works though)
[15:41:35] I think we can
[15:41:51] hmm, okok, then it's a good candidate to remove from there, what's the benefit of it?
[15:42:07] but we can also just make the table bigger
[15:42:25] grafana seems to confirm that connections are flatlining at the limit https://usercontent.irccloud-cdn.com/file/Vqlu7ozg/Screenshot%202025-08-26%20at%2017.41.31.png
[15:42:42] I also remember that we set a higher conntrack limit for cloudvirts, should we do it for ceph as well?
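Going back to the jumbo frame question above (15:29:29), one way to tell from the host side whether jumbo frames work end to end, assuming a 9000-byte MTU on the ceph cluster network; interface and target names here are placeholders:

  # MTU currently configured on the host interface
  ip link show eno1 | grep mtu
  # 8972 payload + 28 bytes of IP/ICMP headers = 9000; -M do forbids fragmentation,
  # so the ping only succeeds if every hop passes jumbo frames
  ping -M do -s 8972 -c 3 cloudcephosd1052.eqiad.wmnet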
[15:43:48] here's an example for cloudvirt: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124821
[15:45:13] we should increase those yep xd
[15:45:41] and there's this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/675760
[15:45:47] 'disable conntrackd'
[15:45:48] there's like a couple orders of magnitude between cloudvirts and ceph osds
[15:46:32] it's just OS defaults on ceph hosts now?
[15:46:36] yep there's an override in modules/profile/manifests/openstack/base/nova/compute/service.pp
[15:46:48] also for cloudgw in modules/profile/manifests/wmcs/cloudgw.pp
[15:47:31] what is the current limit on an osd node?
[15:47:41] (or, alternate question, how can I see it?)
[15:47:54] cat /proc/sys/net/nf_conntrack_max
[15:48:47] 2^18 currently
[15:49:17] so if migrating 2 nodes isn't too much and 10 nodes is too much...
[15:49:27] yeah, I guess we want 2^22 just like on the cloudvirts
[15:49:46] what's the use of conntrackd exactly?
[15:50:08] the home page says that it's meant to be used for high availability in firewalls
[15:50:47] That's probably a topranks question
[15:51:09] any reason to think we should increase it on mons as well?
[15:51:39] any reason not to?
[15:51:56] (main reason to increase is that there's a potential to reach the limit xd)
[15:53:35] It uses more RAM, I don't know why not otherwise
[15:54:17] waiting on CI for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1182175
[15:54:49] the osds are quite ram intensive, we might want to keep an eye on that
[15:56:23] ping times look good on cloudcephosd1052 so I think that host is all set
[15:57:53] andrewbogott: I spotted some firmware errors on cloudcephosd1052 https://phabricator.wikimedia.org/T402938
[15:59:02] hm, we might not be equipped to update the intel card
[15:59:13] the lvm complaints are the same red herring as before right?
[16:01:14] yes
[16:01:29] they will not reoccur
[16:01:40] only when first installing the lvm2 package
[16:01:41] ok, I'm pushing that back to dcops to see if they have thoughts about fw version for the network card
[16:01:47] sounds good, thanks
[16:01:49] I'm just assuming it's the network card fw
[16:02:07] yeah I was assuming that too, but I hoped you knew more :)
[16:03:35] from what I'm reading, I suspect that conntrackd (the daemon) is just exposing the nf-tables stats, and net.netfilter.nf_conntrack_max applies to nftables, that have that connection tracking internally, so disabling the daemon might not disable the connection tracking (topranks maybe can correct me here xd)
[16:04:15] hm, ok
[16:04:16] hmm, there's a kernel module `nf_conntrack`
[16:04:26] so maybe that's what can be disabled
[16:04:29] so then raising the limit to something comically high is the best option
[16:04:52] (maybe that's what that puppet patch does also in the end xd)
[16:05:15] +1 for raising, we do it on several other hosts and it has not given problems
[16:05:16] disabling nf_conntrack entirely would be a very bad idea as that would stop return traffic from various outbound connections
[16:05:41] yes....
unless you put specific rules in to allow specific source ports etc
[16:05:59] that sounds bad then :)
[16:06:01] the other option in addition to raising it is to set the notrack => true parameter on the specific high-volume firewall rules where that is not needed
[16:06:44] cmooney@cloudcephosd1040:~$ sudo iptables -L INPUT -v --line -n | grep ESTAB
[16:06:44] 28 20M 149G ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 state RELATED,ESTABLISHED
[16:06:46] the other thing my patch sets for copy/paste reasons is net.netfilter.nf_conntrack_tcp_timeout_time_wait
[16:07:08] ^^^ most linux systems will have something like this, which allows replies to connections it initiates back in, while still not blindly allowing everything in
[16:07:28] the question is why are you exceeding conntracks though, I think it defaults to 256k ?
[16:08:26] topranks: yes that's the current value. I think it spiked while shuffling data between new osds
[16:08:59] there are cases where it's legitimate to up the value, but that default is _very_ high so typically nothing should hit it
[16:09:18] it seemed to happen when shuffling 10 osds but not when shuffling 2
[16:09:40] what sometimes happens is a system is unable to make a connection, so it tries to open another one, that fails, it tries to open another one, that fails, it just cycles like that out of control and jumps up to max the limit
[16:10:08] that fits with what we were seeing, but that means conntrack could have been either cause or symptom :/
[16:10:30] the thing to do is work out what is going on and why it's filling the table like that, it's so far above the baseline I don't believe the answer is going to be a higher limit, likely the feedback cycle from hell that's starting 256k connections will do 512k or a million the same way
[16:11:06] unfortunately reproducing the issue means reproducing an outage
[16:12:04] when did you see that spike?
[16:12:21] I don't see it checking the dashboard we have for that, cloudcephosd1042 right?
[16:12:22] https://grafana-rw.wikimedia.org/d/d96b72b0-22bb-4906-b7d1-b825a91197ab/server-interface-throughput-detailed?orgId=1&from=now-24h&to=now&timezone=utc&var-site=000000006&var-host=cloudcephosd1015:9100&var-host=cloudcephosd1040:9100&var-host=cloudcephosd1042:9100
[16:12:27] https://phabricator.wikimedia.org/T402480
[16:13:30] 2025-08-21
[16:13:47] topranks: https://usercontent.irccloud-cdn.com/file/Vqlu7ozg/Screenshot(1
[16:14:12] yeah I see it there alright
[16:14:15] https://grafana.wikimedia.org/goto/fFiwdPXNR
[16:15:42] ^^ for whatever reason that doesn't work. but anyway yeah whatever happened last week we need to work out what occurred
[16:16:03] the link works for me :)
[16:16:55] I don't think upping the limit or messing with the default conntrack timers is what is needed, it looks to me like one of those scenarios whereby the device gets into a feedback loop of creating new connections, and hits this limit. But in fact the limit is acting like a circuit breaker to contain the problem
[16:20:19] that is good to know!
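A quick sketch of checking the conntrack table on an OSD host, and what a temporary bump would look like; the 2^22 value mirrors the cloudvirt override discussed above, and the permanent change would go through the puppet sysctl profile (the gerrit patch above) rather than being set by hand:

  # current usage vs the configured limit
  sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
  # temporary bump to 2^22, matching the cloudvirt value; lost on reboot, puppet should own the real setting
  sudo sysctl -w net.netfilter.nf_conntrack_max=4194304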
Thank you topranks, I'm adding that note to the patch
[16:29:01] from ceph docs
[16:29:03] https://www.irccloud.com/pastebin/P42wvs31/
[16:29:19] https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-osd/
[16:30:55] it also warns of maximum thread count being reached when there's many osds in a node
[16:37:19] there were a few around the limit
[16:37:23] https://usercontent.irccloud-cdn.com/file/nxtEs87I/image.png
[16:38:09] wait no, that's 10x lower than the limit xd (it was not showing all)
[16:38:25] this is all https://usercontent.irccloud-cdn.com/file/Ig3nXCxY/image.png
[16:38:46] the big peak is 1042
[16:39:22] probably it's the sum of all the smaller bumps, all connecting to 1042
[16:42:34] mons are around 4k the most, so those should be ok
[16:47:14] oh, I don't see that peak anymore? :/
[16:52:36] isn't that bump like 5 days ago? (21st Aug)
[16:53:41] dcaro: yep 21st aug 03:10 UTC
[16:53:52] so it's not related to yesterday's issue then right?
[16:54:06] I don't think so
[16:54:16] ack, I got confused
[16:54:22] I was also confused :)
[16:54:29] then it's probably not the cause for the slow ops
[16:54:36] (the ones yesterday at least)
[16:54:55] it triggered an alert on 21st aug, that we just found today looking at the "needs triage" tasks
[16:54:58] maybe that was one of the jumbo frames being blocked
[16:55:19] then I think we all got confused because we thought it was related but didn't check if the dates aligned
[16:56:02] xd
[16:56:17] I definitely didn't check the dates
[16:56:26] so we're back to having no real explanation :(
[16:56:42] :sadtrombone:
[16:57:44] hahaha, we have many non-working explanations :)
[17:13:48] yes to confirm no rise in conntracks on the hosts you saw slow ops on yesterday
[17:22:26] * dhinus off
[17:24:06] * dcaro off