[07:41:33] netops, Infrastructure-Foundations, Traffic: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11519843 (ayounsi) As it's not a timeout, but a TTL issue, that might match some transport link "event" causing this brief alert. VMs are now 1 extra ro...
[08:01:29] netops, Infrastructure-Foundations, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11519863 (JAllemandou) >>! In T414460#11518808, @CDanis wrote: > > The spike a few days after the start of the month is interest...
[08:52:21] netbox, netops, Infrastructure-Foundations: Automatically run Capirca Netbox script regularly - https://phabricator.wikimedia.org/T361549#11520056 (ayounsi) Thanks to the latest patches, it's now possible to see if there are pending changes to be committed to the Capirca file. Just run the script wit...
[12:06:26] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520728 (cmooney) @VRiley-WMF I'll ping you on irc but we want to go ahead and replace the DAC on //d...
[12:08:04] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520733 (cmooney) Hmm so I was going to see if there was any difference if I did a trace to the ceph...
[12:19:09] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520751 (cmooney) Also @VRiley-WMF it seems this is actually a 1G RJ45 link. So let's swap the coppe...
[12:55:32] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520980 (cmooney) Hmm so with the node un-cordoned the loss has not returned either, well one drop at the first hop but it seems insigni...
[13:18:10] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521085 (cmooney) >>! In T414460#11518808, @CDanis wrote: > FIN_WAIT_1 is //not// supposed to stick around for longer than a minute or t...
[14:09:55] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521279 (cmooney) >>! In T414460#11521085, @cmooney wrote: > however surely it should try to resend the FIN, and if this state persists...
[14:33:09] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521367 (CDanis) >>! In T414460#11521085, @cmooney wrote: > The k8s host sent a FIN to the remote side but due to the packet-loss issue...
[16:00:57] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521827 (cmooney) The SFP module in port 14 of lsw1-c5-eqiad has been swapped out now. So we can observe over the next...
[16:05:20] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11521900 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cf1deaa2-45c3-45e8-bdad-1303b0075f87) set by pt1979@cumin2002 for 2:00:00 on...
[16:35:54] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522109 (cmooney) Ok currently seeing no loss (though that was the case when we were cordoned before the swap). ` cmoon...
[16:49:13] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522186 (ops-monitoring-bot) Host dse-k8s-worker1013.eqiad.wmnet rebooted by brouberol@cumin1003 with reason: Getting a...
[16:50:10] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522190 (VRiley-WMF) Happy to help with this. Let us know if there is anything else we can help with.
[17:31:05] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522388 (akosiaris)
[17:36:13] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522428 (cmooney) Thanks @VRiley. Happy to say we aren't seeing any loss as of yet after the node was uncordoned: ` cm...
[18:17:54] netops, Infrastructure-Foundations, Traffic: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11522518 (ssingh) >>! In T414473#11519843, @ayounsi wrote: > As it's not a timeout, but a TTL issue, that might match some transport link "event" causin...
[19:02:02] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11522682 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8cc58471-31d6-4e79-ae14-124cd9a6b684) set by pt1979@cumin2002 for 1:00:00 on...
[19:22:39] netops, Infrastructure-Foundations, Traffic: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11522769 (ssingh) This time on physical hosts: ` 14:20:36 <+icinga-wm> PROBLEM - Host cp7016 is DOWN: CRITICAL - Time to live exceeded (10.140.1.11) 14...