[07:41:33] 10netops, 06Traffic, 06Infrastructure-Foundations: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11519843 (10ayounsi) As it's not a timeout, but a TTL issue, that might match some transport link "event" causing this brief alert. VMs are now 1 extra ro...
[08:01:29] 10netops, 06Infrastructure-Foundations, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11519863 (10JAllemandou) >>! In T414460#11518808, @CDanis wrote: > > The spike a few days after the start of the month is interest...
[08:52:21] 10netops, 06Infrastructure-Foundations, 10netbox: Automatically run Capirca Netbox script regularly - https://phabricator.wikimedia.org/T361549#11520056 (10ayounsi) Thanks to the latest patches, it's now possible to see if there are pending changes to be committed to the Capirca file. Just run the script wit...
[09:05:48] Hi! I'm working on making the REST gateway return a Retry-After header when requests get rate limited. As an aside, I was going to add a default Retry-After of 60 seconds for 503 and 504 responses, if the backend doesn't specify a value. Does that sound good? Or would it cause problems?
[09:05:48] My thinking was that it would be nice to consistently return Retry-After with all 429, 503, and 504 responses from all APIs.
[09:11:14] I think a Retry-After of 60s for 503 and 504 is ok; if the backend doesn't specify it correctly, it's already better than the existing situation!
[09:51:20] 06Traffic, 06MW-Interfaces-Team, 07Epic, 05FY2025-26 KR 5.1, and 3 others: rest gateway: implement cost-based rate limits - https://phabricator.wikimedia.org/T412586#11520339 (10Clement_Goubert) >>! In T412586#11518896, @Scott_French wrote: > @Clement_Goubert @daniel - If you could provide more detail on s...
[11:05:48] fabfur: thanks!
[11:41:49] duesen: afaik Retry-After isn't used for 504s
[11:43:27] see https://www.rfc-editor.org/rfc/rfc9110#status.503 vs https://www.rfc-editor.org/rfc/rfc9110#status.504
[12:06:26] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520728 (10cmooney) @VRiley-WMF I'll ping you on irc but we want to go ahead and replace the DAC on //d...
[12:08:04] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520733 (10cmooney) Hmm so I was going to see if there was any difference if I did a trace to the ceph...
[12:19:09] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520751 (10cmooney) Also @VRiley-WMF it seems this is actually a 1G RJ45 link. So let's swap the coppe...
[12:33:06] vgutierrez: I can remove it, but it does seem useful to me. A 504 typically self-corrects after some time, but clients shouldn't just hammer us until it does... I find it curious that retry-after isn't specified for 504, since retrying is the only way to resolve a 504 situation...
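[Editor's note] For context on vgutierrez's 11:41 point: RFC 9110 only calls out Retry-After for 503 (and 3xx) responses, and RFC 6585 allows it on 429; the 504 section is silent on it. On the wire, the delta-seconds form of the proposed 60-second default would look like this (hypothetical responses for illustration, not actual gateway output):

```
HTTP/1.1 429 Too Many Requests
Retry-After: 60

HTTP/1.1 503 Service Unavailable
Retry-After: 60
```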
[12:36:38] I couldn't find any useful discussions on that, only hints that retry-after is intended for cases where it is possible to predict when a retry would be successful, and that clients should use incremental back-off for unpredictable transient errors...
[12:37:08] Is that your thinking as well?
[12:55:32] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520980 (10cmooney) Hmm so with the node un-cordoned the loss has not returned either, well one drop at the first hop but it seems insigni...
[13:18:10] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521085 (10cmooney) >>! In T414460#11518808, @CDanis wrote: > FIN_WAIT_1 is //not// supposed to stick around for longer than a minute or t...
[13:29:32] 06Traffic: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318#11521110 (10Vgutierrez)
[14:09:55] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521279 (10cmooney) >>! In T414460#11521085, @cmooney wrote: > however surely it should try to resend the FIN, and if this state persists...
[14:26:21] 06Traffic: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318#11521320 (10Vgutierrez)
[14:32:29] 06Traffic, 10MediaWiki-Debug-Logger, 06SRE, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521361 (10Vgutierrez) the headers described on https://wikitech.wikimedia.org/wiki/CDN/Backe...
[14:33:09] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521367 (10CDanis) >>! In T414460#11521085, @cmooney wrote: > The k8s host sent a FIN to the remote side but due to the packet-loss issue...
[15:16:54] 06Traffic: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318#11521596 (10Vgutierrez)
[15:34:50] 06Traffic, 07Essential-Work, 05MW-1.46-notes (1.46.0-wmf.5; 2025-12-02), 13Patch-For-Review, 06Test Kitchen (Test Kitchen (Experiment Platform Sprint 18)): Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570#11521665 (10Sfaci) @ss...
[15:59:01] 06Traffic, 10MediaWiki-Debug-Logger, 06SRE, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521815 (10Tgr) a:03Tgr
[15:59:54] duesen: clients shouldn't hammer us, but if clients don't expect a Retry-After header in a 504 they won't use it and you'd be bloating the response for no reason
[16:00:25] 06Traffic, 10MediaWiki-Debug-Logger, 06SRE, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521825 (10Tgr) We should also update some of the dashboards (at least the login one) with so...
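[Editor's note] Picking up duesen's 12:36 point: the usual client-side pattern for unpredictable transient errors is exponential back-off, with Retry-After taking precedence when the server supplies one. A minimal sketch (hypothetical endpoint; not a recommendation for any particular WMF client):

```sh
#!/bin/sh
# Retry transient errors with exponential back-off, honouring Retry-After
# (delta-seconds form) when the server sends it.
url="https://api.example.org/endpoint"   # hypothetical endpoint
delay=1
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /dev/null -D /tmp/headers -w '%{http_code}' "$url")
  case "$status" in
    429|503|504) ;;   # transient: fall through to the retry logic below
    *) break ;;       # success or non-retryable error: stop
  esac
  # Prefer the server's Retry-After over our own computed delay.
  ra=$(awk 'tolower($1) == "retry-after:" { print $2 }' /tmp/headers | tr -d '\r')
  sleep "${ra:-$delay}"
  delay=$((delay * 2))   # 1s, 2s, 4s, 8s, ...
done
```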
[16:00:57] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521827 (10cmooney) The SFP module in port 14 of lsw1-c5-eqiad has been swapped out now. So we can observe over the next...
[16:02:56] 06Traffic, 10MediaWiki-Debug-Logger, 06SRE, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521866 (10Tgr) >>! In T412396#11521361, @Vgutierrez wrote: > the headers described on https:...
[16:04:32] 06Traffic, 06MW-Interfaces-Team, 07Epic, 05FY2025-26 KR 5.1, and 3 others: rest gateway: implement cost-based rate limits - https://phabricator.wikimedia.org/T412586#11521883 (10Scott_French) p:05Triage→03Low
[16:05:20] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11521900 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cf1deaa2-45c3-45e8-bdad-1303b0075f87) set by pt1979@cumin2002 for 2:00:00 on...
[16:18:28] 06Traffic: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318#11522035 (10Vgutierrez)
[16:35:54] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522109 (10cmooney) Ok currently seeing no loss (though that was the case when we were cordoned before the swap). ` cmoon...
[16:49:13] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522186 (10ops-monitoring-bot) Host dse-k8s-worker1013.eqiad.wmnet rebooted by brouberol@cumin1003 with reason: Getting a...
[16:50:10] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522190 (10VRiley-WMF) Happy to help with this. Let us know if there is anything else we can help with.
[17:31:05] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522388 (10akosiaris)
[17:36:13] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522428 (10cmooney) Thanks @VRiley. Happy to say we aren't seeing any loss as of yet after the node was uncordoned: ` cm...
[18:17:54] 10netops, 06Traffic, 06Infrastructure-Foundations: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11522518 (10ssingh) >>! In T414473#11519843, @ayounsi wrote: > As it's not a timeout, but a TTL issue, that might match some transport link "event" causin...
[18:28:03] 06Traffic, 06SRE, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11522551 (10ssingh) @cmooney: Any picks for your favourite v6 address for `ns1`? I was thinking of allocating `2620:0:860:ed1a::4/128` under LVS service IPs `2620:0:860:ed1a::/64`, since unfortuna...
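[Editor's note] Once an address like the `2620:0:860:ed1a::4` proposed in the 18:28 task update is allocated and serving, verifying AuthDNS over v6 is a one-liner; a sketch using the address from the task (not yet in service at the time of this log):

```sh
# Query ns1 directly over IPv6 (proposed address from T81605; hypothetical until allocated).
dig +short AAAA en.wikipedia.org @2620:0:860:ed1a::4
```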
[19:02:02] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11522682 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8cc58471-31d6-4e79-ae14-124cd9a6b684) set by pt1979@cumin2002 for 1:00:00 on...
[19:20:18] 06Traffic, 06SRE, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11522747 (10taavi) 05Stalled→03Open
[19:22:39] 10netops, 06Traffic, 06Infrastructure-Foundations: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11522769 (10ssingh) This time on physical hosts: ` 14:20:36 <+icinga-wm> PROBLEM - Host cp7016 is DOWN: CRITICAL - Time to live exceeded (10.140.1.11) 14...
[19:54:25] could I get a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1226932 from sukhe or someone? :D
[19:59:41] cdanis: looking
[20:05:13] thanks! is that kind of change -- adding addl realserver IPs to pooled-for-other-services nodes -- spooky to roll out?
[20:05:53] (at a minimum I was thinking disable puppet on A:cp-text and then roll forward one node in magru by hand)
[20:07:29] cdanis: I think we have done a few for other nodes, but not the cp ones. in that regard, the cp ones tend to be more scary anyway but I think it should be OK?
[20:07:44] yeah, disabling puppet and a quick test on one host should tell us if things are not right
[20:07:49] coolcool
[20:08:02] and yeah if that fails or obviously messes up, I'll just roll back the first patch
[20:08:04] thanks
[20:08:31] I guess we could do one more thing but I haven't thought it through
[20:08:37] 07HTTPS, 06Traffic, 06SRE: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#11522895 (10Izno)
[20:08:38] we could simply add the override for magru itself for cp-text
[20:08:52] though I don't think realserver::pools does a merge so that probably won't work on the hash hmm
[20:08:55] ok, never mind
[20:09:53] what I was saying was that if "profile::lvs::realserver::pools" had merge => hash set in the lookup, we could set an override just for magru and add gerrit-https there
[20:10:00] that would then give us the existing ones + gerrit-https, just for magru
[20:10:03] but it's fine
[20:10:45] yeah you can override merge settings globally I think
[20:10:53] if you do it from the appropriate level of hiera resolution
[20:13:54] yep. but that is if we really want to restrict this but since we are doing magru + global, let's just go ahead
[20:15:39] v.g. might disagree so if you want to wait for him, that's also fine (I trust him more than I trust myself anyway)
[20:17:07] 🤠
[20:21:37] swfrench-wmf: ChrisDobbins901_: about to mess with just one cp host in magru, and then potentially all of them, not expecting trouble just fyi
[20:21:55] * swfrench-wmf thumbs up
[20:28:40] cool, that was totally hitless afaict
[20:30:16] https://puppetboard.wikimedia.org/report/cp7001.magru.wmnet/74b629f35c05395b0fcd8bf21e5727e4ca0b4891
[20:30:20] looks good I think
[20:30:37] yeah I was watching a bunch of stuff live
[20:30:51] the s/text-https/gerrit-https thing is throwing me off but we will learn to live with it
[20:31:32] yeah ... I think that was unavoidable with this approach unfortunately
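[Editor's note] For the record, the hash-merge idea from 20:09 would look roughly like this in hiera; a sketch with hypothetical file placement, since the real hierarchy and the full key contents aren't shown in this log:

```yaml
# hieradata/common.yaml (hypothetical location): make the key hash-merge
# across hierarchy levels instead of taking only the highest-priority value.
lookup_options:
  profile::lvs::realserver::pools:
    merge: hash

# hieradata/magru.yaml (hypothetical site-level override): adds gerrit-https
# on top of the pools defined at lower-priority levels, for magru only.
profile::lvs::realserver::pools:
  gerrit-https:
    services:
      - gerrit
```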
[20:31:38] FIRING: [2x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.225:443 @ cp7001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[20:31:43] hmm
[20:33:13] that might just be a race condition?
[20:33:20] that's probably because of the switch but looks good otherwise
[20:35:03] cdanis: I guess if we really really want to be sure, maybe you can try it on one more host (the one you are hitting in eqiad) before rolling out
[20:36:17] stepping out for school pickup but will be back in 10ish.
[20:37:18] I'm pretty confident, aside from the MSS issue
[20:41:10] try it on one more host and see if it happens. in some ways, that might just be from the new gerrit lb connection being established
[20:43:09] if not then yeah we can rollback and do it in the morning when vg is around. but I think that was a false positive, much like we see during a reboot
[20:43:12] > If for some reason the kernel is unable to answer to the initial SYN packet or it answers with an RST packet, this alert will trigger a false positive.
[20:44:14] yeah
[20:44:22] I continued with all of cp-text in magru
[20:44:53] going to enable-puppet on cp-text globally soon
[20:45:15] and then going to mess with LVS in magru
[20:46:28] enabling puppet now, going to just let it roll because it did look hitless
[20:46:38] FIRING: [12x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.225:443 @ cp7001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[20:46:49] ah
[20:48:36] what's going on?
[20:48:46] vgutierrez: I'm rolling back a patch :3
[20:48:51] that's no good
[20:48:58] vgutierrez: it's only for the new gerrit VIP
[20:49:18] ohh
[20:49:31] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1226932
[20:50:57] Loaded: loaded (/lib/systemd/system/tcp-mss-clamper.service; enabled; vendor preset: enabled)
[20:50:58] Active: active (running) since Tue 2025-10-14 13:40:57 UTC; 3 months 0 days ago
[20:51:35] I guess we're missing a refresh dependency in puppet somewhere
[20:51:38] FIRING: [16x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.225:443 @ cp7001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[20:51:42] yeah... tcp mss clamper needs to be manually restarted
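[Editor's note] The systemd status paste at 20:50 is the tell: the unit has been running since October, so it cannot know about a VIP added today. A quick way to spot this staleness (a sketch; 195.200.68.225 is the new VIP from the alerts above, and the authoritative check is the argv paste at 20:51:43 below):

```sh
# Is the new VIP in the running clamper's argument list?
ps -o args= -C tcp-mss-clamper | grep -q '195\.200\.68\.225' \
  && echo 'clamper already covers the new VIP' \
  || echo 'stale argv: tcp-mss-clamper needs a restart to clamp the new VIP'
```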
[20:51:43] └─1968 /usr/bin/tcp-mss-clamper --ipv4-mss 1440 --ipv6-mss 1400 -p :2200 -s 195.200.68.224:443,195.200.68.224:80,[2a02:ec80:700:ed1a::1]:443,[2a02:ec80:700:ed1a::1]:80 -i eno12399np0,lo
[20:51:45] before
[20:51:51] and that requires a depool
[20:51:58] that's by design
[20:52:11] oh
[20:52:39] well that's a big TIL
[20:52:43] same
[20:52:51] we should document it if not already
[20:52:56] or the limitation, given adding vips to realservers doesn't happen too often
[20:53:00] been a while since we did this
[20:53:06] yeah I guess
[20:53:33] why does it require a depool
[20:53:54] because uh I had already done it to all magru just before you said that
[20:54:08] cdanis: let me know if you need an extra pair of hands. I will be online soon
[20:54:11] thanks <3
[20:54:53] because restarting tcp mss clamper will stop mss clamping for a few milliseconds
[20:54:59] oh
[20:55:07] and connections accepted during that time won't be clamped
[20:55:28] are you sure that's actually less impactful than a depool?
[20:55:35] er, more
[20:56:38] FIRING: [24x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.225:443 @ cp6009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[20:57:25] re-disabled puppet on A:cp-text
[20:57:30] hmmm not really
[20:57:40] it should be really fast
[20:58:24] I did it percussively on magru
[20:58:34] and there's no obvious impact 😅
[20:59:09] if katran drops connections we know why :)
[20:59:33] I would never blame you for my own yeehaw
[21:00:23] ok I'm going to continue then
[21:00:25] thanks vg <3
[21:01:38] FIRING: [40x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.225:443 @ cp5023 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[21:03:01] np
[21:06:38] FIRING: [46x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.225:443 @ cp5020 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[21:09:34] cdanis: here now. can I help? sorry :D
[21:10:23] sukhe: seems like restarting tcp-mss-clamper doesn't strictly require a depool
[21:10:55] Jan 14 21:10:12 lvs7003 libericad[1528]: time=2026-01-14T21:10:12.566Z level=INFO msg="control plane is now aware of the current status of all realservers" service=gerrit-httpslb6_443
[21:10:57] Jan 14 21:10:12 lvs7003 libericad[1528]: time=2026-01-14T21:10:12.566Z level=INFO msg="new healthcheck result received" service=gerrit-httpslb_443 hostname=cp7002.magru.wmnet address=10.140.1.4 healthcheck_name=HTTPCheck healthcheck_id=2911427075 healthcheck_result=true
[21:10:59] Jan 14 21:10:12 lvs7003 libericad[1528]: time=2026-01-14T21:10:12.566Z level=INFO msg="control plane is now aware of the current status of all realservers" service=gerrit-httpslb_443
[21:11:01] very cool :D
[21:11:38] FIRING: [78x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.225:443 @ cp5020 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[21:12:09] those should clear soon
[21:12:39] sukhe: I think I will do https://gerrit.wikimedia.org/r/c/operations/puppet/+/1215398 and then stop today?
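[Editor's note] Given the "by design" constraint above, the careful version of the rollout is the standard drain-restart-repool dance on each realserver; a sketch (depool/pool being the usual conftool wrapper scripts on WMF realservers), as opposed to the "percussive" in-place restart that turned out to be harmless here:

```sh
# Per-realserver: drain traffic so the brief unclamped window can't catch
# freshly accepted connections, then restart and repool.
depool
sudo systemctl restart tcp-mss-clamper   # picks up the new -s VIP list
pool
```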
[21:13:22] if you are ready :D
[21:16:38] FIRING: [78x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.225:443 @ cp5017 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[21:19:36] ok, I'll stop for now, we can do drmrs/the world tomorrow
[21:19:40] thanks for all the help!
[21:20:01] haha, you did most of it. sorry about not knowing about the restart part
[21:20:12] all good on the reloads in magru?
[21:20:17] yep!
[21:20:19] nice
[21:20:21] on lvs7003 and 7001
[21:20:25] yep sounds good
[21:20:26] that part was very easy
[21:20:58] yep, it's very nice
[21:21:38] RESOLVED: [84x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.225:443 @ cp5017 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[21:23:14] no pending alerts, we are all clear
[21:23:33] and -- it works!
[21:24:08] well hm. port 443 works
[21:24:20] that is all that should work though right
[21:24:31] I thought the tcp-proxies were ready to accept traffic
[21:24:51] I'll check, but not a concern rn
[21:25:40] I vaguely recall that something was left but let me check the task
[21:25:41] yeah
[21:26:21] nope, https://phabricator.wikimedia.org/T408064
[21:26:36] the magru proxies certainly should be up
[21:27:26] there was an issue with their provisioning, which is what I had in mind, but it seems like per the above, everything was done
[21:27:34] `liberica cp services` shows empty pools for the gerrit 29418 services
[21:27:56] so I'm missing something on the puppet side
[21:30:05] indeed
[21:30:26] class: high-traffic1
[21:30:26] conftool:
[21:30:26] cluster: tcp-proxy
[21:30:26] service: gerrit
[21:31:01] I thought service there was supposed to match
[21:31:07] profile::lvs::realserver::pools:
[21:31:09] gerrit-ssh:
[21:31:11] services:
[21:31:13] - gerrit
[21:32:05] and also uh
[21:32:08] conftool-data/node/magru.yaml
[21:32:09] 26: tcp-proxy:
[21:32:11] 27: tcp-proxy7001.magru.wmnet: [gerrit]
[21:32:13] 28: tcp-proxy7002.magru.wmnet: [gerrit]
[21:32:19] er also
[21:32:23] these are in service_setup?
[21:32:28] lvs_setup ?
[21:32:56] yeah sorry, updated
[21:33:01] so that matches at least
[21:33:41] this is in the "look at it from scratch" stage now
[21:33:51] basically start from https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service and see what we are missing
[21:34:05] [except the instructions are for eqiad/codfw, but besides that]
[21:35:23] lol https://config-master.wikimedia.org/pybal/magru/gerrit-ssh
[21:35:35] hahaha
[21:36:49] sukhe@puppetserver1001:~$ sudo confctl select 'cluster=tcp-proxy' get
[21:36:57] might as well enable it all
[21:37:08] FIRING: [32x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.225:443 @ cp5024 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[21:37:15] gerrit-sshlb_29418:
[21:37:17] 10.140.2.10 1 healthy: false | pooled: depool-blocked
[21:37:21] hey what really
[21:38:33] 2a02:ec80:700:103:10:140:2:10 1 healthy: false | pooled: depool-blocked
[21:38:46] probably meaning that the depool threshold is keeping it alive?
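[Editor's note] For readers replaying this: `depool-blocked` means the control plane wants to depool a backend that is failing healthchecks, but the configured depool threshold won't let it (confirmed by the libericad WARN at 21:44 below). Inspecting state, and pooling a backend once it is actually healthy, is done with confctl; a sketch using tags matching the conftool-data above (the set/pooled action is illustrative, use with care):

```sh
# Show the gerrit tcp-proxy objects and their state:
sudo confctl select 'cluster=tcp-proxy,service=gerrit' get
# Pool one specific backend (hypothetical action; only once it's healthy):
sudo confctl select 'name=tcp-proxy7001.magru.wmnet,service=gerrit' set/pooled=yes
```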
[21:38:52] oh I meant the MSS alerts
[21:38:54] but yeah
[21:39:11] the MSS alert might just be delayed (sigh) since I can't find anything in the alertmanager
[21:39:13] they look to be for gerrit-lb again
[21:41:17] I am not sure why it says not healthy though
[21:41:18] sukhe@lvs7001:~$ nc tcp-proxy7001.magru.wmnet 29418
[21:41:18] SSH-2.0-GerritCodeReview_3.10.6 (APACHE-SSHD-2.12.0)
[21:41:31] let's pool both in magru?
[21:41:42] the v6 is unhappy
[21:41:54] ahh
[21:41:56] hm
[21:42:36] LISTEN 0 1024 *:29418 *:* users:(("haproxy",pid=696,fd=9))
[21:42:57] do we have to do something silly like have it listen on both 0.0.0.0 and [::]
[21:43:11] except that sukhe@lvs7001:~$ nc -6 tcp-proxy7001.magru.wmnet 29418
[21:43:13] looks happy
[21:43:22] and that's kinda all of the healthcheck anyway?
[21:43:34] maybe
[21:43:36] there is no proxyfetch here (doesn't need to be)
[21:44:10] Jan 14 21:41:52 lvs7001 libericad[1436749]: time=2026-01-14T21:41:52.947Z level=WARN msg="unable to depool due to depool threshold enforcement" service=gerrit-sshlb6_29418 hostname=tcp-proxy7002.magru.wmnet address=2a02:ec80:700:103:10:14>
[21:44:16] so that confirms why it says depool-blocked
[21:44:19] not why it can't reach it though
[21:45:19] this is where I vaguely recall something wrong with the magru hosts related to v6
[21:46:08] but yeah let's look at this tomorrow now, other than the alert, nothing really is breaking
[21:46:28] isn't v4 ssh unhealthy too?
[21:46:43] gerrit-sshlb_29418:
[21:46:45] 10.140.2.11 1 healthy: false | pooled: depool-blocked
[21:46:46] yep, it is. I only remember a v6 issue though, during provisioning
[21:46:47] 10.140.2.10 1 healthy: false | pooled: depool-blocked
[21:46:49] yeah
[21:47:06] we haven't validated that haproxy can receive ip-tunnelled ssh yet :)
[21:47:10] tcp-proxy haproxy, that is
[21:47:23] anyway, stopping for now
[21:47:25] thanks again!
[21:47:26] yep :)
[21:48:18] we can change the LVS state to service_setup for the gerrit related alerts if we want to silence them
[21:48:21] I will check later
[21:50:17] hmm are the instances ready to receive IPIP traffic?
[21:50:34] I can check that tomorrow morning
[21:52:11] https://gitlab.wikimedia.org/-/snippets/107 should do it
[21:53:48] nice. please go offline no
[21:53:49] w
[21:58:10] if ipip0 and ipip60 are there
[21:58:15] vgutierrez: buying some beer for you in Lisbon :D
[21:58:20] you're probably missing the fw rules
[21:58:43] ahhh
[21:58:50] to allow inbound traffic from the IP space we use for IPIP
[21:59:09] check ncredir puppetization
[21:59:21] * vgutierrez going to sleep now
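[Editor's note] Per vgutierrez's closing hints, IPIP readiness on the tcp-proxy hosts comes down to two things: the tunnel interfaces exist, and the firewall admits encapsulated traffic from the load balancers. A rough sketch (the interface names come from the chat; the nft grep is an assumption, since the actual rules live in the ncredir puppetization):

```sh
# 1. Tunnel interfaces for v4 and v6 encapsulation present?
ip link show ipip0 && ip link show ipip60
# 2. Firewall permitting encapsulated traffic in? A heuristic only (assumes an
#    nftables ruleset; protocol 4 is ipencap, 41 is IPv6 encapsulation):
sudo nft list ruleset | grep -Ei 'ipencap|l4proto (4|41)'
```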